[PDF] Purifying selection, drift and reversible mutation with arbitrarily high mutation rates

Abstract

Full PDF

1 Purifying selection, drift and reversible mutation with arbitrarily high mutation rates

Authors:

Brian Charlesworth * and Kavita Jain § Affiliation: * Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, United Kingdom § Theoretical Sciences Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Jakkur P.O., Bangalore 560064, India Running head : Population genetics of hyperdiversity

Keywords : Hyperdiversity, mutation rate, epigenetic variation, site frequency spectrum, purifying selection, rate of substitution

Corresponding Author : Brian Charlesworth Institute of Evolutionary Biology, School of Biological Sciences University of Edinburgh Ashworth Laboratories King’s Buildings Edinburgh EH9 3JT, UK

Telephone : +44 (0)131 650 5751

Fax : +44 (0)131 650 6564

Email : [email protected] 3

ABSTRACT

Some species exhibit very high levels of DNA sequence variability; there is also evidence for the existence of heritable epigenetic variants that experience state changes at a much higher rate than sequence variants. In both cases, the resulting high diversity levels within a population (hyperdiversity) mean that standard population genetics methods are not trustworthy. We analyze a population genetics model that incorporates purifying selection, reversible mutations and genetic drift, assuming a stationary population size. We derive analytical results for both population parameters and sample statistics, and discuss their implications for studies of natural genetic and epigenetic variation. In particular, we find that (1) many more intermediate frequency variants are expected than under standard models, even with moderately strong purifying selection (2) rates of evolution under purifying selection may be close to, or even exceed, neutral rates. These findings are related to empirical studies of sequence and epigenetic variation. 4 The infinite sites model, originally proposed by Fisher (1922,1930) and developed in detail by Kimura (1971), has been the workhorse of molecular population genetics for four decades. Its core assumption is that any nucleotide site segregates for at most two variants, and that the mutation rate scaled by effective population size ( N e ) is so low that new mutations arise only at sites that are fixed within the population. This assumption facilitates calculations of the theoretical values of some key observable quantities, such as the expected level of pairwise nucleotide site diversity or the expected number of segregating sites in a sample (Kimura 1971; Watterson 1975; Ewens 2004). In the framework of coalescent theory, this implies a linear relation between the genealogical distance between two sequences and the neutral sequence divergence between them, greatly simplifying methods of inference and statistical testing (Hudson 1990; Wakeley 2008). There has recently been some discussion of how to go beyond the infinite sites assumption of a low scaled mutation rate, which breaks down for species with very large effective population sizes, including some species of virus and bacteria, and even eukaryotes such as the sea squirt and outbreeding nematode worms, resulting in “hyperdiversity” of DNA sequence variability within a population (Cutter et al. u per site in a population of N breeding adults, the expected fraction of sites that are segregating in a randomly mating population is f s = θ [ln (2 N ) + 0.6775], where θ = 4 N e u (Ewens 2004, p.298). Thus, with θ = 0.01, a reasonable value for many species (Leffler et al. f s = 0.15 even when N has the implausibly low value of one million. This implies that about 15% of new mutations are expected to arise at sites that are already segregating, suggesting a significant departure from the assumptions of the infinite sites model. (An alternative way of looking at this is to determine the expected number of new mutations that occur at a site while a pre-existing mutation is segregating, which is of a similar magnitude to f s – see Appendix, equation (A1).) In addition, it has been known for nearly twenty years that sufficiently high scaled mutation rates at some or all sites in a sequence can lead to substantial departures from the infinite sites expectations for statistics such as Tajima’s D , which are commonly used to detect deviations from neutral equilibrium caused by population size changes or selection (Bertorelle and Slatkin 1995; Aris-Brosou and Excoffier 1996; Tajima 1996; Yang 1996; Mizawa and Tajima 1997). This is because the occurrence of mutations at sites that are already segregating increases the pairwise diversity among sequences, but does not increase the number of segregating sites (Bertorelle and Slatkin 1995). The analysis of data on DNA sequence variation in hyperdiverse species thus requires methods that deal with this problem, and a number of population genetics models that contribute to this have already been developed (Desai and Plotkin 2008; Jenkins and Song 2011; Cutter et al. et al. et al. et al. et al. et al. et al. et al. et al. Analysis of the Model of Purifying Selection, Drift and Mutation

Basic assumptions

In order to generate simple analytical results, we use a “finite-sites” model that is an extension of the infinite sites model previously used for studies of codon usage bias (McVean and Charlesworth 1999). We assume a randomly mating, diploid, discrete generation population with N breeding adults, and effective population size N e . Over a long sequence of m nucleotide sites, each site has two alternative types, A and A , with mutation rates u and v from A to A and vice versa . A and A might correspond to AT versus GC base pairs, unpreferred versus preferred synonymous codons, or selectively favored versus disfavored nonsynonymous variants. If epigenetic variation is being considered, then A and A could be regarded as the methylated or unmethylated states of a nucleotide site or a differentially methylated region (or vice-versa). This approach, while undoubtedly oversimplified, avoids the problem of modeling mutation among all four basepairs, which is difficult to deal with except by making the unrealistic assumption of equal mutation rates in all directions (Ewens 2004, p.195). If selection is acting, we assume semidominance, with A having a selective advantage s over A when homozygous, although our general conclusions are probably not strongly dependent on this assumption. There is complete independence among sites (i.e. recombination is sufficiently frequent that linkage disequilibrium is negligible), and 8 all evolutionary forces are weak, so that the standard results of diffusion approximations can be employed. If the population is at statistical equilibrium, the mean numbers of sites in each possible state are constant over time, despite continual changes at individual sites. At any given time, some sites are fixed for the A type, some for A , and others segregate for both. Let the equilibrium proportion of sites that are fixed for A and A be f f and f f , respectively. The proportion of sites that are segregating is f s = 1 – f f – f f . Results for some important population parameters

These assumptions allow the use of Wright’s stationary distribution formula (Wright 1931, 1937) to describe the probability density of the frequency q of A at a site φ ( q ) = C exp( γ q ) p α− q β− (1) where p = 1 – q , α = 4 N e u , β = 4 N e v , γ = 2 N e s, and the constant C is such that the integral of φ ( q ) between q = 0 and q = 1 is equal to 1. It is convenient to write u in terms of the mutational bias parameter, κ , i.e. u = κ v , so that α = κβ. An explicit expression for C can be obtained by noting that the integral of the other terms on the right-hand side with respect to q is equal to the product of Γ ( α ) Γ ( β )/ Γ ( α + β ) and the confluent hypergeometric function F ( a , b , z ) (Abramowitz and Stegun 1965, p.503), with parameters a = β , b = α + β , z = γ , where 9 F ( a , b , z ) = z i i ! i = ∞ ∑ ( a ) i ( b ) i (2) where ( x ) =

1, ( x ) i = x (1 + x )(2 + x )...( i − + x ) for i ≥ (Pochhammer’s symbol). This can be seen by expanding the exponential term in equation (1) in powers of γ q , and integrating over the range 0 to 1 (Kimura et al. C = Γ ( α+ β ) Γ ( α ) Γ ( β ) 1 F ( β , α+ β , γ ) (3) Furthermore, the j th moment of q around zero, obtained from the integral of q j φ ( q ) between 0 and 1, is given by M j ( q ) = F ( β+ j , α+ β+ j , γ ) ( β ) j F ( β , α+ β , γ ) ( α+ β ) j (4) In particular, the mean frequency of A is q = + κ ) F ( β+ α+ β+ γ ) F ( β , α+ β , γ ) (5a) and the mean frequency of A is 10 p = κ (1 + κ ) F ( β , α+ β+ γ ) F ( β , α+ β , γ ) (5b) Approximations to these expressions for the case when β is << 1 and κ is of order 1 are derived in the Appendix. Equations (A3) imply that q = + κ exp( −γ )] + O ( β ) (6) The left-hand side of equation (6) is equivalent to the fraction of sites that carry A in a random sequence sampled from the population; if A and A correspond to unpreferred and preferred codons, respectively, this measures the frequency of preferred codons, Fop (McVean and Charlesworth 1999). With epigenetic variation, if A and A correspond to methylated and unmethylated states, q measures the fraction of unmethylated sites or regions in a random genome. The leading term on the right-hand side of equation (6) is identical to the Li-Bulmer equation commonly used in analyses of selection on codon usage (Li 1987; Bulmer 1991). This result is, however, often derived by assuming that nearly all sites are fixed, and calculating the rate of flux between sites fixed for A and A ; q is then taken to be the frequency of sites that are fixed for A , with 1 – q representing the frequency of sites fixed for A (Bulmer 1991). This raises the question of how good an approximation we obtain by neglecting the term of order β , when the infinite sites assumption is violated, so that a significant fraction of sites are in fact segregating for A and A . 11 First, we note that it is immediately obvious from (1) and equations (5) that q with γ = 0 is equal to 1/(1 + κ ), so that equation (6) for this case is exact, as has long been known (Wright 1931). We can also obtain a first-order approximation to equations (A3) when γ ≠ β , which will be accurate when β is sufficiently small. Neglecting second-order and higher terms in β , as will also be done in equations (8), we obtain q ≈ − βκ g exp( −γ )[1 + κ exp(– γ )] (7a) where g = [ γ+ κ exp( −γ ) γ i i ! a i + + γ i i ! ( a i + − a i )] i = ∞ ∑ i = ∞ ∑ [1 + κ exp( −γ )] (7b) and a i is the harmonic series 1 + 1/2 + 1/3 + … + 1/( i –1), with i ≥

2. Since a i +1 < i for i > 1, the first summation in the numerator of g is less than the sum of γ i ( i –1)!, so that the sum is < γ exp( γ ). Similarly, a i +1 – a i = 1/ i , so that the second summation is < exp( γ ) – (1 + γ ). It follows that g is positive and < κ γ + exp ( γ ) – 1. This is multiplied by exp(– γ ) in the numerator of equation (5a), to obtain the multiplicand of βκ , yielding κ γ exp(– γ ) + 1 – exp(– γ ) < κ + 1 – exp(– γ ). The contribution of – βκ g exp(– γ ) to the numerator of equation (7a) is thus negative and smaller in magnitude than βκ (1 + κ) , so that the leading term in equation (6) should provide a good approximation when βκ is around 0.1 or less. 12 For examining what happens when γ becomes very large, it is useful to note that the Taylor’s series expansion of equation (5b) for small β yields the expression p ≈ κ exp( −γ )1 + κ exp( −γ ) { + β [ γ i i ! i = ∞ ∑ + κ exp( −γ )1 + κ exp( −γ ) γ i + ln( i )( i + i = ∞ ∑ ] } (8a) For large γ , this gives p ≈ κ exp( −γ )1 + κ exp( −γ ) [ + β exp( γ ) γ ] (8b) i.e. p ≈ βκγ [1 + O ( γ − )] (8c) The first term on the right-hand site of equation (8c) is equivalent to the asymptotic expression for p with large γ given by Kimura et al. (1963). This implies that, for sufficiently large γ compared with β , the mean frequency of the disfavored variant is equal to its equilibrium frequency under mutation-selection balance with s >> u in an infinite population, where p = 2 u / s = 2 v κ / s (Haldane 1927) , as expected intuitively. Numerical studies show that equation (8b) performs well for γ > 1 when β << 1, when it can give a good approximation when neither the leading term in equation (6) nor the Kimura et al. (1963) large γ approximation are accurate (results not shown). Equation (8b) implies that the leading term in equation (6) is accurate when γ << – ln( β ). Second, the approximate frequencies of sites that are fixed for A and A , f f and f f , can be found from the integrals of φ ( q ) between 0 and 1/(2 N ) and 1– 1/(2 N ) and 1, 13 respectively (Ewens 2004, p.178). For large N , such that γ N –1 << 1, when q is close to zero we have φ ( q ) = C [ q β – 1 + O ( γ N –1 ) + O ( β N –1 )], and so f f ≈ φ ( q ) d q N ) ∫ = C [ β − (2 N ) −β + O ( γ N − ) + O ( β N − )] (9a) f f ≈ φ ( q ) d q − N )1 ∫ = C exp( γ )[( βκ ) − (2 N ) −βκ + O ( γ N − ) + O ( β N − )] (9b) where the terms in O ( γ N –1 ) and O ( β N –1 ) can be neglected when N is sufficiently large (c.f. (Kimura 1981)). Approximations for these expressions for small β can readily be obtained (see Appendix). Third, the expected pairwise nucleotide site diversity, π , can be obtained from the expectation of 2 pq = 2E{ q – q } between 0 and 1. From equation (4), we have E{ q } = β ( β+ α+ β )( α+ β+ F ( β+ α+ β+ γ ) F ( β , α+ β , γ ) (11a) The expectation of E{ q – q } is given by subtracting equation (11a) from equation (5b). Using equation (2) and simplifying, we obtain E{ pq } = α ( α+ β+ + γ i i ! [ ( β+ i ( α+ β+ i − ( β+ i + ( α+ β+ i + ] i = ∞ ∑ [ + κ+ γ+ γ i ( β+ i − i ! ( α+ β+ i − ] j = ∞ ∑ (11b) 14 This is equal to one-half of the expected pairwise diversity per site, π . Using the same approach as for equations (7) and (8), keeping only terms of order β we obtain π≈ βκγ [1 − exp( −γ )][1 + κ exp( −γ )] (12) As expected, this is identical to equation (15) of McVean and Charlesworth (1999) for the infinite sites model at statistical equilibrium, where new mutations arise only at sites that are fixed either for A or for A . When γ >> β , this term converges on the deterministic value under mutation-selection balance, 2 βκ / γ , which corresponds to the diversity expected at deterministic mutation-selection balance with p = 2 u / s (see above). In the case of neutrality, equation (11b) reduces to the following expression (Charlesworth and Charlesworth 2010, p.237) π= βκ (1 + κ )[ β (1 + κ ) + (13) As expected intuitively, the neutral diversity is always less than for the infinite sites model with a given value of β and κ , where equation (12) with γ = 0 gives π = 2 βκ /(1 + κ ) , because some new mutations arise at sites that are already polymorphic; π approaches 2 κ /(1 + κ ) for large β , which is the value for an infinite population at equilibrium under reversible mutation between A and A . 15 Rate of substitution along a lineage

The rate of substitution of new mutations along a lineage can be modeled as follows. Conditioning on a frequency q of the A variant at a site in a given generation, there is an expected number of 2 Nv κ q mutations per site from A to A , and 2 Nvp from A to A . The corresponding probability that the deleterious variant A eventually becomes fixed, conditional on p , is Q ( p ) = [exp( γ p ) – 1]/[exp( γ ) – 1] (Kimura 1962). Conditioning on this event, the probability that a new A mutation has been fixed is 1/(2 Np ). The expected number of new A mutations that become fixed is thus equal to v κ p –1 q Q ( p ). Similarly, the conditional probability that A eventually becomes fixed is Q ( q ) = [1 – exp(– γ q )]/[1 – exp(– γ )]; the net expected number of new A mutations that become fixed is v pq –1 Q ( q ). (At first sight, it would seem that this procedure cannot be applied to mutations arising in the fixed classes, and that these should be treated separately, but the argument given in the Appendix shows that it provides an accurate approximation for the situation as well.) Integrating over all values of q , the net rate at which new mutations enter the population and become fixed is thus λ= v [ κ p − qQ ( p ) + pq − ∫ Q ( q )] φ ( q ) d q (14) The terms involving functions of p and q in the integrand are 16 p − qQ ( p ) φ ( q ) = C [exp( γ p ) − γ ) −

1] exp( γ q ) p α − q β (15a) pq − Q ( q ) φ ( q ) = C [1 − exp( −γ q )][1 − exp( −γ )] exp( γ q ) p α q β − (15b) The corresponding integrals are I = { exp( γ ) − F ( β+ α+ β , γ ) F ( β , α+ β , γ ) } β ( α− γ ) − (15c) and I = { F ( β− α+ β , γ ) − F ( β , α+ β , γ ) } α ( β− − exp( −γ )] (15d) Note that F ( β – 1 , α + β , γ ) – 1 has a factor of β – 1, so that the term in β – 1 in the denominator of equation (15d) cancels. At first sight, equation (15c) appears to have a singularity at α = 1. However, by using the relation F ( a , b , z ) = F ( b – a , b , z ) exp( z ), we find that F ( β + 1 , α + β , γ ) = F ( α – 1 , α + β , γ ) exp( γ ), so that the numerator of equation (15c) contains a factor of α – 1, which cancels the term in the denominator. The net rate of substitution is given by λ = v ( κ I + I ) (16) 17 As γ approaches zero, equations (15) and (16) imply that λ tends to 2 v κ /(1 + κ ); this is independent of the population size and is identical to the infinite sites expression with neutrality at statistical equilibrium under reverse mutation (Charlesworth and Charlesworth 2010, p.274), as expected from the fact that the equilibrium neutral substitution rate is equal to the net mutation rate for any class of mutational model (Kimura 1968). When α and β are sufficiently small, the main contributions to λ come from the two fixed classes, so that the initial frequencies of the new mutations can be equated to 1/(2 N ), when o ( β ) terms in I and I are neglected. Using the above result that the frequencies of the fixed classes are equal to the infinite sites values multiplied by a factor 1 – O ( β ), the infinite sites expression is for the case of selection is recovered, neglecting higher order terms in β (Equation 6.11 of (Charlesworth and Charlesworth 2010, p.275) ). Again, this implies that, as expected, the infinite sites model provides a good approximation for the rate of substitution with sufficiently small β . There are two different ways in which we can determine the ratio of the value of λ with γ > 0 to that for a neutral standard, thereby removing the dependence on the mutation rate term in equation (16). First, λ with selection can be compared to its value at statistical equilibrium with the same value of α and β . This would be appropriate for comparing rates of evolution at putatively neutral sites in a given genomic region with sites that are potentially under purifying selection, without making any corrections for differences in base composition; this is often done when comparing nonsynonymous and synonymous rates of substitution across different genes by statistics such as K A /K S . Second, λ with selection can be compared with the neutral rate conditioned on the same 18 mean frequencies of A and A along the sequence as for the selected sites; this corresponds to methods that compare probabilities of substitution between the same pairs of nucleotides in contexts when these are putatively selected versus putatively neutral (Halligan et al. et al. Numerical results for the population parameters

Numerical results generated from the above formulae are presented in Figures 1 and 2. Figure 1 illustrates the dependence of the following variables on the scaled mutation rate ( β ) and the scaled intensity of selection ( γ ), assuming a mutational bias ( κ ) of 2 towards the deleterious variant at a site): the mean frequency per site of the deleterious variant A ( p ), the expected diversity ( π ), the expected proportion of sites that segregate for variants ( f s ), and the above two measures of the rate of substitution relative to neutral expectation. Figure 2 illustrates the dependence of p and π on β at a finer scale, for different values of κ and γ . For clarity, the infinite sites values for p and the relative rates of substitution are not shown; with selection, the infinite sites values for these parameters are close to their values when β = 0.002. With neutrality, the exact value of p is always equal to the infinite sites value, and is independent of β for a given value of κ . With selection and low β (0.002 or 0.02), it can be seen that agreement with the infinite sites predictions is quite good for both these values despite the fact that the proportion of sites that are segregating can be quite substantial with β = 0.02; the second-order approximation of equations (7) gives very close agreement even for β = 0.2 with weak selection, but diverges for β > 0.2 when γ > 0.5 (results not shown), whereas the value 19 of p departs quite seriously from the infinite sites values at β = 0.2 when γ > 5. A similar pattern of departure from the infinite sites value holds for π , except when selection is strong ( γ = 5 or 50), when agreement is still good at β = 2; this is because the exact diversity and the infinite sites value both approach the deterministic value under mutation-selection balance when 1 << γ and β << γ (see equation 12). Somewhat surprisingly, the infinite sites and exact values of the proportion of sites that are segregating always agree well. Perhaps the most interesting result to emerge is that, with κ > 1, the rate of substitution relative to neutral expectation can exceed one when there is moderate selection and mutational bias towards the deleterious variants. This has long been known to apply to the infinite sites model when the “uncorrected” relative rate is used and when there is mutational bias (Eyre-Walker 1992; McVean and Charlesworth 1999), which can cause serious problems for phylogenetic inferences concerning selective constraints (Lawrie et al. β , the corrected relative rate can exceed one, even for γ = 5, and can be only just below one for lesser values of β . The reason for this seemingly paradoxical result is presumably the fact that nearly all sites are segregating when β is high; when p is sufficiently high because mutation and drift are overcoming selection, there is a substantial chance that a new mutation to the favorable variant A can arise at a segregating site, which has a higher chance of fixation than a neutral variant and hence contributes to an elevated substitution rate. With sufficiently strong mutational bias, p can be much greater than ½, so that the contribution from the enhanced fixation 20 probability of favorable mutations outweighs the lower contribution from the fixation of deleterious mutations. As was previously shown by McVean and Charlesworth (1999) for the infinite sites model, the equilibrium diversity with selection can also considerably exceed the neutral equilibrium value with the same mutational parameters, when there is a mutational bias towards deleterious alleles (see also Kondrashov et al. (2006)). For example, in Figure 1, with γ = 5 and β = 2, π = 0.43 but is 0.38 for the neutral case; with β = 20, and γ = 50 the values are 0.49 and 0.44, respectively. In this case, there is no meaningful way of correcting for differences in base composition between the neutral and selected sites when there are substantial departures from the infinite sites assumption, since the diversity in the neutral case is not related to the mean allele frequency in a simple way. Properties of a sample from a population

This raises the question of the extent to which the properties of a sample of alleles from a population are affected by deviations from the infinite sites assumption. With the above model, the probability that a sample of n alleles segregates for k A variants at a site and n – k copies of A can be obtained from the corresponding binomial distribution with parameter q , integrated over φ ( q ), and takes the form p ( k ) = nk ⎛⎝⎜ ⎞⎠⎟ C exp( γ q ) (1 − q ) α+ n − k − q β+ k − ∫ d q (17a) 21 where C is given by equation (3) (McVean and Charlesworth 1999; Desai and Plotkin 2008). Using the properties of the confluent hypergeometric function, this yields p ( k ) = nk ⎛⎝⎜ ⎞⎠⎟ F ( β + k , α + β + n , γ ) ( β ) k ( α ) n − k F ( β , α + β , γ ) ( α + β ) n (0 < k < n ) (17b) p (0) = F ( β , α + β + n , γ ) ( α ) n F ( β , α + β , γ ) ( α + β ) n (17c) p ( n ) = F ( β + n , α + β + n , γ ) ( β ) n F ( β , α + β , γ ) ( α + β ) n (17d) The proportion of sites that are observed to be segregating is p seg = − p (0) − p ( n ) (17e) The conditional site frequency spectrum (SFS for segregating sites can be obtained by dividing equation (17b) by (17e). The folded SFS for segregating sites (which describes the numbers of variants of either type at frequencies 1 up to 0.5 n + 1 ( n odd) or 0.5 n ( n even) can also readily be obtained. Equations (17) can readily be used to obtain the theoretical values of standard sample statistics, such as the diversity per site ( π ) (Tajima 1983), Watterson’s θ w = p seg /a n (Watterson 1975) and Tajima’s D (Tajima 1989b), using the standard formulae for these 22 quantities. A well-known problem with Tajima’s D is the fact that its magnitude is strongly dependent on both the level of variability in the population and on the length of sequence used to estimate it (Tajima 1989b). Langley et al. (2014) proposed the use of the summary statistic Δ π = ( π – θ w )/ θ w for measuring the extent of departure of the SFS from the infinite sites neutral equilibrium expectation, which should not suffer from these problems. Another summary statistic for this purpose is provided by the proportion of singleton variants among segregating sites, given by p sn = [ p (1) + p ( n − p seg (17f) (This is closely related to the widely used D statistic of Fu and Li (1993).) Use of the series expression for the confluent hypergeometric function allows rapid computation of all relevant statistics; to avoid overflow when γ is large, however, it is necessary to use logarithms of the individual terms and partial sums of the series (this requires the selection model to be defined such that γ > 0). A FORTRAN program is available on request to BC. Table 1 displays some examples of such computations, for the case of a mutational bias of 2 towards deleterious mutations, for a subset of the parameter values used in Figure 1. The expected π values are not shown, since these are the same as the population diversities given in Figure 1. Figure 3 show the folded SFSs for some chosen examples, using a sample size of 20 alleles. It can be seen that a high β value (20) means that the proportion of sites that are found to be segregating ( p seg ) is effectively 100%, even for γ as high as 50 and a sample size ( n ) of 20. A moderate β value (0.2) behaves 23 similarly in the neutral case with a sample size of 200, but otherwise is associated with a p seg of less than 80% (and is as low as 13% for n = 20 and γ = 50). With neutrality or weak selection ( γ ≤ β cause a distortion of the SFS towards a much lower proportion of singletons ( p sn ) and higher Tajima’s D and Δ π than is expected with the infinite sites model. Even for γ = 50, a very low p sn and a positive D are found when β = 20. This reflects the tendency of high β values to push the distribution of q towards intermediate frequencies, which has long been known (Wright 1931) . Some analytical approximations for p seg are derived in File S1. Discussion

The results described above have some important implications for the interpretation of data on DNA sequence variation and evolution when there is “hyperdiversity”, i.e., the scaled mutational parameter ( β in the notation used here) is sufficiently large that the infinite sites model does not accurately describe patterns of variation within populations. Recent surveys of DNA sequence polymorphisms show that that such hyperdiversity is more common than previously thought, even in multicellular organisms (Cutter et al. Arabidopsis thaliana and maize that epigenetic variants such as methylated cytosines can be transmitted fairly stably through meiosis, but have origination and disappearance rates that are several orders of magnitude higher than those of nucleotide variants (Johannes et al. et al. et al. et al.

Distortion of the SFS with hyperdiversity

As was pointed out about twenty years ago in the context of human mitochondrial DNA sequence variability (Bertorelle and Slatkin 1995; Aris-Brosou and Excoffier 1996; Tajima 1996; Yang 1996), a major effect of a high scaled mutation rate ( β in the notation used here) is that more intermediate frequency variants will be present at polymorphic sites in a sample from a population than under the equilibrium infinite sites model. In particular, for a stationary population at equilibrium between drift and the input of neutral or nearly neutral mutations, the expected values of Tajima’s D statistic ( D T ) and the Δ π statistic proposed by Langley et al. (2014) are positive rather than slightly negative or zero, respectively, as expected under the infinite sites model (Tajima 1989) – see Figure 1 and Table 1). This reflects the fact that the expected value of the pairwise diversity per site ( π ) is greater than the expected value of the measure of diversity based on the number of segregating sites at a locus ( θ w ). As can be seen from Table 1, this effect is quite noticeable even for β as low as 0.02 when selection is absent or weak, and small positive values of D T and Δ π are found with neutrality even when β = 0.002 (of the order of 1% with n = 200). With very high values of β , a positive Tajima’s D can occur even with quite strong purifying selection (a scaled selection parameter γ of 50) can be associated with (Table 1). A site frequency frequency spectrum (SFS) with an excess of intermediate frequency variants at loci across the genome is usually interpreted as indicating a recent 25 population bottleneck or a subdivided population, e.g. Staedler et al. (2009). False positive results for tests for bottlenecks and/or subdivision may thus be obtained if infinite sites rather than finite sites models are applied to hyperdiverse populations or epigenetic variation, even when moderately strong purifying selection is acting. Given that very small positive mean values across sites of statistics such Δ π can be statistically significant with genomic scale data and large sample sizes, caution should be exercized in using infinite sites predictions for such datasets. The suggested criterion for hyperdiversity of π or θ w of 5% for using finite sites models rather than the infinite sites model (Cutter et al. Caenorhabditis sp.5 by Cutter et al. (2012), where the within-population diversity at synonymous sites is about 0.08, Tajimas’ D values for “scattered” samples (where one allele per locus was sampled from each of 13 locations, in order to minimize departure from the standard coalescent process (Wakeley 2000)) were nearly all positive, with a mean of 0.28. This is consistent with the coalescent simulations of Cutter et al. (2012), who used the SIMCOAL2 program of Laval and Excoffier (2004) with a finite-sites model with equal mutation rates among all four possible nucleotides (A. Cutter and L. Excoffier, pers. comm.). The model used here gives an expected value of Tajima’s D of approximately 0.10 with γ = 0 or 0.5 and a mutational bias of 2, assuming a sample size of 13 and 150bp per locus (corresponding approximately to the numbers of synonymous sites in the study). At least qualitatively, this species thus fits the expectation under hyperdiversity for DNA sequence variability. 26 In contrast, the synonymous SFS in the much more hyperdiverse species C. brenneri is biased towards low frequency variants, with a mean Tajima’s D of –0.56 over 23 loci with a average of approximately 150bp per locus (Dey et al. Arabidopsis thaliana (Schmitz et al. et al.

A. thaliana, because the SFS for SNPs is far less biased towards rare variants (Schmitz et al.

C. brenneri . The second possibility is that purifying selection is sufficiently strong to skew the SFS towards rare variants. This seems unlikely in the case of

C. brenneri , where the estimates of the overall γ for synonymous sites suggests a value close to 0.5 (Dey et al. A. thaliana example, since high levels of methylation of 27 cytosines are non-randomly distributed across the genome, and are especially prevalent in transposable element sequences where methylation is important for their silencing (Schmitz et al.

A. thaliana (Becker et al. et al.

Limitations of the biallelic model

It is important to note that the biallelic model used here, which is similar to that used by Bertorelle and Slatkin (1995) and Desai and Plotkin (2008), is likely to underestimate the effect of hyperdiversity on the SFS, since the presence of more than two variants at a segregating site will result in higher π but not θ w . On the other hand, the infinite alleles 28 assumption apparently used by Aris-Brosou and Excoffier (1996) means that the upper limit to π is 1, whereas in reality there is a maximum of four segregating variants per site, leading to an upper limit to π of 3/4 (when all four variants are present at equal frequencies), as opposed to 1/2 for the biallelic model used here. Given the almost universal existence of mutational biases towards transitions versus transversions, and for GC to AT versus AT to GC mutations, the upper limit is in practice likely to be considerably smaller than 1, so that the biallelic model with modest mutational bias probably provides a reasonably good guide to the values of measures of skew in the SFS. An intermediate situation is provided by assuming a K -allele model (Ewens 2004 pp.192-200) with K = 4, corresponding to equal mutation rates among all 4 nucleotide states at site (Tajima 1996; Yang 1996; Desai and Plotkin 2008). Under neutrality the exchangeability of the different nucleotides under this model means that the probability density φ ( q i ) for the frequency q i of a variant of type i ( i = 1 – 4) is proportional to (1 − q i ) θ− q i ( θ /3) − , where θ is the net mutation rate per site, i.e. φ ( q i ) follows a beta distribution with parameters θ and θ /3 (Tajima 1996). With semi-dominant selection with type i having a selective advantage s over all other variants, which are assumed to be selectively equivalent to each other, this expression is simply multiplied by exp( γ q i ). Following Tajima (1996), these assumptions allow simple analytical formulae for the sample statistics used above to be obtained for the case of neutrality: π = θ / [1 + (4 θ / 3)] , p seg = − [ S n − ( θ / 3) / S n − (4 θ / 3)] , and p sn = n θ S n − ( θ ) / P seg S n − (4 θ / 3) , where S k ( x ) = (1 + x )(2 + x )...( k + x ) . These can be compared with the statistics obtained from the biallelic model in Table 1, setting θ to the equilibrium infinite sites neutral diversity with reverse mutation 2 βκ /(1 + κ ) = 4 β/3 (with 29 κ = 2) to obtain comparable net scaled mutation rates per site. As expected, for very low θ, the two models yield similar results, but even with β = 0.02 the 4-allele model gives noticeably higher expected values of Tajima’s D and Δ π ; e.g. with a sample size of 20 and β = 0.02, the values of Tajima’s D and Δ π are 0.069 and 0.022, respectively, versus 0.038 and 0.016 for the biallelic model. With a sample size of 20 and β = 0.2, the values of Tajima’s D and Δ π for the 4-allele model are 0.61 and 0.20, respectively, compared with 0.32 and 0.11 for the biallelic model; values of D and Δ π much greater than twice the biallelic values can be generated by the 4-allele model when β is large, reaching 4.7 and 1.6, respectively, with β = 20. The proportion of singletons behaves rather differently under the 4-allele model; it can even increase with β up to some upper limit, after which it declines, and is always higher than for the biallelic model (e.g. 0.36 versus 0.22, respectively, for β = 0.2 and n = 20; 0.17 versus 0.01 for β = 20). This behavior presumably reflects the fact that there are four possible variants at each site that can behave as singletons in the case of the 4-allele model, and the above formula simply sums over the probabilities that each one of these is a singleton, regardless of the status of the other three possible variants at the same site. A statistic such as Δ π is thus probably a better summary of the skew of the SFS than the proportion of singletons when a substantial fraction of polymorphic sites segregate for more than two variants, unless variants are collapsed into biallelic alternatives such as GC versus AT basepairs. For studying situations with multiple alleles per nucleotide and non-equilibrium demography, numerical methods such as that of Zeng (2010) will be needed. 30 Some other implications

One difficulty with interpreting the results of population surveys of epiallelic variation is that it is impossible to know whether sites that lack epigenetic marks in all individuals sampled are potentially capable of acquiring them. This means that the denominator in per-site statistics such as π and θ w is unknown, making it hard to apply standard population genetics methods to this kind of data. Fortunately, however, with high β values (> 0.2), nearly all sites capable of mutation will be found to be segregating in a large sample, even with a scaled selection strength as high as γ = 5 (Table 2); thus, the majority of sites capable of epimutations can be identified from population surveys, unless strong purifying selection is acting. Population surveys could, therefore, be a valuable tool for the characterization of the epigenome. Another finding that is relevant for both hyperdiverse DNA sequence variation and hypermutable epigenetic variation is the fact that substitution rates for sites under purifying selection may be close to, or even greater than rates at neutral sites with high β values. As described above, this may occur even after corrrecting for the effects of differences in base composition between neutral and selected sites (Figs 1 and 2). This lack of sensitivity of substitution rates to the strength of purifying selection is consistent with the patterns described by Cutter et al. (2013), where there is only a weak relation between codon usage bias and a measure of synonymous site divergence in the hyperdiverse species Ciona savigni . Similarly, diversity at sites subject to weak purifying selection is expected to show a non-linear pattern of relationship with γ , such that π increases with γ when sites are close to neutral, and then declines again with as γ approaches or exceeds 1; the range of γ values over which there is an increase is broader 31 for large β (Figure 2). Synonymous diversity of genes in C. brenneri does indeed show a quadratic relation with the frequency of optimal codons, such that genes with approximately 50% optimal codons have the highest diversity values (Asher Cutter, personal communication). With γ values typical of those reported from studies of selection on codon usage ( γ = 1 or less), the standard Li-Bulmer equation (Li 1987; Bulmer 1991) tends to overestimate the expected level of codon bias, as measured by the mean frequency of the favored allelic type ( q ), when β > 0.02. For example, with γ = 1 and β = 0.2, the exact value of q from equation (5a) is 0.49 compared with the Li-Bulmer infinite sites prediction of 0.58, while the second-order approximation from equations (7) gives 0.44. Analyses of codon usage in hyperdiverse species that use codon usage data to estimate γ , (see Sharp et al. (2010)), should probably use the exact expression. It is interesting in this context to note that there is there is only a small difference in the mean level of codon usage bias between C. brenneri and

C. remanei , despite an approximately three-fold difference in synonymous site diversity (Asher Cutter, personal communication). This raises the question of whether the purifying selection model used here is appropriate for codon usage, or whether a model of stabilizing selection (Kimura 1981) is more realistic, since the latter means that γ is insensitive to N e over a wide range of parameter values, provided that there is mutational bias (Charlesworth 2013). The behavior of this model with hyperdiversity is, therefore, worth studying. 32 Sample versus population distributions; reversible versus irreversible mutational models

Lawrie et al. (2013) have proposed that, for the irreversible mutation model with large n (see below), it is computationally more efficient to replace the sampling formula for p ( k ) by multiplying the probability density φ ( q ) by 1/ n , i.e. d q is equated to 1/ n , to obtain the probability of obtaining an allele frequency q = k / n in the sample. The corresponding procedure for the reversible mutation model yields p pop ( k = qn ) = C exp( γ q )(1 − q ) α− q β− n − (18) This relation can be used to obtain statistics analogous to those described by equations (17b) to (17f), using summation over all values of k between 1 and n – 1 to obtain statistics such as the proportion of segregating sites, the expectations of the diversity measures π and θ w , and the proportion of singletons. Table 2 displays the results of computations using this formula for n =200. These show that the use of the population distribution instead of the exact sampling distribution overestimates the above statistics, unless selection is extremely strong ( γ >> 50). Tajima’s D , on the other hand, is usually underestimated, reflecting the overestimation of the proportion of singletons. These results can be compared with the approach of Lawrie et al. (2013), by noting that, for strong selection and small β , the reversible mutation model converges on the irreversible mutation model, which assumes that mutations are all from A to A , so that the relevant scaled mutation rate is α (the convergence can never be exact, since the irreversible mutation model cannot achieve true stationarity). 33 The results for γ = 50 and 500 in Table 2 should thus correspond closely to those for the irreversible mutation model, especially for β = 0.002. This was confirmed by direct calculation of the properties of the irreversible mutation model. The sample SFS was determined using the series expansion derived by Welch et al. (2008, equation 8), which is analogous to equations (17), replacing their θ with α . A similar series expansion can be obtained for the SFS and proportion of segregating sites generated from the population distribution (see Appendix). Comparisons of the SFSs for segregating sites for the reversible and irreversible mutation models shows close agreement between the two when β = 0.002, even for γ as low as 5; with β ≥ γ = 5 the proportions of singletons using the sample SFS are 0.244 and 0.226 for the reversible model with β = 0.002 and 0.02, respectively, compared with 0.248 for the irreversible model. The population distributions give very similar values. The proportions of segregating sites from the sample distribution for the reversible model with γ = 5 are 0.016 and 0.146 for β = 0.002 and 0.02, respectively, compared with 0.016 and 0.158 for the irreversible model; the corresponding population values for the reversible model are 0.029 and 0.201 for β = 0.002 and 0.02, respectively, while the corresponding values for the irreversible model are 0.032 and 0.315. In general, for β = 0.02, the irreversible model tends to give a larger discrepancy between the sample and population distribution values of the proportion of segregating sites than that seen under the reversible model. It seems, therefore, that methods such as those of Lawrie et al. (2013), which make use of the proportion of segregating sites to estimate the intensity of selection, will be substantially biased by using the population distribution rather than the sample distribution, especially with the 34 irreversible model, even when the SFSs for the two distributions are very similar. Indeed, since computationally efficient algorithms for generating the SFS for samples in the case of a stationary population are readily implemented, it is unnecessary to use the short-cut proposed by Lawrie et al. (2013). Acknowledgments

We thank Deborah Charlesworth, Asher Cutter and Kai Zeng for their comments on the manuscript, and Asher Cutter and Laurent Excoffier for valuable discussions.

Literature Cited

Abramowitz, M., and I. A. Stegun, 1965

Handbook of Mathematical Functions . Dover Publications, New York, NY. Aris-Brosou, S., and L. Excoffier, 1996 The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. BioL. Evol. et al. , 2011 Spontaneous epigenetic variation in the

Arabidopsis thaliana methylome. Nature

Elements of Evolutionary Genetics . Roberts and Company, Greenwood Village, CO. Cutter, A. D., R. Jovelin and D. Dey, 2013 Molecular hyperdiversity and evolution in very large populations. Mol. Ecol.

Caenrhabditis brenneri . Proc. Nat. Acad. Sci. USA

Mathematical Population Genetics. 1. Theoretical Introduction . Springer, New York. Eyre-Walker, A., 1992 The effect ofconstraint on the rate of evolution in neutral models with biased mutation. Genetics

Drosophila . Gen. Res. et al. , 2009 Assessing the impact of transgenerational epigenetic variation on complex traits. PLOS Genetics e1000530. 37 Kimura, M., 1962 On the probability of fixation of a mutant gene in a population. Genetics In press . Lauria, M., S. Piccinni, R. Pirona, G. Lund, A. Viotti et al. , 2014 Epigenetic variation, inheritance, and parent-of-origin effects of cytosine methylation in maize (

Zea mays ). Genetics D. melanogaster . PLoS Genetics e1003527. Leffler, E. M., K. Bullaughey, D. R. Matute, W. K. Meyer, L. Ségurel et al. , 2012 Revisiting an old riddle: what determines genetic diversity levels within a species? PLoS Biology e1001388. Li, W.-H., 1987 Models of nearly neutral mutations with particular implications for non-random usage of synonymous codons. J. Mol. Evol. e87655. Schmitz, R. J., and J. R. Ecker, 2012 Epigenetic and epigenomic variation in Arabidopsis thaliana . Trnds. Plant. Sci. et al. , 2011 Transgenerational epigenetic instability is a source of novel methylation variants. Science et al. , 2013 Patterns of population epigenomic diversity. Nature

Coalescent Theory. An Introduction.

Roberts and Company, Greenwood Village, CO. Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. Appendix

The expected number of new mutations that arise at a segregating site

Assume that we have a nucleotide site that is segregating for a neutral mutation that arose at an initial frequency of 1/(2 N ). Let the probability that this variant mutates to an alternative nucleotide be u per generation (this includes the possibility that it reverts to the ancestral state); let the probability that the ancestral variant mutates to another state be v (this includes the possibility that the mutation is identical in state to the variant that is already segregating). If the frequency of the mutation in the population in a given generation is x , the expected total number of mutational events is 2 N [ ux + v (1 – x )]. The expected time that the original mutation spends in the frequency interval x to x + d x is given approximately by 4 N e /(1 – x ) for 0 < x ≤ N ), and 2 N e /( Nx ) for 1/(2 N ) < x ≤ NN e { ux + v (1 − x )](1 − x ) N ) ∫ d x + [ ux + v (1 − x )] Nx N )1 ∫ d x } ≈ N e [ u + v ln(2 N )] (A1) Approximations to equations (5) with small α and β Equation (5a) is equivalent to q = [ + γ i ( β+ i i ! ( α+ β+ i ] i = ∞ ∑ [ + κ+ γ+ γ i ( β+ i + i !( α+ β+ i + ] i = ∞ ∑ (A2) 42 We can write terms of the form ( β + i – j )/( α + β + i – j ) as 1 – [ βκ /( i – j )] + O ( β ); keeping only O ( β ) terms, we have q ≈ {1 + γ i i ! [1 − βκ ( i + − j )]} j = i ∏ i = ∞ ∑ {1 + κ+ γ+ γ i i ! [1 − βκ ( i − j )]} j = i − ∏ i = ∞ ∑ (A3a) or q ≈ {1 + γ i i ! exp( −βκ a i + )} i = ∞ ∑ {1 + κ+ γ+ γ i i ! exp( −βκ a i )} i = ∞ ∑ (A3b) where a i = 1 + 1/2 + 1/3 + … 1/( i –1) ( i ≥ O ( β ), yielding equation (6) of the text. Approximations for the frequencies of the fixed classes

Assuming that α << 1 and β << 1, and employing the approximations used in equation (A3b), we find that f f ≈ Γ ( α+ β ) Γ ( α ) Γ ( β ) { + κ [1 + κ+ γ+ γ i i ! exp( −βκ a i )] i = ∞ ∑ + o ( β ) } β − (2 N ) −β [1 + O ( γ N − ) + O ( β N − )]

43 (A4a) Similarly, f f ≈ Γ ( α+ β ) exp( γ ) Γ ( α ) Γ ( β ) { + κ [1 + κ+ γ+ γ i i ! exp( −βκ a i )] i = ∞ ∑ + o ( β ) } ( βκ ) − (2 N ) −βκ [1 + O ( γ N − ) + O ( β N − )] (A4b) We can use the fact that Γ (1 + x ) = x Γ ( x ) to write Γ ( x ) = Γ (1 + x )/ x . For small x , the representation of the gamma function as an infinite product (Abramowitz and Stegun 1965) implies that Γ ( x ) (1 – c x ) + o ( x ), where c ≈ Γ ( x ) for small x by 1/ x , and the term involving gamma functions in equations (9) is then κβ /(1 + κ )[1 + O ( β)] . Using a similar approximation to that used in equations (7), and neglecting higher-order terms in β , we obtain f f ≈ κ exp( −γ )(2 N ) − β [1 + κ exp( −γ )] { + βκ h exp( −γ )[1 + κ exp( −γ )] } [1 + O ( γ N − ) + O ( β N − )] (A4c) f f ≈ (2 N ) − βκ [1 + κ exp( −γ )] { + βκ h exp( −γ )[1 + κ exp( −γ )] } [1 + O ( γ N − ) + O ( β N − )] (A4d) where 44 h = γ i i ! a ii = ∞ ∑ (A4e) The higher-order terms in β vanish when γ = 0, suggesting that these expressions are good approximations when β and γ are both small. More rigorously, for finite i , a i is less than some constant A , which is approximately equal to ln( i – 1). If terms in i > k can be neglected in the sum that defines h , h < ln( k ) exp ( γ ), so that h exp(– γ ) < βκ ln( k ), where ln( k ) is a small multiple of one unless γ is very large. In addition, for arbitrary γ , the terms involving (2 N ) – β in equations (A4c) and (A4d) are equal to 1– β ln(2 N ) + O ( β ) and 1– βκ ln(2 N ) + O ( β ), respectively. Provided that ln(2 N ) is of order one, f f and f f are each equal to their respective infinite sites value, multiplied by a factor 1 – O ( β ), implying that the infinite sites values provide a good approximation unless β >> 0. Fixations of mutations

Consider first the case of A to A mutations that arise at a site that was initially fixed for A . We approximate the frequency of this fixed class, f f , by the integral in equation (9b). The fixation probability of an A mutation with initial frequency 1/(2 N ) when N is large is γ (2 N ) -1 [exp( γ ) – 1] -1 + O [ γ (cid:0) (2 N ) –2 ], so that the net number of new A mutations that arise in a given generation and are expect to become fixed is 2 N κ v f f { γ /(2/ N )[exp( γ ) – 1] -1 + O [ γ (cid:0) (2 N ) –2 ] } = kv f f { γ [exp( γ ) – 1] -1 + 2 N O [ γ (cid:0) (2 N ) –2 ]}. Using the same approximation for Q , and the fact that q is close to one in equation (9b), the corresponding formula from equations (14) and (15a) is 45 κ v Q − N )1 ∫ ( p ) p − q φ ( q ) d q = κ v { γ [exp( γ ) − − φ ( q ) d q − n )1 ∫ + O [ γ (2 N ) − ] } (A5) Provided that 2 N is sufficiently large in relation to γ , so that the higher order terms in γ (2 N ) –1 can be ignored, the two results are equivalent. The following argument can be used for the other end of the frequency range. In this case, there is no contribution from the class fixed for A mutations (frequency f f , as given by equation (9a)) to the fixation of new A mutations. The corresponding formula from equations (16) and (17a) is κ v Q N ) ∫ ( p ) p − q φ ( q ) d q = κ v (2 N ) − {1 + O [(1 + γ )(2 N ) − ]} (A6) Again, provided that 2 N is sufficiently large in relation to γ , the two results are equivalent. Parallel arguments can be used for the fixation of new A mutations. The relative rate of substitution under the infinite sites assumption

At equilibrium between mutation, drift and selection, the frequencies of sites fixed for A and A under the infinite sites model are approximated by κ exp(– γ )/[1 + κ exp(– γ )] and 1/)/[1 + κ exp(– γ )], respectively (Li 1987; Bulmer 1991; McVean and Charlesworth 1999). Averaging over the contributions from mutations arising at each class of fixed 46 sites, taking into account their respective fixation probabilities, the equilibrium rate of nucleotide substitution is then λ ( γ ) = κ v γ [1 + κ exp( −γ )][exp( γ ) − (A7a) (Charlesworth and Charlesworth 2010, p.275). If we consider neutral mutations arising at fixed sites with the same frequencies of A and A variants as the selected sites (i.e., with the same base composition), the substitution rate is λ (0) = κ v [1 + exp( −γ )][1 + κ exp( −γ )] (A7b) The ratio R ( γ ) = λ ( γ )/ λ (0) gives the rate of substitution of selected mutations relative to neutral expectation, conditioning on the same base composition; we have R ( γ ) = γ [exp( γ ) − exp( −γ )] (A8) It is easily seen that R = 1 at γ = 0, and decreases as γ increases. Population distribution statistics for the irreversible mutation model

47 The proportion of segregating sites in a sample of size n can be estimated by integrating the probability density of the frequency p of the deleterious allele A (Wright 1938), from 1/ n to 1 – 1/ n . In the present case, this density can be written as ψ ( p ) = α [ p − + q − ][exp − ( γ p ) − exp( −γ )][1 − exp( −γ )] (A9) The unconditional population distribution SFS is obtained by multiplying this by 1/ n , similarly to equation (18). By expanding the integrals of p –1 exp(– γ p ) and q –1 exp(– γ p ) as power series, and collecting terms, the proportion of segregating sites is given by ln( n ) + [1 − exp( −γ )] γ i ( i + i ! i = ∞ ∑ [( − i − exp( −γ )] (A10) Alternatively, the unconditional population distribution SFS can be summed from 1/ n to 1 – 1/ n ; numerical studies show that the two procedures give almost identical results. The conditional SFS is then obtained by normalising the unconditional SFS by the proportion of segregating sites. 48 Table 1 Sample statistics for the reversible mutation model ( κ = 2) n = 20 n = β p seg p sn D T Δ π p seg p sn D T Δ π γ = 0 γ = 0.5 γ = 5 γ = 50 p seg is the proportion of sites that are segregating, p sn is the proportion of singletons among segregating sites in a sample of size n , D T is the expected mean Tajima’s D for a sequence of 450bp, Δ π = ( π – θ w )/ θ w , where θ w = p seg / a n and a n = 1+ ½ +….1/( n –1). All these statistics were calculated using equations (17). Table 2 Comparisons of statistics derived from the sample distribution versus the population distribution p seg p sn π θ w D T Sample Popn. Sample Popn. Sample Popn. Sample Popn. Sample Popn. β = 0.002 γ = 0 0.0156 0.0288 0.169 0.170 0.0026 0.0049 0.0026 0.0049 0.007 0.007 γ = 0.5 0.0167 0.0311 0.171 0.172 0.0028 0.0053 0.0028 0.0053 –0.010 –0.012 γ = 5 0.0159 0.0291 0.245 0.246 0.0016 0.0029 0.0027 0.0050 –0.752 –0.832 γ = 50 0.0068 0.0134 0.492 0.512 0.0002 0.0003 0.0011 0.0019 –1.260 –1.470 γ = 500 β = 0.02 γ = 0 0.142 0.196 0.159 0.161 0.0251 0.0345 0.0242 0.0333 0.089 0.082 γ = 0.5 γ = 5 γ = 50 γ = 500 κ =2, n = 200. See Table 1 for explanation of the meaning of the column headings. 51 Figure Legends Figure 1.

The vertical bars are the values (in percentages) of the mean frequency of A , p , (red), π from equation (11b) (blue), π as given by the infinite sites model (black), the proportion of segregating sites from equation (17e) (white), the proportion of segregating sites under the infinite sites model (pink), the ‘uncorrected’ rate of substitution relative to neutrality (light blue), and the ‘corrected’ rate of substitution relative to neutrality (green). Figure 2.

The curves are the values (in percentages) as functions of β for the mean frequency of A , p , (red, dashed), π from equation (11b) (blue, full), π as given by the infinite sites model (green, full), the ‘uncorrected’ rate of substitution relative to neutrality (black, dashed), and the corrected’ rate of substitution relative to neutrality (pink, dashed). Figure 3.

The vertical bars are the values (in percentages) of the probabilities of finding the minor allele in a sample of 20 at the frequencies indicated on the x axis, for different values of β and γ . 52 Figure 1 β (cid:1) γ = 0 γ = 0.5 (cid:0) γ = 5 (cid:0) γ = 50 (cid:0) Figure 2 γ = 0 γ = 1κ = 4 (cid:1) Figure 3

Frequency of minor allele γ = 0.5 (cid:1) γ = 5.0 (cid:1) γ = 50 (cid:1) γ = 0 (cid:1)(cid:1)