[PDF] A model for the clustered distribution of SNPs in the human genome

Abstract

Motivated by a non-random but clustered distribution of SNPs, we introduce a phenomenological model to account for the clustering properties of SNPs in the human genome. The phenomenological model is based on a preferential mutation to the closer proximity of existing SNPs. With the Hapmap SNP data, we empirically demonstrate that the preferential model is better for illustrating the clustered distribution of SNPs than the random model. Moreover, the model is applicable not only to autosomes but also to the X chromosome, although the X chromosome has different characteristics from autosomes. The analysis of the estimated parameters in the model can explain the pronounced population structure and the low genetic diversity of the X chromosome. In addition, correlation between the parameters reveals the population-wise difference of the mutation probability. These results support the mutational non-independence hypothesis against random mutation.

Full PDF

aa r X i v : . [ q - b i o . GN ] M a y A model for the clustered distribution of SNPs in thehuman genome

Chang-Yong Lee The Department of Industrial and Systems Engineering, Kongju National University,Cheonan, 330-717, South Korea

Abstract

Motivated by a non-random but clustered distribution of SNPs, we introducea phenomenological model to account for the clustering properties of SNPs inthe human genome. The phenomenological model is based on a preferentialmutation to the closer proximity of existing SNPs. With the Hapmap SNP data,we empirically demonstrate that the preferential model is better for illustratingthe clustered distribution of SNPs than the random model. Moreover, the modelis applicable not only to autosomes but also to the X chromosome, although theX chromosome has diﬀerent characteristics from autosomes. The analysis ofthe estimated parameters in the model can explain the pronounced populationstructure and the low genetic diversity of the X chromosome. In addition,correlation between the parameters reveals the population-wise diﬀerence of themutation probability. These results support the mutational non-independencehypothesis against random mutation.

Keywords: single nucleotide polymorphism, mutation, human genome,probability distribution, Hapmap

1. INTRODUCTION

The most common type of genetic variants in the human genome is the singlenucleotide polymorphism (SNP), which, as a result of mutation, has a diﬀerence [email protected] Preprint submitted to Computational Biology and Chemistry June 30, 2018 n a single nucleotide within a population of samples [1]. SNP data, togetherwith gene expression and other biological information, are an important resourceto answer various biological questions regarding the genetic variation, such asthe mutational pattern of the genome, the phylogenetic classiﬁcation, and theassociation with phenotype data.In recent years, as the cost of genotyping has dropped dramatically duemainly to the advance in the genotyping technology [2, 3], much eﬀort has beenput into the identiﬁcation of SNPs in the human genome [4, 5]. Notably, theInternational HapMap project [4] (hereafter, Hapmap) is an international eﬀortto identify the genetic variation in the human genome to develop a haplotypemap. Although Hapmap includes some datasets on the copy number variation,SNP data are the main resource not only for understanding and characterizingthe diﬀerences in genome structure but for association studies with diseasesand/or environmental factors.It has been known that SNPs in the human genome are not distributedrandomly but clustered across the genome [6, 7, 8, 9, 10]. This clustering prop-erty suggests that mutations tend to occur not randomly but preferentiallyto the proximity of existing mutations. In addition to the interpretation ofthe clustering as the reﬂection of mutational hotspots [11], clustered SNPs canemerge in various ways. Natural and balancing selections can modulate localvariability and tend to create regions of increased variability that results in non-randomness [12]. A high variance of genes within a population of samples inthe time to the most recent common ancestor causes diﬀerent recombinationrates in a chromosome [13]. It was also proposed that microsatellites can alsogenerate mutational biases in their ﬂanking regions by expansion and erosionfrom the perspective of microsatellite evolution [14, 15, 16]. Clustered SNPscan also arise from ascertainment biases in the SNP discovery process [17]. Ex-amples include the SNP identiﬁcation based on maximally dissimilar sequences,the usage of not enough samples, and ﬁnding all possible SNPs not on a wholegenome but on a given region of a chromosome.However, when SNP clusters are found throughout a whole genome with a2arge number of samples from diﬀerent global populations, it is unlikely that theobserved clusters would be due to ascertainment biases. Thus, as pointed out inRef. [6], the majority of SNP markers along a whole genome should reﬂect theunderlying mutation pattern. In this respect, a non-random mutation processwas proposed and tested against the random mutation by generating a semi-realistic population of chromosomes from stochastic computer simulations thatimplements the concept of ‘the sphere of inﬂuence’ [6]. As millions of SNPs ona whole genome are now available in public domains, the mutation pattern canbe systematically investigated.In this paper, we propose a probabilistic model for the clustered distri-bution of SNPs. The proposed model assumes non-independent mutations inwhich subsequent mutations occur not randomly but preferentially to near mu-tated sites. Within the model, SNP clusters could form mainly through a non-negligible tendency of the mutation process in the closer proximity of existingSNPs. The proposed model was tested against Hapmap SNP data and the pro-posed model was conﬁrmed as suitable to explain empirical SNP distributionsof the human genome. We also tested the proposed model against the randommutation model in which all mutations occur independently, and we conﬁrmthat the proposed model explains the distribution more appropriately than therandom model.As the X chromosome is a haploid in males, its SNP distribution may havecharacteristics diﬀerent from the distributions of the autosomes. With the esti-mated parameters in the proposed model, we characterized the clustered SNPdistributions obtained from diﬀerent chromosomes, including the X chromo-some. Whereas the proposed model is valid irrespective of the ploidy (i.e.,either diploid or haploid), our analysis of estimated parameters accounts forthe characteristics, such as the pronounced population structure and the lowmutation rate, speciﬁc to the X chromosome.3 . MATERIALS AND METHODS

We use the genome-wide Hapmap SNP data of Phase III, which consists of1,440,616 SNPs in 1184 reference individuals from 11 global ancestry groups ofthree continental regions. The data are publicly available and can be down-loaded at http://hapmap.ncbi.nlm.nih.gov/. To investigate the population-speciﬁc diﬀerences, we extracted SNPs that are polymorphic within each ofa single population. For each of 11 global populations, we analyzed SNP dis-tributions on 22 autosomes and the X chromosome. Generally, the number ofSNPs depends on the number of samples, the chromosome, and the genetic di-versity of the population. Thus, the number of identiﬁed SNPs may ﬂuctuatewith populations and chromosomes.Table 2.1 shows the number of SNPs identiﬁed in each population summedover all chromosomes and each chromosome averaged over 11 global populations.The number of SNPs for each population in Table 2.1 roughly illustrates theregional diﬀerence in the genetic diversity. The populations that originatedfrom Africa (ASW, LWK, MKK, YRI) have a larger number of SNPs thanother continental regions, indicating a higher genetic diversity. On the otherhand, the population from Asia (CHB, CHD, JPT) have a smaller number ofSNPs and a lower genetic diversity than others. The clustering property of SNPscan be represented by taking the ordered locations of all SNPs and examininghow proximate they are located. We quantify the proximity of SNPs in termsof the SNP space, which is deﬁned as the number of nucleotides between twoadjacent SNPs in their ordered locations. Speciﬁcally, let ℓ i be the location, inthe number of nucleotides counting from 5 ′ of a sequence, of the i th SNP in achromosome. Then, the i th SNP space s i is deﬁned as s i ≡ ℓ i +1 − ℓ i , i = 1 , , · · · . (1)4 able 1: The number N of SNPs in each of single population (left two columns) and in eachchromosome averaged over 11 populations (right four columns). The population names areabbreviated and the full names can be found in Ref. [4]. Note that the number of SNPs in eachsingle population and the total number of SNPs of all chromosomes are less than 1,440,616SNPs obtained from all global populations. pop. name N chr. N chr. N ASW 1,399,533 1 102,848 13 46,752CEU 1,269,095 2 103,994 14 40,899CHB 1,181,090 3 86,600 15 37,884CHD 1,173,514 4 77,006 16 39,392GIH 1,260,550 5 79,132 17 33,734JPT 1,154,331 6 82,545 18 36,802LWK 1,366,422 7 67,937 19 23,092MEX 1,324,625 8 67,576 20 32,426MKK 1,385,579 9 57,444 21 17,557TSI 1,266,622 10 66,389 22 17,826YRI 1,340,306 11 63,333 X 41,40812 61,202 total 1,283,778

An SNP is the result of a mutation and it is commonly assumed that themutation arises randomly in DNA sequence. Under this assumption, we testedthe hypothesis that SNPs are distributed randomly in a sequence. In the randommutation model, mutations occur independently of each other with a constantprobability β of 0 < β <

1. With the random model, the probability distributionof the SNP space deﬁned in Eq. (1) can be derived as follows. Suppose that anSNP is found at the location ℓ k . Then, the probability of ﬁnding a subsequentSNP at the location ℓ k +1 = ℓ k + s is given as p ( s | β ) = (1 − β ) ( s − β , for s = 1 , , · · · (2)5 -3 -2 -1 R e l a t i v e f r equen cy SNP space, s (bp) chr_1 chr_6 chr_11 chr_16 chr_21

Figure 1: Empirical frequency distributions of SNP space for diﬀerent chromosomes of ASW.The dotted line represents a geometric distribution with an arbitrary slope. The vertical axisis in a log scale.

Note that Eq. (2) is the probability mass function of a geometric distribu-tion [18]. The geometric distribution is the discrete analogue of the exponentialdistribution and has a property of being memoryless. The distribution is oftenused for modeling the number of trials until the ﬁrst success, in our case, anSNP.By taking a logarithm on the both sides of Eq. (2), we getln p ( s | β ) = ln(1 − β ) s + ln β − β . (3)This illustrates that ln p ( s | β ) is linear in s with ln(1 − β ) < β can be estimated, for example, bythe maximum likelihood estimation (MLE). Thus, if mutations occur randomly,we expect that ln p ( s | β ) should be a straight line in s with a negative slope ofln(1 − β ).In Fig. 1, we plot the empirical distribution of the SNP spacing s for diﬀerent6hromosomes of ASW. A comparison with the random model reveals that theempirical distributions do not follow a geometric distribution (the dotted line),especially for small values of s in which the major deﬁciency of the model occurs.In particular, Fig. 1 shows that the probability increases sharply and non-linearly as s becomes small which suggests that the SNPs are clustered. Thistendency of the distribution is more or less independent of the chromosome.This reﬂects that SNPs are distributed not randomly, but clustered. From thesewe can infer that the random mutation hypothesis is inadequate to explain theclustering property of SNPs. The clustered SNPs imply that when a mutation occurs, another mutationis more likely to occur as they are closer in their locations. This suggests thatthe mutation probability is not independent of the location but dependent onhow close the mutations are in their locations. This non-independent mutationcan be modeled after the mutation probability being inversely proportional tosome power of the separation in nucleotides between two consecutive mutations.Formally, given that a mutation occurs at the location ℓ k , the probability of thenext mutation at ℓ k + d can be expressed as r ( d ) = λd α (4)with 0 < λ < α ≥ λ and α , to be estimated.The λ is the probability of two mutations occurring consecutively in genomiclocations, and the α is the strength of the mutational non-independence. Thelarger is α , the smaller mutation probability as d increases. In particular, when α = 0, the proposed model reduces to the random model of Section 2.2.It should be noted that the proposed model is a phenomenological modelto account for the observed clustered SNP distributions, rather than a modelbased on molecular mechanism. That is, the phenomenological model doesnot implement any biological mechanism of mutation, such as the heterozygote7nstability. Rather, this model is based on the fact that if the mutation doesnot occur randomly, there must be some dependence on the SNP space.With the proposed model, the probability of SNP space s is given, for s =2 , , · · · , as p ( s | λ, α ) = (cid:18) − λ α (cid:19) (cid:18) − λ α (cid:19) · · · (cid:18) − λ ( s − α (cid:19) λs α = λ Q s − k =1 ( k α − λ )( s !) α . (5)Or, by taking a logarithm of Eq. (5), we haveln p ( s | λ, α ) = s − X k =1 ln( k α − λ ) + ln λ − α ln( s !) . (6)As the right hand side of Eq. (6) is intractable analytically and even computa-tionally, we resort to an approximate expression.It is known that there is one SNP in roughly every 1,000 base pairs. Thus,we assume λ ≪ k α − λ ) around λ = 0 to the ﬁrst order in λ toget ln( k α − λ ) ≈ α ln k − k − α λ . (7)Using an integral approximation of s − X k =1 k − α ≈ Z s k − α dk = s − α − − α , (8)and the Stirling’s formula of ln( s !) ≈ s ln( s ) − s , Eq. (6) can be expressedapproximately asln p ( s | λ, α ) ≈ λ − α (cid:0) − s − α (cid:1) + ln λ − α ln s . (9)To show that the probability of s in the random model [Eq. (3)] is a special caseof Eq. (9), we expand Eq. (3) to the ﬁrst order in β to getln p ( s | β ) ≈ β (1 − s ) + ln β . (10)We can easily show that Eq. (9) becomes Eq. (10) when α = 0.The parameters λ and α in the proposed model can be estimated, for ex-ample, by MLE. Under the assumption that empirical SNP space data ~s =8 P r obab ili t y , l n p ( s ) SNP space, s chr_1 chr_11 chr_X

Figure 2: Plots of empirical distributions of SNP space by using SNPs on ASW for the chromo-some 1 (boxes), the chromosome 11 (circles), and the chromosome X (triangles), together withtheir distributions obtained by the proposed model (solid lines). The dotted line representsthe distribution of the chromosome 11 obtained by the random model with the parameter β estimated by MLE. The distributions of the chromosomes 11 and X are shifted vertically forthe display purpose, and the vertical axis is in a log scale. { s , s , · · · , s n } are independent and identically distributed samples, we con-sider the log-likelihood L ( ~s | λ, α ). It is given as L ( ~s | λ, α ) = λ − α ( n − n X i =1 s − αi ) + n ln λ − α n X i =1 ln s i . (11)With the log-likelihood and MLE, we can estimate the parameters of λ and α by solving two equations, ∂L∂λ (cid:12)(cid:12)(cid:12)(cid:12) ˆ λ = 0 and ∂L∂α (cid:12)(cid:12)(cid:12)(cid:12) ˆ α = 0 , (12)simultaneously using, for instance, an iterative method [19].Although the phenomenological model that we used in this study is muchsimpler and unrealistic than the heterozygote instability model, the phenomeno-logical model is intractable mathematically unless we make approximations.9 .0015 0.0020 0.0025 0.0030 0.00350.180.200.220.240.260.280.30 X-chromosome C C C C Figure 3: Plots of estimated α versus λ for all 253 chromosomes (23 chromosomesfrom each of 11 global populations). The values for the X chromosomes from 11global populations are plotted by ﬁlled boxes. The pairs of ( α , λ ) for 22 auto-somes are clustered into four clusters. They are: C = { ✷ ) , (cid:13) ) } , C = { ✷ ) , △ ) , ▽ ) , (cid:13) ) , ✸ ) } , C = { ✸ ) , △ ) , (cid:13) ) , ✷ ) , ▽ ) , ⊳ ) } , and C = { ∗ ) , ⊲ ) , ⊳ ) , (cid:13) ) , ▽ ) , ⋆ ) , △ ) , ✷ ) , ✸ ) } . Thus, more realistic model such as the heterozygote instability model is noteasy to implement mathematically and/or analytically. This is one of reasonswhy any biologically plausible process hardly be modeled mathematically.

3. RESULTS AND DISCUSSION

Using Hapmap SNP data, we estimate two parameters λ and α for all chro-mosomes of each single population. With the estimated parameters, we presenta few ln p ( s | λ, α ), together with the corresponding empirical distributions inFig. 2. As shown in Fig. 2, the model distribution ﬁts well to the empiricaldistribution, especially to small s of non-linearly increasing region represent-ing clustered SNPs. Moreover, Fig. 2 demonstrates that the proposed model10 .0015 0.0020 0.0025 0.0030 0.00350.120.140.160.180.200.220.240.260.28 α λ (a) λ α (b) Figure 4: (a) Plot of estimated α versus λ for all chromosomes averaged over 11 globalpopulations. The vertical and horizontal error bars are the standard deviations of α and λ , respectively. (b) Plot of estimated α versus λ averaged over 23 chromosomes for eachsingle population: African populations (circles), Asian populations (boxes), and the otherpopulations (triangles). The vertical error bars are the standard deviations of α and thestandard deviations of λ are omitted as they are too large to display. The dotted linesrepresent linear ﬁts with the least square error. is equally applicable not only to the autosomes but to the X chromosome. Wealso see that the proposed model is more adequate than the random model (dot-ted line in Fig. 2). This result supports, at least partially if not entirely, thenon-independent mutation hypothesis.As the X chromosome is haploid in males, it has smaller population size thanthe autosome has. Thus, we may expect that the clustering property of SNPson the X chromosome may diﬀer from that on the autosomes [6]. A consequenceof a smaller population size of the X chromosome is a low abundance in SNPs.It is known that the nucleotide mutation rate in females is much lower thanin males [20]. The smaller population size and low mutation rate result in thegenetic diversity of the X chromosome about a half of that of the autosomes [21].11hus, we expect that SNPs on the X chromosome to be about half abundancethat of the autosomes [6, 21]. From the clustering perspective, this implies thatSNPs on the X chromosome are less clustered than those on the autosomes inthe sense that the average of SNP space s over the X chromosome is largerthan that over the autosomes. That is, larger s is more probable on the Xchromosome than on the autosomes. This characteristic can also be seen by thevisual inspection of Fig. 2.To measure the abundance in SNPs, we ﬁt the empirical distribution oflog p ( s ) in terms of a linear function (i.e., the random model of Eq. (3)) in s sothat the slope ln(1 − β ) of the ﬁtted function approximately quantify the degreeof clusteredness. We estimated the slope for all chromosomes, and found thatthe slopes were in the range of [ − . × − , − . × − ] for the autosomesand − . × − for the X chromosome. The fact that the slope for the Xchromosome is less steep than that for the autosomes quantitatively supportsthe low abundance and fewer clusters of SNPs on the X chromosome.Based on the ﬁnding that the proposed model can explain the clusteringproperty of SNPs, we further analyze the estimated parameters of the proposedmodel. Figure 3 plots estimated α versus λ using SNP data on 23 chromosomesfrom each of 11 global populations. From Fig. 3, we see that the parametersfor the X chromosome exhibits a larger variation across 11 global populationsthan those for the autosomes. Being a haploid, the smaller population size ofthe X chromosome has a faster genetic drift than the autosomes. Consequently,the population structure (or population stratiﬁcation) of the X chromosomeshould be more pronounced than that of the autosomes [21]. Given that thepopulation structure implies the presence of a systematic diﬀerence in allelicfrequencies between global populations, we expect that chromosomes havingpronounced population structure should have higher dissimilarity in their allelebetween populations, which, in turn, induces higher variability in clustered SNPdistribution. Thus, the comparatively large variation of the X chromosomeacross the populations is evidence of its pronounced population structure.We also clustered autosomes according to their estimated parameters of12 α, λ ) by using the k-means algorithm [22] with the number of clusters k = 4,suggested by the method for determining k [23]. The result of cluster analysis of22 autosomes are shown in Fig. 3. The cluster analysis result demonstrates thatchromosomes in the same cluster (for instance, chromosomes 9 and 16 in C )have similar degree of clustered SNPs in terms of their probability distributionof the SNP space.From the ﬁnding that λ ≪ α <

1, we can further approximate Eq. (9) toinvestigate the mutational non-independence. By expanding Eq. (9) to the ﬁrstorder in α , it can be approximated asln p ( s | λ, α ) ≈ λ (1 − s ) + ln λ + (1 − s + s ln s ) αλ − α ln s . (13)One way to express the mutational non-independence part is the diﬀerencebetween logarithmic probabilities of Eq. (13) and Eq. (10). By identifying β = λ ,we have ln ∆ ≡ ln p ( s | λ, α ) − ln p ( s | β ) = ( λ − λs + λs ln s ) α − α ln s . (14)Note that the non-random part depends not only on the parameters α and λ , butalso the SNP space s . Since λ ≪ s . When s ≫

1, wehave λs ≈ O (1), and ln ∆ ≈ − α . This indicates that the non-random part isinsensitive to s for s ≫

1, which implies that the mutational non-independenceis negligible. Whereas, when s ≈ O (1), we have ln ∆ ≈ − α ln s . This suggeststhat as s becomes smaller, the non-random part increases, which is consistentwith the results that can be found, for instance, in Fig. 1.To investigate the chromosome and population speciﬁc characteristics, werepresent the estimated parameters that are averaged over 11 global popula-tions for each chromosome in Fig. 4 (a), and those that are averaged over 23chromosomes for each single population in Fig. 4 (b). We found that the es-timated parameters have smaller variation across the chromosomes [Fig. 4 (a)]than across the populations [Fig. 4 (b)]. This implies that the clustering prop-erty depends more on the chromosome than on the population. In particular,13mong the variations across chromosomes, Fig. 4 (a) shows that the chromo-some X has higher variability than the autosomes whose cause was discussedbefore.Aside from the deviations, we uncovered that the two parameters are pos-itively correlated over all autosomes, whereas they are negatively correlatedover the global populations. The mutation probability of Eq. (4) increases (ordecreases) as λ (or α ) increases. This implies that, when the parameters arepositively correlated, the increase in λ is compensated for by the increase in α , and vice-verse. Thus, the positive correlation over autosomes demonstratesthat the mutation probability stays more or less the same for all autosomes.On the other hand, the negative correlation over the global populationsillustrates that the mutation rate depends on the estimated values of λ (or α )of the population. That is, as λ increases, a negatively correlated α decreases;as a result, the mutation probability of Eq. (4) increases. Thus, the larger the λ (or the smaller α ) with a single population, the higher its mutation probability.It turns out that the populations that originated from Africa have a larger λ than those from Asia as indicated in Fig. 4 (b).

4. SUMMARY AND CONCLUSION

In this study, we proposed a non-random mutation model to explain the clus-tered distribution of SNPs in the human genome. The proposed model takesinto account the dependence of the mutation probability on the space betweentwo adjacent SNPs. The proposed model was tested against Hapmap data byderiving the SNP space distribution from the model. The probability distri-bution derived from the proposed model was in good agreement with empiricaldistributions. Furthermore, the proposed model was comprehensive in the sensethat it explains the clustered distribution of SNPs not only on the autosomesbut on the X chromosome. This suggests that the observed SNP distributionsof all chromosomes are diﬀerent realization of the same model. We also showedthat the proposed model was more adequate than the random mutation model.14he parameters in the proposed model are used to demonstrate the chromosome-and population-wise diﬀerent characteristics of the SNP distribution. From thefact that the X chromosome is haploid in males, the X chromosome has a morepronounced population structure and a lower abundance in SNPs than the auto-somes. These characteristics were understood by a comparison of the parametersfor the X chromosome with those for the autosomes. The correlation between λ and α can be used to illustrate the dependence of the mutation probabilityand genetic diversities on the diﬀerent populations.This study introduces a phenomenological model for the clustered SNP dis-tribution in the human genome. Further investigations, such as the feature ofevolutionary transience of SNP clusters and phylogenetic reconstruction, aredesirable and should be carried out as a future direction of study. Acknowledgments

This work was supported by the Korea Research Foundation Grant fundedby the Korean Government (MOEHRD) (KRF-2015018708).

References [1] Barreiro,L. et al (2008) Natural selection has driven population diﬀerentia-tion in modern humans,

Nat. Genet. , , 340-345.[2] Metzker,M. (2010) Sequencing technologies-the next generation, Nat. Rev.Genet. , , 31-46.[3] LaFramboise,T. (2009) Single nucleotide polymorphism arrays: a decade ofbiological, computational and technological advances, Nucleic Acids Res. , ,4181-4193.[4] International HapMap Consortium (2005) A haplotype map of the humangenome, Nature , , 1299-320; International HapMap Consortium (2007) Asecond generation human haplotype map of over 3.1 million SNPs, Nature ,15 , 851-861; The International HapMap 3 Consortium (2010) Integratingcommon and rare genetic variation in diverse human populations, Nature , , 52-58.[5] The 1000 Genomes Project Consortium (2010) A map of human genomevariation from population-scale sequencing, Nature , , 1061-1073; The 1000Genomes Project Consortium (2012) An integrated map of genetic variationfrom 1,092 human genomes, Nature , , 56-65.[6] Amos,W. (2010) Even small SNP clusters are non-randomly distributed: isthis evidence of mutational non-independence?, Proc. R. Soc. B , , 1443-1449.[7] Tenaillon,M.I. et al (2008) Apparent mutational hotspots and long distancelinkage disequilibrium resulting from a bottleneck, J. Evol. Biol. , , 541-550.[8] Koboldt,D. et al (2006) Distribution of Human SNPs and Its Eﬀect on High-Throughput Genotyping, Hum. Mutat. , , 249-254.[9] Hellmann,I. et al (2005) Why do human diversity levels vary at a megabasescale?, Genome Res. , 1222-1231.[10] Lindblad-Toh,K. et al (2000) Large-scale discovery and genotyping ofsingle-nucleotide polymorphisms in the mouse, Nat. Genet. , , 381-386.[11] Rogozin,I.B. and Pavlov,Y.I. (2003) Theoretical analysis of mutationhotspots and their sequence context speciﬁcity, Mut. Res. , , 65-85.[12] Bubb,K.L. et al (2006) Scan of human genome reveals no new loci underancient balancing selection, Genetics , , 2165-2177.[13] Eriksson,A. et al (2002) Clustering of SNPs along a chromosome: can theneutral model be rejected? (http://arxiv.org/abs/physics/0207024)[14] Vowles,E.J. and Amos,W. (2004) Evidence for widespread convergent evo-lution around human microsatellites, PLoS Biol , , 1157-1167.1615] Webster,M.T. and Hagberg,J. (2007) Is there evidence for convergent evo-lution around human microsatellites?, Mol Biol Evol , , 1097-1100.[16] Varela,M.A. et al (2008) Heterogeneous nature and distribution of inter-ruptions in dinucleotides may indicate the existence of biased substitutionsunderlying microsatellite evolution, J Mol Evol , , 575-580.[17] Kuhner,M.K. et al (2000) Usefulness of single nucleotide polymorphismdata for estimating population parameters, Genetics , , 439-447.[18] Pitman,J. (1993) Probability. Springer Publishers, New York.[19] Kelley,C. (1995) Iterative Methods for Linear and Nonlinear Equations.SIAM, Philadelphia.[20] Li,W. et al (2002) Male-driven evolution, Curr. Opin. Genet. Dev. , ,650-656.[21] Schaﬀner,S. (2004) The X chromosome in population genetics, Nat. Rev.Genet. , , 43-51.[22] Hartigan,J.A. and Wong,M.A. (1979) Algorithm AS 136: A K-Means Clus-tering Algorithm, J Roy Stat Soc C , , 100-108.[23] Charrad,M. et al (2014) NbClust: An R Package for Determining the Rel-evant Number of Clusters in a Data Set, J Stat Softw ,61