Inferring population history with DIYABC: a user-friendly approach to Approximate Bayesian Computation
Jean-Marie Cornuet, Filipe Santos, Mark A. Beaumont, Christian P. Robert, Jean-Michel Marin, David J. Balding, Thomas Guillemaud, Arnaud Estoup
BIOINFORMATICS
Vol. 00 no. 00 2008, Pages 1–8
Department of Epidemiology and Public Health, Imperial College, St Mary's Campus, Norfolk Place, London W2 1PG, UK
Centre de Biologie et de Gestion des Populations, INRA, Campus International de Baillarguet, CS 30016, 34988 Montferrier-sur-Lez, France
School of Biological Sciences, Lyle Building, The University of Reading, Whiteknights, Reading RG6 6AS, UK
CEREMADE, Université Paris-Dauphine, Place Delattre de Tassigny, 75775 Paris cedex 16, France
INRIA Saclay, Projet select, Université Paris-Sud, Laboratoire de Mathématiques (Bât. 425), 91400 Orsay, France
UMR 1301 I.B.S.V. INRA-UNSA-CNRS, 400 Route des Chappes, BP 167, 06903 Sophia Antipolis cedex, France
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Summary: Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information, but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations, typically involving a small number of samples. We propose here a computer program (DIYABC) for inference based on Approximate Bayesian Computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIYABC can be used to compare competing scenarios, estimate parameters for one or more scenarios, and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes key methods used in the program and presents its main features. The analysis of one simulated and one real data set, both with complex evolutionary scenarios, illustrates the main possibilities of DIYABC.
Availability: The software DIYABC
Supplementary material:
Contact: [email protected] ∗ to whom correspondence should be addressed Until now, most literature and software about inference in popu-lation genetics concern simple standard evolutionary scenarios: asingle population (Griffiths and Tavar´e, 1994; Stephens and Don-nelly, 2000; Beaumont, 1999), two populations exchanging genes(Hey and Nielsen, 2004; De Iorio and Griffiths, 2004) or not(Hickerson et al. , 2007) or three populations in the classic admixturescheme (Wang, 2003; Excoffier et al. , 2005). The main exception toour knowledge is the computer program
BAT W ING (Wilson etal. , 2003) which considers a whole family of scenarios in whichan ancestral population splits into as many subpopulations as nee-ded. However, in practice, population geneticists collect and analysesamples which rarely correspond to one of these standard scenarios.If they want to apply methods developed in the literature and forwhich computer programs are available, they have to select subsetsof samples (to fit these standard situations), at the price of loweringthe power of the analysis. The other solution is to develop their ownsoftware, which requires specific skills or the right collaborators.Rare examples of inference in non standard scenarios can be foundin O’Ryan et al. (1998) including 3 populations and two successivedivergences, or Estoup et al. (2004) (10 populations that sequenti-ally diverged with initial bottlenecks and exchanging migrants withneighbouring populations).Inference in complex evolutionary scenarios can be performed invarious ways, but all are based on the genealogical tree of sampledgenes and coalescence theory. A first approach used in programssuch as IM (Hey and Nielsen, 2004) or BAT W ING consists of c (cid:13) Oxford University Press 2008. a r X i v : . [ q - b i o . P E ] S e p ornuet et al. starting from a gene genealogy compatible with the observed dataand exploring the parameter and genealogy space through MCMCalgorithms. One difficulty with this approach is to be sure thatthe MCMC has converged, because of the huge dimension of theparameter space. With a complex scenario, the difficulty is incre-ased. Also, although not impossible, it seems quite challenging towrite a program that would deal with very different scenarios. Asecond approach pioneered by Beaumont (2003) consists in com-bining MCMC exploration of the scenario parameter space with anImportance Sampling (IS) based estimation of the likelihood. 
The strength of this approach is that the low number of parameters ensures a (relatively) fast convergence of the MCMC. Its weakness is that the likelihood is only approximated through IS, sometimes resulting in poor acceptance rates.

When dealing with complex situations, the two previous approaches raise difficulties which mainly stem from the computation of the likelihood. Consequently, a line of research including the works of Tavaré et al. (1997), Weiss and von Haeseler (1998), Pritchard et al. (1999) and Marjoram et al. (2003) developed a new approach, termed Approximate Bayesian Computation (or ABC) by Beaumont et al. (2002). In this approach, the likelihood criterion is replaced by a similarity criterion between simulated and observed data sets, similarity usually being measured by a distance between summary statistics computed on both data sets. Among the examples of inference in complex scenarios given above, all but one (the simplest) have used this approach, showing that it can indeed solve complex problems. The ABC approach presents two additional features that can be of interest for experimental biologists. One characteristic, already noted by Excoffier et al. (2005), is the possibility to assess the bias and precision of estimates for simulated data sets produced with known values of parameters at little extra computational cost. To get the same information with likelihood-based methods would require a huge amount of additional computation whereas, with ABC, the largest proportion of the computation used for estimating parameters can be recycled in a bias/precision analysis. The second feature is the simple way in which the posterior probability of different scenarios applied to the same data set can be estimated (e.g. Miller et al., 2005; Pascual et al., 2007).

In its current state, the ABC approach remains inaccessible to most biologists because there is not yet a simple software solution. Therefore, we developed the program DIYABC, which performs ABC analyses on complex scenarios, i.e. scenarios including any number of populations and samples (samples possibly taken at different times), with populations related by divergence and/or admixture events and possibly experiencing changes of population size. The current version is restricted to unlinked microsatellite data. In this article, we describe the rationale for some methods involved in the program. Then we give the main features of DIYABC and we provide two complete example analyses performed with this program to illustrate its possibilities.
Inference about the posterior distribution of parameters in an ABC analysis is usually performed in three steps (see Figure S1 in Supplementary material). The first one is a simulation step in which a very large table (the reference table) is produced and recorded. Each row corresponds to a simulated data set and contains the parameter values used to simulate the data set and summary statistics computed on the simulated data set. Parameter values are drawn from prior distributions. Using these parameter values, genetic data are simulated as explained in the next section. The summary statistics correspond to those traditionally used by population geneticists to characterize the genetic diversity within and among samples (e.g. number of alleles, genic diversity and genetic distances). The idea is to extract maximum genetic information from the data, admitting that exhaustivity or sufficiency are generally out of reach. The simulation step is generally the most time-consuming step, since the number of simulated data sets can reach several millions. The second step is a rejection step. Euclidean distances between each simulated data set and the observed data set in the space of summary statistics are computed, and only the simulated data sets closest to the observed data set are retained. The parameter values used to simulate these selected data sets provide a sample of parameter values approximately distributed according to their own posterior distribution. Beaumont et al. (2002) have shown that a local linear regression (third step = estimation step) provides a better approximation of the posterior distribution. This synopsis of ABC is well established and we now concentrate on more specific issues that are implemented in DIYABC.

Thanks to coalescence theory, it has become easy to simulate data sets by a two-step procedure. The first step consists of building a genealogical tree of sampled genes according to rather simple rules provided by coalescence theory (see below). The second step consists of attributing allelic states to all the nodes of the genealogy, starting from the common ancestor and simulating mutations according to the mutation model of the genetic markers. In a complex scenario, only the first step needs special attention and we will concentrate on it now.

In a single isolated population of constant effective size, the genealogical tree of a sample of genes is simulated backward in time: starting from the time of sampling, the gene lineages are merged (coalesced) at times that are drawn from an exponential distribution with rate j(j − 1)/(4N_e) when there are j distinct lineages and the (diploid) effective population size is N_e. The genealogical tree is completed when there remains a single lineage.

Consider now two isolated populations (of effective population sizes N1 and N2, respectively) that diverged t_d generations before their common sampling time. Since the two populations do not exchange genes, lineages within each population will coalesce independently. Coalescence simulation will stop either when there remains a single lineage or when the simulated time is beyond the divergence (looking back in time). In the latter case, the coalescence event is simply discarded. At generation t_d, the remaining lineages are simply pooled and will coalesce in the ancestral population. Because of the memoryless property of the exponential distribution, the time to the first coalescence in the ancestral population is independent of the times of the last coalescences in each daughter population and can be simulated as in the single isolated population above. Again, the genealogical tree is completed when there remains a single lineage in the ancestral population.
Note that the two populations need not be sampled at the same generation, since this has no bearing on the simulation process.
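The single-population construction above can be sketched in a few lines of Python. This is our illustration, not DIYABC code; the function name is ours, and the rate j(j − 1)/(4N_e) is taken directly from the description above.

```python
import random

def coalescence_times(n_lineages, n_e, rng=random):
    """Simulate the successive coalescence times of a sample of genes in a
    single isolated diploid population of constant effective size n_e.

    While j lineages remain, the waiting time to the next coalescence is
    exponential with rate j(j-1)/(4*n_e); the tree is complete when a
    single lineage is left."""
    t, times = 0.0, []
    j = n_lineages
    while j > 1:
        rate = j * (j - 1) / (4.0 * n_e)
        t += rng.expovariate(rate)  # memoryless exponential waiting time
        times.append(t)
        j -= 1                      # one coalescence merges two lineages
    return times

# usage: 10 sampled genes, diploid effective size 1,000
times = coalescence_times(10, 1000, random.Random(1))
```

For the two-population divergence case described above, the same routine would simply be stopped at t_d, with any coalescence drawn beyond the divergence discarded and the remaining lineages pooled into the ancestral population.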
Consider now the classic admixture scenario with one admixed and two parental populations, as in Figure 1 in Excoffier et al. (2005). Simulating the complete genealogical tree can be achieved with the following steps: i) coalesce gene lineages in each population independently until reaching the admixture time, ii) distribute the remaining lineages of the admixed population among the two parental populations, each with a Bernoulli draw with probability equal to the admixture rate, iii) coalesce gene lineages in the two parental populations until reaching their divergence time, iv) pool the remaining gene lineages of the two parental populations and place them into the ancestral population and v) coalesce gene lineages in the ancestral population.

We first note the modular form of this algorithm, which involves only three modules:
1. a module that performs coalescences in an isolated constant-size population between two given times,
2. a module that pools gene lineages from two populations (for divergence),
3. a module that splits gene lineages from the admixed population between two parental populations (for admixture).

We also note that the last two modules are quite simple and that the first one might be extended to include population size variations. We have introduced a fourth module that proves useful in many instances. It performs the (simple) task of adding a gene sample to a population at a given generation. The interest of this module is to allow for multiple samples of the same population taken at different generations. By combining the aforementioned four modules, it is possible to simulate genetic data involving any number of populations according to a scenario that can include divergence and admixture events as well as population size variations. In addition, populations can be sampled more than once at different times.
Compared to our previous definition of complex scenarios, the only restriction so far concerns the absence of migration among populations. If migrations have to be taken into account, coalescences in two (or more) populations exchanging migrants are no longer independent and should be treated in the same module. Such a module would require considering simultaneously two kinds of events: coalescences of lineages within populations and migrations of gene lineages from one population to another. In the current stage of DIYABC, this has not yet been achieved.
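Of the modules listed above, the admixture module (module 3) reduces to a sequence of Bernoulli draws, one per lineage. The sketch below is our illustration of that step under the stated assumptions; the function name and the representation of lineages as opaque objects are ours.

```python
import random

def split_admixed_lineages(lineages, admixture_rate, rng=random):
    """Module 3 sketch: looking backward in time, each remaining lineage
    of the admixed population is sent to parental population 1 with
    probability equal to the admixture rate, and otherwise to parental
    population 2 (a Bernoulli draw per lineage)."""
    parent1, parent2 = [], []
    for lin in lineages:
        (parent1 if rng.random() < admixture_rate else parent2).append(lin)
    return parent1, parent2

# usage: 60 lineages remaining at the admixture time, admixture rate 0.6
p1, p2 = split_admixed_lineages(list(range(60)), 0.6, random.Random(0))
```

Each returned list is then handed to the coalescence module of its parental population, which is what makes the modular composition described above possible.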
Simulating coalescences can be performed in two ways. The most traditional way is based on the usually fulfilled assumption that the effective population size is large enough that the probability of coalescence is small, and hence that the probability of two or more coalescences occurring in the same generation is low enough to be neglected (e.g. Nordborg, 2007). Time is then considered as a continuous variable in computations. The corresponding algorithm, called here the continuous time (CT) algorithm, consists in first drawing the times between two successive coalescence events and then drawing two lineages at random at each coalescence event.

However, in practice, population size can be so small (e.g. during a bottleneck) that multiple coalescences in the same generation become common, including with the same parental gene (producing multifurcating trees). Simulating gene genealogies with multiple coalescences is possible (e.g. Laval and Excoffier, 2004). In effect, lineages are reconstructed one generation at a time: lineages existing at generation g are given a random number drawn in U[1, N_e] and lineages with the same number coalesce together. The latter is termed here the generation-by-generation (GbG) algorithm.

The CT algorithm is much faster in most cases and is used in most software, but in some circumstances the approximation becomes unacceptable. The solution taken in DIYABC is to switch between the two algorithms according to a criterion based on the effective population size, the time during which the effective size keeps its value, and the number of lineages at the start of the module. The criterion is such that the GbG algorithm is taken whenever it is faster (this occurs when the effective size is very small) or when the CT algorithm overestimates by more than 5% on average the number of lineages at the end of the module. A specific comparison study has been performed to establish this criterion.
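One GbG step can be written as follows. This is our sketch, not DIYABC code; we draw parents among the 2N_e gene copies of a diploid population, and lineages drawing the same parental gene are grouped, which naturally produces multiple and multifurcating coalescences in a single generation.

```python
import random

def gbg_one_generation(lineages, n_e, rng=random):
    """Generation-by-generation step: each lineage draws a parental gene
    uniformly among the 2*n_e gene copies of the previous generation;
    lineages drawing the same parent coalesce together, so several
    coalescences (including multifurcations) can occur in one generation."""
    parents = {}
    for lin in lineages:
        parents.setdefault(rng.randrange(2 * n_e), []).append(lin)
    # each group of lineages sharing a parent becomes one surviving lineage
    return list(parents.values())

# usage: 20 lineages in a severely bottlenecked population (n_e = 5)
groups = gbg_one_generation(list(range(20)), n_e=5, rng=random.Random(3))
```

Iterating this step generation after generation, and replacing each group by its single ancestral lineage, rebuilds the genealogy without the CT approximation.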
For different time periods counted in numbers of generations (g), effective population sizes (N_e) and numbers of entering lineages (n_el), coalescences have been simulated according to each algorithm 10,000 times, and the average number of remaining lineages at the end of the period has been recorded, as well as the average computation duration of each algorithm. Our results (Figure S2) show that rules of the following form optimize computation speed while keeping the relative bias in coalescence rates under the 5% threshold:

if (… < g ≤ …) do CT if n_el/N_e < …, else do GbG
if (… < g ≤ …) do CT if n_el/N_e < …, else do GbG
if (… < g) do CT if n_el/N_e < …, else do GbG

Using ABC to compare different scenarios and infer their posterior probabilities has been performed in two ways in the literature. Starting with a reference table containing parameters and summary statistics obtained with the different scenarios to be compared (or pooling reference tables, each obtained with a given scenario), data sets are ordered by increasing distance to the observed data set. A first method (termed hereafter the direct approach) is to take as an estimate of the posterior probability of a scenario the proportion of data sets obtained with this scenario among the n_δ closest data sets (Miller et al., 2005; Pascual et al., 2007). The value of n_δ is arbitrary and, unless the results are quite clear-cut, the estimated posterior probability may vary with n_δ.

Following the same rationale that introduced the local linear regression in the estimation of posterior distributions of parameters (Beaumont et al., 2002), we perform a weighted polychotomous logistic regression to estimate the posterior probabilities of scenarios, termed hereafter the logistic approach (see also Fagundes et al., 2007; Beaumont, 2008). In the estimation of parameters, a linear regression is performed with the parameter as dependent variable and the differences between the observed and simulated statistics as predictors.
This linear regression is local at the point (in the predictor space) corresponding to the observed data set, using an Epanechnikov kernel based on the distance between observed and simulated summary statistics (see formula (5) in Beaumont et al., 2002). Parameter values are then replaced by their estimates at that point in the regression.

Keeping the differences between observed and simulated statistics as the predictor variables in the regression, we now consider the posterior probability of scenarios as the dependent variable. Because of the nature of the "parameter", an indicator of the scenario number, a logit link function is applied to the regression. The local aspect of the regression is obtained by taking the same weights as in the linear adjustment of parameter values described in Beaumont et al. (2002). Confidence intervals for the posterior probabilities of scenarios are computed through the limiting distribution of the maximum likelihood estimators. See Annex 1 in Supplementary material for a detailed explanation.
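The rejection step and the Epanechnikov weighting can be sketched as below. The data are synthetic and the full polychotomous logistic fit is beyond a short sketch, so as a stand-in we compute weighted scenario proportions, which correspond to the intercept-only special case of that regression; all names are ours.

```python
import numpy as np

def epanechnikov_weights(distances, delta):
    """Epanechnikov kernel as in Beaumont et al. (2002): weight
    1 - (d/delta)^2 for d < delta, and 0 beyond, where delta is the
    distance of the farthest retained data set."""
    u = distances / delta
    return np.where(u < 1.0, 1.0 - u ** 2, 0.0)

rng = np.random.default_rng(0)
stats = rng.normal(size=(5000, 4))        # simulated summary statistics
scenario = rng.integers(1, 4, size=5000)  # scenario index (1, 2 or 3) per row
observed = np.zeros(4)                    # summary statistics of the "observed" data

# rejection step: keep the n_delta data sets closest to the observed one
d = np.linalg.norm(stats - observed, axis=1)
keep = np.argsort(d)[:500]
w = epanechnikov_weights(d[keep], d[keep].max())

# intercept-only stand-in for the weighted polychotomous logistic regression:
post = {s: w[scenario[keep] == s].sum() / w.sum() for s in (1, 2, 3)}
```

With the synthetic table above the three estimated probabilities are necessarily close to 1/3; on a real reference table the weights concentrate the estimate on the scenarios that reproduce the observed summary statistics best.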
In order to measure bias and precision, we need to simulate data sets (i.e. test data sets) with known values of parameters and compare estimates with their true values. In the ABC estimation procedure, the most time-consuming task is to produce a large enough reference table. However, when such a reference table has been produced, e.g. for the analysis of a real data set, it can also be used to quantify bias and precision on test data sets. Measuring bias is straightforward, but precision can be assessed with different measures. In DIYABC, the latter include the relative square root of the mean square error, the relative square root of the mean integrated square error, the relative mean absolute deviation, the 95% and 50% coverages and the factor 2. See Annex 2 in Supplementary material for more details.
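Measures of this kind can be computed as follows. These are our formulas, written for illustration; DIYABC's exact definitions are given in Annex 2 of the Supplementary material.

```python
import numpy as np

def precision_measures(estimates, true_value):
    """Illustrative error measures for point estimates obtained on test
    data sets simulated with a known parameter value: relative bias,
    relative root mean square error, and the 'factor 2' (proportion of
    estimates within a factor of two of the true value)."""
    est = np.asarray(estimates, dtype=float)
    rel_bias = (est.mean() - true_value) / true_value
    rel_rmse = np.sqrt(((est - true_value) ** 2).mean()) / true_value
    factor2 = np.mean((est >= true_value / 2) & (est <= true_value * 2))
    return rel_bias, rel_rmse, factor2

# usage: three estimates of an effective size whose true value is 1,000
b, r, f2 = precision_measures([900.0, 1100.0, 1300.0], 1000.0)
```

Because the test data sets reuse the reference table already produced for the real analysis, these quantities come at little extra computational cost, as noted in the Introduction.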
DIYABC is a program that performs ABC inference on population genetic data. In its current state, the data are genotypes at microsatellite loci of samples of diploid individuals (missing data are allowed). The inference bears on the evolutionary history of the sampled populations, by quantifying the relative support of the data for possible scenarios and by estimating posterior densities of associated parameters.
DIYABC is written in Delphi, runs under a 32-bit Windows operating system (e.g. Windows XP) and has a user-friendly graphical interface. The program accepts complex evolutionary scenarios involving any number of populations and samples. Scenarios can include any number of the following timed events: stepwise change of effective population size, population divergence and admixture. They can also include unsampled as well as serially sampled populations, as in Beaumont (2003). The main restriction regarding scenario complexity is the absence of migration between populations.

Since the program has been written for microsatellite data, it proposes two mutation models, namely the stepwise mutation model (SMM) and the generalized stepwise mutation (GSM) model (Estoup et al., 2002). Note that the same mutation model has to be applied to all microsatellite loci, but loci may have different values of the mutation parameters.

The historico-demographic parameters of scenarios may be of three types: effective sizes, times of events (in generations) and admixture rates. Marker parameters are mutation rates and the coefficient of the geometric distribution (under the GSM only). The program can also estimate composite parameters such as θ = 4N_eµ and τ = tµ, with N_e the diploid effective population size, t the time of an event and µ the mean mutation rate. Prior distributions are defined for original parameters, and those for composite parameters are obtained via an assumption of independence of their component prior densities. Priors for historico-demographic parameters can be chosen among four common distributions: Uniform, Log-uniform, Normal and Log-normal. Users can set the minimum and maximum (for all distributions) and the mean and standard deviation (for Normal and Log-normal).
In addition, priors can be modified by setting binary conditions (>, <, ≥ and ≤) on pairs of parameters of the same category (two effective sizes or two times of events). This is especially useful to control the relative times of events when these are parameters of the scenario. For priors of mutation parameters, only the Uniform and the Gamma distributions are considered, but hierarchical schemes are possible, with a mean mutation rate or coefficient P (of the geometric distribution in the GSM) drawn from a given prior and individual locus parameter values drawn from a gamma distribution around the mean.

Available summary statistics are the usual population genetic statistics averaged over loci, e.g. mean number of alleles, mean genic diversity, F_ST, (δµ)², admixture rates, etc.

Regarding ABC computations, the program can i) create a reference table or append values to an existing table, ii) compute the posterior probabilities of different scenarios, iii) estimate the posterior distributions of original and/or composite parameters for one or more scenarios and iv) compute bias and precision for a given scenario and given values of parameters. Finally, the program can also be used simply to simulate data sets in the popular Genepop format (Raymond and Rousset, 1995).
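The two prior devices just described, pairwise conditions and hierarchical gamma schemes, can be sketched as follows. This is our illustration under the stated assumptions: the function names, prior bounds and gamma shape are ours, not DIYABC settings.

```python
import random

def draw_times_with_condition(prior_t1, prior_t2, rng=random):
    """Draw two event times from uniform priors subject to the binary
    condition t1 <= t2, by resampling any pair that violates it (one
    simple way to honour a pairwise condition on same-category
    parameters)."""
    while True:
        t1, t2 = rng.uniform(*prior_t1), rng.uniform(*prior_t2)
        if t1 <= t2:
            return t1, t2

def draw_locus_rates(n_loci, mean_prior=(1e-4, 1e-3), shape=2.0, rng=random):
    """Hierarchical scheme sketch: a mean mutation rate is drawn from its
    prior, then per-locus rates are drawn from a gamma distribution whose
    mean equals that value (gammavariate takes shape and scale, and
    mean = shape * scale)."""
    mu_bar = rng.uniform(*mean_prior)
    return [rng.gammavariate(shape, mu_bar / shape) for _ in range(n_loci)]

# usage: a divergence constrained to predate an admixture, and 10 locus rates
t1, t2 = draw_times_with_condition((0, 100), (0, 1000), random.Random(1))
rates = draw_locus_rates(10, rng=random.Random(2))
```

Rejection sampling is the simplest way to impose such conditions; it only becomes costly when the constrained region has very small prior probability.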
In order to illustrate the capabilities of DIYABC, we first take an example based on a data set simulated according to a complex scenario including three splits and two admixture events (scenario 1 in Figure 1). The scenario includes six populations: two of them have been sampled at time 0, the third one at time 2 and the fourth one at time 4; the last two have not been sampled. Each population sample includes 30 diploid individuals and data are simulated at 10 microsatellite loci. This scenario is not purely theoretical, as it could be applied, for instance, to European populations of honeybees, in which the Italian populations (Apis mellifera ligustica) result from an ancient admixture between two evolutionary branches (Franck et al., 2000) that would correspond here to population samples 1 and 4. Furthermore, in the last 50 years, Italian bees have been widely exported, and sample 2 could well correspond to a population of a parental branch that has been recently introgressed by Italian queens. This example also stresses the ability of DIYABC to distinguish two events that are confounded in the usual admixture scheme: the admixture event itself and the time at which the real parental populations of the admixture diverged from the populations taken as "parental".

Our ABC analysis will address the following questions: 1) Suppose that we are not sure that the scenario having produced our example data set does include a double admixture and that we want to challenge this double admixture scenario with two simpler scenarios, one with a single admixture (scenario 2 in Figure 1) and the other with no admixture at all (scenario 3). What are the posterior probabilities of these three scenarios, given identical prior probabilities? 2) What are the posterior distributions of parameters, given that the right scenario is known? And 3) What confidence can we have in the posterior probabilities of scenarios and the posterior distributions of parameters?

First, a reference table is built up. Using different screens of DIYABC, i) the three scenarios are coded and prior distributions of parameters are defined (Figure S3), ii) based on previous studies (e.g. Dib et al., 1994 in Pascual et al., 2007), the Generalized Stepwise Mutation model is selected and prior distributions of mutation parameters are defined (Figure S4), iii) motif sizes and allele ranges of loci are set (Figure S5) and iv) summary statistics are selected (Figure S6). After some hours, a reference table with 6 million simulated data sets (i.e. 2 million per scenario) is produced.
Fig. 1.
First example: the three evolutionary scenarios. The data set used as an example has been simulated according to scenario 1 (left). The parameter values were the following: all populations had an effective (diploid) size of 1,000; the times of the successive events (backward in time) were t1=10, t2=500, t3=10,000, t4=20,000 and t5=200,000; the two admixture rates were r1=0.6 and r2=0.4. Scenario 1 includes 6 populations, the four that have been sampled and the two left parental populations of the admixture events. Scenarios 2 and 3 include 5 and 4 populations, respectively. Samples 3 and 4 have been collected 2 and 4 generations earlier than the first two samples, hence their slightly upward locations on the graphs. Time is not to scale.

To answer the first question, the n_δ=60,000 (1%) simulated data sets closest to the pseudo-observed data set are selected for the logistic regression, and n_δ=600 (0.01%) for the direct approach. The answer appears in two graphs (upper row in Figure S7). Both approaches are congruent and show that scenario 1 is significantly better supported by the data than the other scenarios.

To answer the second question, scenario 1 is chosen and posterior distributions of parameters are estimated taking the 20,000 (1%) closest simulated data sets, after applying a logit transformation to parameter values. Here again, the output is mostly graphical. Each graph provides the prior and posterior distributions of the corresponding parameter (Figure S8). Below each graph are given the mean, median and mode, as well as four quantiles (0.025, 0.05, 0.95 and 0.975), of the posterior distribution (Table S1 in Supplementary material). Since the true values are known, we can remark that some parameters are rather well estimated, with peaked posteriors, such as the common effective population size and the two admixture rates, whilst others, including all time parameters, suggest that the data are not very informative for them.
Very similar results (not shown) have been obtained with 5,000 and 40,000 simulated data sets selected for the local linear regression, as well as when using a smaller reference table (1 million data sets).

To evaluate the confidence that can be put in the posterior probabilities of scenarios, 500 test data sets were simulated with each scenario and known parameter values (i.e. the same values as those used to produce the original data set). Posterior probabilities of the three scenarios were estimated as above and used to compute type I and type II errors in the choice of scenario. Results show that scenario 3 is always rightly chosen or excluded. Consequently, the type I error for scenario 1 is identical to the type II error for scenario 2, and vice versa. For scenario 1, type I errors amount to 0.414 and 0.3 for the direct approach and the logistic regression, respectively, whereas type II errors amount to 0.014 and 0.020 (cf. Figures S9, S10 and S11 for detailed distributions of scenario probabilities). The 500 test data sets simulated with scenario 1 have also been used to estimate posterior distributions of parameters, taking the same proportion (1%) of closest simulated data sets as above. Relative biases and dispersion measures are given in Table S2 (upper part). It is clear that several parameters are biased and/or dispersed, the worst case being that of parameter t. The bias is undoubtedly related to the lack of information in the data, so that point estimates are drawn towards the mean values of the prior distributions. The effect of the prior distribution is also illustrated in the lower part of Table S2, which provides the same measures but obtained with different prior distributions for effective size and time of event parameters.

Our second example concerns populations of the Silvereye,
Zosterops lateralis lateralis (Estoup and Clegg, 2003). During the 19th and 20th centuries, this bird colonised Southwest Pacific islands from Tasmania. The importance of serial founder events in the microevolution of this species has been questioned in a study based on a six-microsatellite-locus data set (Clegg et al., 2002).

Our analysis with DIYABC differs in at least four respects from the initial ABC analysis of the same data set by Estoup and Clegg (2003). First, all island populations are treated here in the same analysis whereas, for tractability reasons, independent analyses were previously made using all pairs of populations. Second, the initial treatment was based on the algorithm of Pritchard et al. (1999), whereas DIYABC uses the local linear regression method of Beaumont et al. (2002), which allows a larger number of statistics (see below) and hence makes better use of the data. Third, we have chosen here non-informative flat priors for all demographic parameters. Fourth, because DIYABC is able to treat samples collected at different times, we did not have to pool samples collected in different years from the same island and average the sample collection year over islands. We hence end up with a colonization scenario involving five populations and seven samples, two samples having been collected at different times on two different islands (Figure 2). The sequence and dates of colonisation by silvereyes of New Zealand (South and North Islands) and of Chatham and Norfolk islands have been historically documented. This allows the times of the putative population size fluctuation events in the coalescent gene trees to be fixed, thus limiting the number of parameters. Our scenario was specified by six unknown demographic parameters: the stable effective population size (N_S) and the duration of the initial bottleneck (D_B), both assumed to be the same on all islands, and potentially different effective numbers of founders in Norfolk, Chatham and South and

Fig. 2.
Second example: screenshot of the scenario used in the analysis of the Zosterops lateralis lateralis data set. In 1830, Z. l. lateralis colonised the South Island of New Zealand (Pop2) from Tasmania (Pop1). In the following years, the population began expanding and dispersing, and reached the North Island by 1856 (Pop3). Chatham Island (Pop4) was colonised in 1856 from the South Island, and Norfolk Island (Pop5) was colonised in 1904 from the North Island (historical information reviewed in Estoup and Clegg, 2003). Sample collection times are 1997 for Tasmania (Sa1), the South and North Islands of New Zealand (Sa2 and Sa3, respectively), Chatham Island (Sa4) and Norfolk Island (Sa5), 1994 for the second sample from Norfolk (Sa6), and 1992 for the second sample from the North Island of New Zealand (Sa7). Splitting events and sampling dates in years were translated into numbers of generations since the most recent sampling date by assuming a generation time of three years (Estoup and Clegg, 2003). We hence fixed t1, t2, t3 and t4 to 31, 47, 47 and 56 generations, respectively.
North Island of New Zealand (N_F1, N_F2, N_F3 and N_F4, respectively). As in Estoup and Clegg (2003), we also assumed that all populations evolved as totally isolated demes after the date of colonisation. We chose uniform priors U[300, ] for N_S, U[2, ] for all N_Fi and U[1, ] for D_B. Prior information regarding the mutation rate and model for microsatellites was the same as in the previous example. Summary statistics included the mean number of alleles, the mean genic diversity (Nei, 1987), the mean coefficient M (Garza and Williamson, 2001), F_ST between pairs of population samples (Weir and Cockerham, 1984), and the mean classification index, also called mean individual assignment likelihood (Pascual et al., 2007). We produced a reference table with 1 million simulated data sets and estimated parameter posterior distributions by taking the 10,000 (1%) simulated data sets closest to the observed data set for the local linear regression, after applying a logit transformation to parameter values. Similar results were obtained when taking the 2,000 to 20,000 closest simulated data sets and when using a log or log-tangent transformation of parameters, as proposed in Estoup et al. (2004) and Hamilton et al. (2005) (options available in DIYABC).

Results for the main demographic parameters are presented in Table 1. They indicate the colonization by a small number of

Table 1. Second example: mean, median, mode, 5% and 95% quantiles and standard deviation of posterior distribution samples for effective population sizes (original parameters) for the Zosterops lateralis lateralis data set.

Parameter             mean   median   mode   Q0.05   Q0.95   st. deviation
N_S
N_F1 (Norfolk)          19       18     16       9      33             8.7
N_F2 (Chatham)         202      173    108      55     435             118
N_F3 (South Island)    197      168    112      55     430             116
N_F4 (North Island)    293      288    278     129     470             105

Fig. 3. Second example: posterior distributions of the bottleneck severity (see definition in text) for the invasions of four Pacific islands by Zosterops lateralis lateralis, plotted as density against log10(bottleneck intensity). The four discontinuous lines with small dashes, dots, dash-dots and long dashes correspond to Norfolk, Chatham, North Island and South Island of New Zealand, respectively. The continuous line corresponds to the prior distribution, which is identical for each island. This graph was made with the locfit function of the R statistical package (Ihaka and Gentleman, 1996), using an option of
DIYABC which saves the sample of parameter values adjusted by the local linear regression (Beaumont et al., 2002).

founders and/or a slow demographic recovery after foundation for Norfolk Island only (median N_F1 value of 18 individuals). Other island populations appear to have been founded by silvereye flocks of larger size and/or to have recovered quickly after foundation. In agreement with this, the bottleneck severity (computed as BS_i = D_B × N_S / N_Fi) was more than one order of magnitude larger for the population from Norfolk than for the other island populations (Figure 3). These results are in the same vein as those obtained by Estoup and Clegg (2003) and agree with their main conclusions. Discrepancies in parameter estimation are observed, however (e.g. larger N_S values and more precise inferences for N_F2, N_F3 and N_F4 in the present treatment). This was expected, owing to the differences in methodological design underlined above. With the possibility of treating all population samples together, DIYABC allows a more elaborate and satisfactory treatment compared to previous analyses (Estoup and Clegg, 2003; Miller et al., 2005).
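The rejection-plus-regression step used above (retain the simulated data sets closest to the observed one, then adjust logit-transformed parameter values by weighted local linear regression on the summary statistics) can be sketched as follows. This is a minimal illustration in the spirit of Beaumont et al. (2002), not DIYABC's implementation; the function name and toy data are invented:

```python
import numpy as np

def logit(x, lo, hi):
    p = (x - lo) / (hi - lo)
    return np.log(p / (1.0 - p))

def inv_logit(z, lo, hi):
    return lo + (hi - lo) / (1.0 + np.exp(-z))

def abc_regression_adjust(theta, stats, s_obs, bounds, accept_frac=0.01):
    """ABC rejection followed by local linear regression adjustment.

    theta  : (n, p) parameter values from the reference table
    stats  : (n, k) corresponding summary statistics
    s_obs  : (k,)   observed summary statistics
    bounds : (p, 2) prior bounds, used for the logit transformation
    """
    # Standardise each summary statistic before computing distances.
    sd = stats.std(axis=0)
    d = np.sqrt((((stats - s_obs) / sd) ** 2).sum(axis=1))
    n_keep = max(int(accept_frac * len(theta)), 2)
    keep = np.argsort(d)[:n_keep]
    # Epanechnikov kernel weights on the retained distances.
    w = 1.0 - (d[keep] / d[keep].max()) ** 2
    # Logit-transform parameters so adjusted values respect the prior bounds.
    z = logit(theta[keep], bounds[:, 0], bounds[:, 1])
    # Weighted least squares of transformed parameters on centred statistics.
    X = np.column_stack([np.ones(n_keep), (stats[keep] - s_obs) / sd])
    sw = np.sqrt(w)[:, None]
    beta = np.linalg.lstsq(sw * X, sw * z, rcond=None)[0]
    # Project each retained point onto the observed statistics.
    z_adj = z - X[:, 1:] @ beta[1:]
    return inv_logit(z_adj, bounds[:, 0], bounds[:, 1]), w
```

Posterior summaries (mean, median, mode and quantiles as in Table 1) are then weighted statistics of the adjusted sample; varying `accept_frac` mimics the check on the 2,000 to 20,000 closest data sets reported above.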
So far, the ABC approach has remained inaccessible to most biologists because of the complex computations involved. With DIYABC, non-specialists can now perform ABC-based inference on a variety of complex population evolutionary scenarios, without reducing them to simple standard situations, and hence make better use of their data. In addition, the program allows them to compare competing scenarios and quantify their relative support by the data. Finally, it provides a way to evaluate the amount of confidence that can be placed in the various estimations. The main limitations of the current version of DIYABC are the assumed absence of migration among populations after they have diverged and the mutation models, which mostly concern microsatellite loci. Future developments will aim at progressively removing these limitations.
FUNDING
The development of DIYABC has been supported by a grant from the French National Research Agency (project MISGEPOP) and an EU grant awarded to JMC as an EIF Marie-Curie Fellowship (project StatInfPopGen), which allowed him to spend two years in D.J.B.'s Epidemiology and Public Health department at Imperial College (London, UK), where he wrote a major part of this program.
ACKNOWLEDGEMENTS
We are most grateful to Sonya Clegg for authorizing the use of silvereye microsatellite data in our second example.
REFERENCES
Beaumont, M.A., 1999. Detecting population expansion and decline using microsatellites. Genetics, 2013-2029.
Beaumont, M.A., 2003. Estimation of population growth or decline in genetically monitored populations. Genetics, 1139-1160.
Beaumont, M.A., 2008. Joint determination of topology, divergence time, and immigration in population trees. In Simulation, Genetics, and Human Prehistory, eds S. Matsumura, P. Forster and C. Renfrew. McDonald Institute Press, University of Cambridge (in press).
Beaumont, M.A., W. Zhang and D.J. Balding, 2002. Approximate Bayesian computation in population genetics. Genetics, 2025-2035.
Bertorelle, G. and L. Excoffier, 1998. Inferring admixture proportions from molecular data. Mol. Biol. Evol., 1298-1311.
Clegg, S.M., S.M. Degnan, J. Kikkawa, D. Moritz, A. Estoup and I.P.F. Owens, 2002. Genetic consequences of sequential founder events by an island colonising bird. Proc. Natl. Acad. Sci., 8127-8132.
De Iorio, M. and R.C. Griffiths, 2004. Importance sampling on coalescence histories. II: subdivided population models. Adv. Appl. Probab., 434-454.
Estoup, A., P. Jarne and J.M. Cornuet, 2002. Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Mol. Ecol., 1591-1604.
Estoup, A. and S.M. Clegg, 2003. Bayesian inferences on the recent island colonization history by the bird Zosterops lateralis lateralis. Mol. Ecol., 657-674.
Estoup, A., M.A. Beaumont, F. Sennedot, C. Moritz and J.M. Cornuet, 2004. Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus. Evolution, 2021-2036.
Excoffier, L., A. Estoup and J.M. Cornuet, 2005. Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics, 1727-1738.
Fagundes, N.J.R., N. Ray, M.A. Beaumont, S. Neuenschwander, F. Salzano, S.L. Bonatto and L. Excoffier, 2007. Statistical evaluation of alternative models of human evolution. Proc. Natl. Acad. Sci., 17614-17619.
Franck, P., L. Garnery, G. Celebrano, M. Solignac and J.M. Cornuet, 2000. Hybrid origins of honeybees from Italy (Apis mellifera ligustica) and Sicily (A. m. sicula). Mol. Ecol., 907-92.
Garza, J.C. and E. Williamson, 2001. Detection of reduction in population size using data from microsatellite DNA. Mol. Ecol., 305-318.
Griffiths, R.C. and S. Tavaré, 1994. Simulating probability distributions in the coalescent. Theor. Pop. Biol., 131-159.
Hamilton, G., M. Stoneking and L. Excoffier, 2005. Molecular analysis reveals tighter social regulation of immigration in patrilocal populations than in matrilocal populations. Proc. Natl. Acad. Sci. USA, 7476-7480.
Hey, J. and R. Nielsen, 2004. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 747-760.
Hickerson, M.J., E. Stahl and N. Takebayashi, 2007. msBayes: pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation. BMC Bioinformatics, 268-274.
Ihaka, R. and R. Gentleman, 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat., 299-314.
Laval, G. and L. Excoffier, 2004. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics, 2485-2487.
Marjoram, P., J. Molitor, V. Plagnol and S. Tavaré, 2003. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci., 15324-15328.
Miller, N., A. Estoup, S. Toepfer, D. Bourguet, L. Lapchin, S. Derridj, K.S. Kim, P. Reynaud, F. Furlan and T. Guillemaud, 2005. Multiple transatlantic introductions of the western corn rootworm. Science, p. 992.
Nei, M., 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, 512 pp.
Nordborg, M., 2007. Coalescent theory. p. 843-877 in Handbook of Statistical Genetics, ed. by D.J. Balding, M. Bishop and C. Cannings, 3rd ed., Wiley & Sons, Chichester, U.K.
O'Ryan, C., M.W. Bruford, M. Beaumont, R.K. Wayne, M.I. Cherry and E.H. Harley, 1998. Genetics of fragmented populations of African buffalo (Syncerus caffer) in South Africa. Animal Conservation, 85-94.
Pascual, M., M.P. Chapuis, F. Mestres, J. Balanyá, R.B. Huey, G.W. Gilchrist, L. Serra and A. Estoup, 2007. Introduction history of Drosophila subobscura in the New World: a microsatellite-based survey using ABC methods. Mol. Ecol., 3069-3083.
Pritchard, J., M. Seielstad, A. Perez-Lezaun and M. Feldman, 1999. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol., 1791-1798.
Raymond, M. and F. Rousset, 1995. GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J. Hered., 248-249.
Stephens, M. and P. Donnelly, 2000. Inference in molecular population genetics (with discussion). J. R. Stat. Soc. B, 605-655.
Tavaré, S., D.J. Balding, R.C. Griffiths and P. Donnelly, 1997. Inferring coalescence times from DNA sequences. Genetics, 505-518.
Wang, J., 2003. Maximum-likelihood estimation of admixture proportions from genetic data. Genetics, 747-765.
Weir, B.S. and C.C. Cockerham, 1984. Estimating F-statistics for the analysis of population structure. Evolution, 1358-1370.
Weiss, G. and A. von Haeseler, 1998. Inference of population history using a likelihood approach. Genetics, 1539-1546.
Wilson, I.J., M.E. Weale and D.J. Balding, 2003. Inferences from DNA data: population histories, evolutionary processes, and forensic match probabilities. J. R. Stat. Soc. A, 155-187.