High performance computation of landscape genomic models integrating local indices of spatial association
Sylvie Stucki, Pablo Orozco-terWengel, Michael W. Bruford, Licia Colli, Charles Masembe, Riccardo Negrini, Pierre Taberlet, Stéphane Joost, NEXTGEN Consortium
HHigh performance computation of landscape genomicmodels integrating local indices of spatial association
Sylvie Stucki , Pablo Orozco-terWengel , Michael W. Bruford , Licia Colli ,Charles Masembe , Riccardo Negrini , Pierre Taberlet , Stéphane Joost ∗ , andthe NEXTGEN Consortium Laboratory of Geographic Information Systems (LASIG), School of Architecture, Civil andEnvironmental Engineering (ENAC), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015Lausanne, Switzerland School of Biosciences, Sir Martin Evans Building, Cardiff University, Cardiff, UK Istituto di Zootecnica and BioDNA - Centro di Ricerca sulla Biodiversità e sul DNA Antico,Università Cattolica del S. Cuore, via E. Parmense 84, 29100 Piacenza, Italy Makerere University, Institute of Environment and Natural Resources, Molecular BiologyLaboratory, Kampala, Uganda Associazione Italiana Allevatori, Roma, Italy CNRS, Laboratoire d’Ecologie Alpine, Grenoble, France Univ. Grenoble Alpes, Laboratoire d’Ecologie Alpine, Grenoble, France http://nextgen.epfl.ch Abstract
Since its introduction, landscape genomics has developed quickly with the in-creasing availability of both molecular and topo-climatic data. The current chal-lenges of the field mainly involve processing large numbers of models and disen-tangling selection from demography. Several methods address the latter, either byestimating a neutral model from population structure or by inferring simultaneouslyenvironmental and demographic effects. Here we present Sam β ada, an integratedapproach to study signatures of local adaptation, providing rapid processing of wholegenome data and enabling assessment of spatial association using molecular markers.Specifically, candidate loci to adaptation are identified by automatically assessinggenome-environment associations. In complement, measuring the Local Indicatorsof Spatial Association (LISA) for these candidate loci allows to detect whether sim-ilar genotypes tend to gather in space, which constitutes a useful indication of thepossible kinship relationship between individuals. In this paper, we also analyze SNPdata from Ugandan cattle to detect signatures of local adaptation with Sam β ada,BayEnv, LFMM and an outlier method (FDIST approach in Arlequin) and comparetheir results. Sam β ada is an open source software for Windows, Linux and MacOSX available at http://lasig.epfl.ch/sambada . ∗ to whom correspondence should be addressed a r X i v : . [ q - b i o . P E ] N ov Introduction
In the late 1970s, Mitton et al. (1977) first had the idea to correlate the frequency ofalleles with an environmental variable to look for signatures of selection in ponderosapine. Thirty years later, Joost et al. ’s (2007; 2008) developed the same concept inorder to allow parallel processing of large numbers of logistic regressions. Since then,no noticeable progress was observed in the development of the correlative approachesused in the emerging field of landscape genomics until recently. During this periodcorrelative approaches were used in parallel with population genetics outlier-detectionmethods (e.g. Beaumont and Nichols, 1996; Vitalis et al. , 2003; Foll and Gaggiotti, 2008)as cross-validation (e.g. Jones et al. , 2013; Henry and Russello, 2013) to detect signaturesof selection (see a review in Vitti et al. , 2013). However, while such methods are stillin vogue (e.g. Colli et al. , 2014; Lv et al. , 2014), there has been a revival in the interestof developing new statistical approaches for the emerging field of landscape genomics(e.g. Coop et al. , 2010; Günther and Coop, 2013; Frichot et al. , 2013; Guillot et al. ,2014). For example, BayEnv (Günther and Coop, 2013) implements a Bayesian methodto compute correlations between allele frequencies and ecological variables taking intoaccount differences in sample sizes and shared demographic history. LFMM (Frichot et al. , 2013) estimates the influence of population structure on allele frequencies byintroducing unobserved variables as latent factors. Finally, SGLMM (Guillot et al. ,2014) uses a spatially-explicit computational framework including a random effect toquantify the correlation between genotypes and environmental variables. Yet, importantfunctions are still lacking such as high performance computing capacity to process wholegenome data, and the integration of spatial statistics to support a distinction betweenselection and demographic signals. Here we present the software Sam β ada, which aimsat filling these gaps by offering an open source multivariate analysis framework to detectsignatures of selection. Sam β ada’s use is illustrated with a case study dedicated to thedetection of potentially adaptive loci in 813 Bos taurus and
Bos indicus individuals inUganda genotyped for ∼ β ada’s performance is described withrespect to other state of the art software to detect signatures of selection. Sam β ada uses logistic regressions to model the probability of presence of an allelic vari-ant in a polymorphic marker given the environmental conditions of the sampling loca-tions (Joost et al. , 2007). Since each of the states of a given character is consideredindependently (i.e. as binary presence/absence in each sample), Sam β ada can handlemany types of molecular data (e.g. SNPs, indels, copy number variants and haplotypes),provided the user formats the input. Specifically, biallelic SNPs are recoded as threedistinct genotypes. A maximum likelihood approach is used to fit the models (Dobsonand Barnett, 2008).In the univariate case, each model for a given genotype is compared to a constantmodel, where the probability of presence of the genotype is the same at each location2nd is equal to the frequency of the genotype. Significance is assessed with both log-likelihood ratio ( G ) and Wald tests (Dobson and Barnett, 2008). Bonferroni correction isapplied for multiple comparisons. In order to avoid numerous computations of p -values,the significance threshold α is converted to a minimum score threshold. G and Waldscores are used to compare models rather than Akaike or Bayesian information criterionin order to automate model selection.In comparison to MatSAM (Joost et al. , 2008), Sam β ada proposes several improve-ments: faster processing (see Sam β ada implementation in Material and Methods), mul-tivariate analysis and measures of spatial autocorrelation. Contrary to the univariate approach, the multivariate approach uses several environmen-tal variables to model the presence of each genotype. The model selection procedure isadapted to assess the significance of multivariate models. Both G and Wald tests referto a null model to build the null hypothesis. The current model can be compared to theconstant model (the same as in the univariate analysis) using multivariate χ statistics.While rejecting the null hypothesis in this configuration indicates that at least one pa-rameter in the model is statistically significant, it does not provide information aboutwhich parameter(s) is relevant to the model. Therefore Sam β ada assesses parametersignificance in multivariate models with either a Wald test applied to each parameterseparately (except the constant parameter) or with G tests excluding a parameter at thetime, that is to say model selection is based on simpler models nested in the current one.For the G test, if a model A has q parameters, we define the parents of A as the q modelswith q − A . For instance, if A models the occurrence of genotype X i with 3 environmental variables E , E and E , A = ( X i | E , E , E ) , then the parents of A are the three models( X i | E , E ) , ( X i | E , E ) and ( X i | E , E ) . The name reflect the fact that A can be obtained by adding a parameter to any ofits parents. The parent of A with the highest log-likelihood is used as the referencemodel for the significance test. This way, the G score is the smallest possible among allparents, thus if the null hypothesis is rejected, it will also be rejected by comparing A with each of its parents. This method ensures that adding a new parameter leads to abetter modelling of the presence of the genotype. The overall procedure for fitting andselecting models for each genotype begins with the computation of the constant model.Univariate models are built and tested against the constant one, followed by testingbivariate models against their parents, and so forth until the user-defined maximumnumber of parameters is reached. Let us notice that this procedure can be applied toany correlative approach relying on significance tests.3ultivariate models allow to take into account pre-existant knowledge provided thedata is a continuous variable. Specifically, if the population structure was analyzedbeforehand and can be represented as a coefficient of membership for each individual,this information can be included in bivariate models. For models involving both anenvironmental variable and this coefficient, the selection procedure will assess whetherthe environmental variable is associated with the genotype while taking into accountthe possible effect of admixture. In case there are many ancestral populations, severalcoefficients may be included in the analysis, leading to a highly conservative detectionof selection signatures. Beyond detection of selection signatures, Sam β ada quantifies the level of spatial depen-dence in the distribution of each genotype. This measure of spatial autocorrelation refersto similarities or differences between neighbouring individuals that cannot be explainedby chance. Assessing whether the geographic location has an effect on allele frequencyis especially important in landscape genomics since statistical models assume indepen-dence between samples. Thus if individuals with similar genotypes tend to concentratein space, spurious correlations may co-occur with specific values of environmental vari-ables. On the other hand, spatial independence of data strengthens the confidence inthe detections. It has to be noted that this measure of spatial autocorrelation can beused in complement to any method providing a list of loci possibly under selection.Sam β ada measures the global spatial autocorrelation in the whole dataset withMoran’s I , as well as the spatial dependency of each point with Local Indicators ofSpatial Association (LISA) (Moran, 1950; Anselin, 1995). In practice, LISAs are com-puted by comparing the value of each point with the mean value of its neighbours asdefined by a specific weighting scheme based on a kernel function (see supplementarymaterial). Both a spatially fixed kernel type relying on distance only, and a varyingkernel type considering point density can be used. Sam β ada includes three fixed kernels(moving window, Gaussian and bisquare) and a varying one (nearest neighbours). Thesum of LISAs on the whole dataset is proportional to Moran’s I (Anselin, 1995). Signifi-cance assessment relies on an empirical distribution of the indices. For Moran’s I , values(genotype occurrences) are permutated among the locations of individuals in the wholedataset and a pseudo p -value is computed as the proportion of permutations for which I is equal to - or more extreme (higher for a positive Moran’s I or lower for a negativeMoran’s I ) - than the observed I . For LISA, the pseudo p -value is separately computedfor each point (individual), by keeping the individual of interest fixed and permuting itsneighbouring points with the rest of the dataset. This study addressed local adaptation of Ugandan cattle, which is composed of two mainpopulations. Ankole (or Ankole-Watusi) cattle are a Sanga breed (taurine-zebu cross)4ppeared in the Nile Basin around 2.000 years BC. They migrated southward and arenow found in South-West Uganda, Rwanda and Burundi (Ajmone Marsan et al. , 2010;Ndumu et al. , 2008). Shorthorn zebus were introduced in East Africa around the VIII th century AD, they later spread as they were less affected than taurine and Sanga cattleto rinderpest, but their susceptibility to trypanosomiasis is presumed to have restrainedtheir progression (Ajmone Marsan et al. , 2010). Shorthorn zebus are now common inNorth-East Uganda and are interbreeding with Ankole cattle in the center of the country.These elements match the results of the population structure analysis (see Populationstructure in Material and Methods and figure S1.) Four approaches were applied to detect selection signatures among 40,019 SNPs from804 samples (see Molecular data and Alternative methods to detect selection in Materialand Methods). The statistical significance threshold for Sam β ada, LFMM and Arlequinwas set to 1% before applying Bonferroni correction. For BayEnv, model selection wasbased on the distribution of Bayes Factors (BF) for neutral loci (Coop et al. , 2010).Results were analysed separately for each environmental variable and models showing aBF higher than the 1 st percentile of the neutral distribution were candidate loci. ForBayEnv and Arlequin, samples were classified into populations using a threshold of 0 . β ada identified 2,499 SNPs (6.2 %) potentially subjectto selection, BayEnv 1977 (4.9 %), LFMM 280 (0.7 %) and Arlequin did not identifyany loci as significant. The loci detected by Sam β ada with the highest G scores werecompared among methods in table S3. Thirty-six loci were identified by the three cor-relative methods and three of them were among the most significant models in Sam β ada(Table S3). These three SNPs occur close to each other in chromosome five.Sam β ada’s multivariate analysis, which included the population structure as an envi-ronmental variable (see Variable selection for multivariate models in Material and Meth-ods), identified 84 significant bivariate models, corresponding to 29 loci. In Sam β ada’sframework, this means these models provided a significantly more accurate estimationof the genotype’s frequency than their univariate parents, while at least one of theirparents was also significant. Among those, three models that included the “populationstructure” variable also had a parent model showing a significant association with thisvariable. Therefore, although the population structure partly explains the distributionof these genotypes, adding an environmental variable provided a significantly more ac-curate estimation of their distribution ( p ≤ . · − ). These models correspond to5hree loci that were detected by all correlative approaches. Global and local indicators of spatial autocorrelation were computed for two genotypeswith a weighting scheme based on the 20 nearest neighbours: ARS-BFGL-NGS-113888(hereon ARS-113) (allele GG), which was detected by Sam β ada with the highest G scoreand was also detected by BayEnv, was compared with Hapmap28985-BTA-73836 (hereon HM-28) (allele GG), which was detected by Sam β ada, BayEnv and LFMM. Logis-tic models significantly associated both genotypes with isothermality, which measuresthe stability of temperature during the year. Figure 1 shows local indices of spatialautocorrelation for these 2 genotypes. On the one hand, ARS-113 (GG) was positivelyautocorrelated for the majority of points and the indicator was significant for half ofthem. Thus the distribution of this marker shows spatial dependence, non-significantassociations were found at the edge of Lake Victoria and in a corridor in the North ofthe Lake with some occurrences in the West of Uganda. This widespread pattern ofspatial autocorrelation could originate from the underlying population structure, sinceAnkole cattle are more common in the South-West while zebus are more common in theNorth-East of the country. Thus the correlation detected by logistic regressions betweenARS-113 (GG) and environmental variables could be spurious and due to demographicfactors. On the other hand, the local indicators of spatial association of HM-28 (GG)showed lower values in general and were only significant in the North-West of Uganda.This particular region also showed the lowest values of isothermality in Uganda, i.e. ahigh variability of temperatures. The low value of spatial autocorrelation for HM-28(GG) implies that the distribution of this genotype was mostly independent from thelocation and this supports a possible adaptive origin of the observed correlation betweenHM-28 (GG) and isothermality with logistic models. This correlation between HM-28(GG) and isothermality also appeared with bivariate LISAs, where the presence of thegenotype was compared to the mean value of isothermality among neighbouring points(not shown). The key features of Sam β ada are the multivariate modelling and the measure of spa-tial autocorrelation. Both can support the interpretation of results in case the datasetfeatures population structure. Bivariate models may indeed include the global ancestrycoefficients provided by a preliminary analysis. This setup can detect which loci are cor-related with the environment while taking demography into account. Additionally, theintroduction of measurements of spatial autocorrelation into these analyses integratesspatial statistics with landscape genomics. Contrary to most current and non-spatialapproaches (e.g. Frichot et al. , 2013; Coop et al. , 2010), Sam β ada allows the determi-nation of whether the observed data reflects independent samples, a requirement of theunderlying statistical model. Measuring spatial autocorrelation assesses whether the oc-6 a) ARS-113 (GG) (b) HM-28 (GG) Figure 1: Local indicators of spatial association of markers ARS-113 (allele GG) andHM-28 (allele GG). The weighting scheme is based on the 20 nearest neighbours. Redpoints tend to be similar to their neighbours while blue points differ from them. Yellowpoints are independent from their neighbourhood. Small points indicate non-significantvalues ( p > . β ada detected the highest number of SNPs as potentiallysubject to selection among the four approaches. However when comparing the positionsof these SNPs, 1,029 of them were less than 100,000 base pairs apart from another de-tected locus, thus some of these detections might refer to the same signature of selection.Sam β ada’s results partially match with those of BayEnv with 435 common SNPs (i.e.22% of BayEnv’s detections). Concerning the third correlative approach, LFMM is moreconservative than Sam β ada but the correspondence is better since 154 loci (out of 280,i.e. 55% of LFMM’s detections) are detected by both methods. Moreover, 25 SNPs de-tected by LFMM only are less than 100,000 base pairs apart from a locus detected bySam β ada, potentially identifying the same selection signature. The order of detectionsdiffered between the two methods, as the most significant loci detected by Sam β adaare ignored by LFMM. Lastly, Arlequin’s best results involved 17 SNPs with p -valueslower than 10 − (significance threshold: α = 2 . · − ), out of which 2 were commonwith Sam β ada and 16 were common with BayEnv. This result suggests that population-based methods, whether using outliers or environmental correlations, tend to detect thesame selection signatures. On the one hand, Sam β ada’s detection rate may indicatethe occurrence of some false positives due to population structure; on the other hand,the discrepancy between the results may indicate that the more conservative approacheshave some false negatives. Comparing the results in the light of spatial dependence givesinformation about the differences between Sam β ada’s and LFMM’s detections. Maps oflocal spatial autocorrelation for ARS-113 (GG) and HM-28 (GG) illustrated a generaltrend: LFMM discarded SNPs showing significant local spatial autocorrelation for a largeproportion of the sampling locations, while Sam β ada detected them. Thus measuringlocal autocorrelation of candidate genotypes may help distinguishing between the effectsof local adaptation and those of population structure among Sam β ada’s detections.Regarding common detections, the three SNPs identified by Sam β ada when popula-tion structure was included as a covariate were among the common detections of correla-tives approaches. Thus pre-existent knowledge on demography may be built on to refinecorrelation-based detections of selection signatures. One possible approach could consistof computing population structure and then including one variable summarising thisstructure in the constant model used by Sam β Glossina spp.) occur in the country (Abila8 t al. (2008); MAAIF et al. , 2010). Hence the risk of cattle trypanosomiasis is high inthis region and the detected mutations may be involved in parasite resistance.The increasing availability of large molecular datasets raises challenges regardingtheir analysis. Correlative approaches in landscape genomics enable fast detection ofcandidate loci to local adaptation. However these methods must take into account theeffect of population structure (Frichot et al. , 2013; Joost et al. , 2013; De Mita et al. , 2013).Limited dispersal of individuals leads to spatial autocorrelation of marker frequencies,which may cause spurious correlations with the environment. Sam β ada addresses thefirst topic by detecting rapidly selection signatures and the second one by measuringthe level of spatial autocorrelation for candidate loci. The next methodological step in-volves developing spatially-explicit models that directly include autocorrelation. Guillot et al. (2014) provide such a model, however the current R-based implementation doesnot enable whole-genome analysis. Alternatively Geographically Weighted Regressions(GWR) measure the spatial stationarity of regression coefficients by fitting a distinctmodel for each sampling location. The number of neighbouring points considered foreach sampling location is given by the weighting scheme. These models allow some “lo-cal” coefficients to differ between sampling points while some “global” coefficients arecommon to all points (Fotheringham et al. , 2002; Joost et al. , 2013). Thus GWR enablesbuilding a null model where the constant term may vary in space and then refining itby adding a global environmental effect for all locations. Comparing these two modelswould enable an assessment of whether the global environmental effect is needed to de-scribe the distribution of the genotype. The key advantage of allowing the constant termto vary in space is to take spatial autocorrelation into account in the models. This way,GWR allows an investigation of the spatial behaviour of loci showing selection signaturewith standard logistic regressions and may help to distinguish between local adaptationand population structure in landscape genomics. However GWR models require a fine-tuning of the weighting scheme from the user, which restrains their application to verylarge datasets.The recent availability of whole genome sequence (WGS) data also raises issues re-garding the statistical assessment of multiple comparisons. Indeed, while many indi-viduals and few genetic markers were available ten years ago, the current high costs ofWGS limit the number of sequenced samples. Therefore standard procedures for multi-ple comparisons, such as the Bonferroni correction, are over-conservative and may leadto discard some adaptive loci. In this context, alternatives procedures focus on con-trolling the ratio of false positives in a set of significant results. Among them, Storeyand Tibshirani’s false discovery rate (2003) was especially designed for large moleculardatasets and suits any detection method relying on significance tests. Its implementationin Sam β ada is ongoing.Computation time is critical when processing large datasets. In this context, Sam β adais able to swiftly analyse high-density SNP-chips and variants from whole-genome se-quencing (e.g. the case study presented in here is analysed within 69 minutes for uni-variates models alone and 8.5 hours for both univariate and bivariate models using asingle core at 3.1 GHz on a desktop computer with 8 GB of RAM). When considering9ingle-process computations, Sam β ada is approximately 4.5 times quicker than LFMMand 30 times than BayEnv (see table S2). Both Sam β ada and LFMM enable parallelisedprocessing. Sam β ada’s processing speed, combined with its ability to analyse the spa-tial autocorrelation in molecular data and to incorporate prior knowledge on populationstructure, suits a wide range of applications, especially those involving whole genomesequence data. Sampling was designed to cover the whole country, including each eco-geographic region,and to obtain a homogeneous distribution of individuals across the country. A regulargrid made of 51 cells of 70 x 70 km was produced to this end. The field campaign tookplace between March 2011 and January 2012. On average, four farms were visited ineach cell and four unrelated individuals were selected from each farm, for a total of 917biological samples retrieved from 202 farms. Recorded information also included thelocation of the farm, the name of the breed, a picture and morphological informationon each individual. These elements were stored in a database accessible through a Webinterface, enabling real-time monitoring of the sampling campaign.
Out of the 917 individuals, 813 samples were genotyped with a medium-density SNP chip(54,609 SNPs, BovineSNP50 BeadChip, Illumina Inc., San Diego, CA). Only markerslocated on the autosomal chromosomes were considered in the analyses. The datasetwas filtered with PLINK (Purcell et al. , 2007) with a call rate set to 95% for bothindividuals and SNPs, and a minimum allele frequency (MAF) set to 1%. The resultingdataset contains 804 samples and 40,019 SNPs.
Population structure was analysed with Admixture (Alexander et al. , 2009), which es-timates maximum likelihood of individual ancestries from multilocus SNP genotypedatasets. This approach assumes that samples descend from a predefined number ofancestor populations that are mixing. Admixture estimates both the fraction of eachsample coming from each population and the marker frequencies in these populations.The optimal number of populations is assessed by a k -fold cross-validation procedure.The best partition of the dataset consisted of four populations, although the vastmajority of the samples were allocated to two clusters (almost 96%) on the basis of theancestry coefficients (Figure S1). Mapping these coefficients revealed the two clusters(340 and 431 individuals out of 804) occurred in the South-West and North-East of10ganda respectively. Using pictures of sampled individuals, the first cluster was iden-tified as Ankole cattle and the second one as zebu. The remaining two clusters (32animals) possibly represent introgression from allochthonous gene pools. These resultswere used to define the parameters needed by each method to detect selection signatures. The geographical coordinates of the individuals sampled enabled the characterisation oftheir habitat with the help of the WorldClim dataset containing monthly values of pre-cipitation, minimum, mean and maximum temperature as well as 19 derived variables,at 1km resolution (Hijmans et al. , 2005). This dataset provides appropriate data as itconsists of representative climate information collected during 30 years (WMO standardclimate normal, Arguez and Vose, 2010) and its high resolution suited the scale of ourstudy. These environmental variables were originally stored in four tiles (portions ofmap) which were pasted using the Geospatial Data Abstraction Library (GDAL De-velopment Team, 2013) and a customized Python script. The topography is describedby the 90m resolution SRTM3 (Shuttle Radar Topography Mission) digital elevationmodel (Farr et al.
Variable selection for univariate models
Considering all environmental variablesin the computation of the multiple logistic regressions would have provided a compre-hensive analysis with a low risk of missing detections. Nonetheless some variables arehighly correlated; this implies dependence between models and increases the variance ofparameters in multivariate analyses. Thus we used the Variance Inflation Factor (VIF)to control for multicollinearity (Dobson and Barnett, 2008). A maximum VIF of 5 cor-responds to a coefficient of correlation of 0.9 between pairs of variables. The numberof variables was reduced iteratively by randomly removing one of the two most highlycorrelated variables until the maximum correlation was lower than the threshold (0.9).This procedure led to a set of 23 environmental variables that were used for univariatelandscape genomic analysis (table S1).
Variable selection for multivariate models
Bivariate models were also processedwith Sam β ada to assess the effect of a combination of predictors, and to take the popula-tion structure into account. This information was derived from the analysis of individualancestries (see Population structure). Since there were two main populations of Ugan-dan cattle, the associated coefficients of membership were generally complementary anda single coefficient was sufficient to represent the origin of each individual. Thus a newvariable “population structure” was defined as the coefficient of membership of each11ndividual to the population Ankole. This variable was used to represent the ancestryof each sample in Sam β ada’s analysis. It was added to the set of 23 variables and thecorrelation-based variable selection method was reapplied to limit the VIF to 5. On thisbasis, fifteen environmental variables were considered for Sam β ada bivariate analysis(table S1). β ada implementation Sam β ada is a standalone application written in C++. The application was developedusing the Scythe Statistical Library (Pemstein et al. , 2011) for matrix computation andprobability distributions, which was also chosen for its straightforward application pro-gramming interface (API). Sam β ada is distributed under an open source GNU GeneralPublic License license in order to ease its use for research and teaching. Two complementary versions of the software were developed: a desktop option-rich pro-gram well suited to small to medium-sized datasets, and a High Performance Computingversion dedicated to large datasets.
Desktop version (Sam β ada) Sam β ada includes multivariate analysis and spatialautocorrelation. Many options are provided to facilitate the formatting of the data andto customise the analysis. For instance, the significance of models is assessed duringthe analysis and non-significant associations can be discarded. Moreover models can besorted according to their scores before writing the results in order to make it possibleto directly be in a position to interpret them. All results presented in this paper wereprocessed with Sam β ada Desktop. Parallel computing version (CoreSAM)
When processing large datasets, primaryanalysis usually focuses on univariate models. Multivariate models and spatial autocor-relation may be considered as a second step, but are too computationally intensive tobe applied to the whole dataset. In order to speed-up the process, CoreSAM is a lightversion of Sam β ada, written in C, which focuses on univariate analysis. Compared withSam β ada, fewer options are available, but the computation is up to seven times faster.Combining Sam β ada and CoreSAM, large datasets may be analysed by steps: Uni-variate logistic models identify candidate loci exhibiting selection signatures. These locimay be then investigated in the light of spatial autocorrelation measures and multi-variate models. The former may point out whether the observed correlation is due tosimilarities between neighbours, while the latter allows including the population struc-ture, if any, in the model in order to assess whether the environmental variable providessupplementary information on the marker frequency when taking the demography intoaccount. 12 .2.2 Modules Sam β ada includes several modules that enhance interfacing with other programs. Geovisualization of spatial statistics
Sam β ada provides an option to save thespatial autocorrelation results as a shapefile (.shp), a common format for storing vectorinformation in Geographic Information Systems (GIS). This feature relies on the shplibopen source library ( http://shape-lib.maptools.org ). Recoding molecular data
Sam β ada is distributed with a utility for recoding molec-ular data into binary information. Currently RecodePlink handles ped/map files, aformat for SNP data used by PLINK (Purcell et al. , 2007). Supervision
For very large molecular datasets, Sam β ada provides a module to shareworkload between computers. Supervision splits the input data in several files that canbe processed separately, even on heterogeneous computers. Lastly, Supervision mergesthe results to provide the same output as if the whole dataset had been processed atonce. For comparison purpose, three alternative approaches to Sam β ada were used to detectsignatures of selection in Ugandan cattle data. Two of these are correlative approaches(BayEnv and Latent Factor Mixed Models, Coop et al. , 2010; Frichot et al. , 2013), whilethe third is an outlier-detection population genetics approach (Beaumont and Nichols,1996) included in Arlequin 3.5 (Excoffier and Lischer, 2010). These methods considerallele counts whereas Sam β ada recodes them into genotypes. BayEnv uses a Bayesian approach to detect candidate SNPs under selection while ac-counting for the inherent correlation in allele frequencies across populations due to shareddemographic history (Coop et al. , 2010). BayEnv first uses a set of neutral loci to build anull model robust to demographic history, against which an alternative model includingan environmental variable is compared. Markers exhibiting the highest Bayes factorsare potentially subject to selection. In this study the set of neutral loci was chosen atrandom among loci identified as neutral by Sam β ada. LFMM is a Bayesian individual-based approach to detect selection in landscape genomics(Frichot et al. , 2013). Population structure is added into the model via unobservedvariables. Thus the significance of the association between genome and environment canbe assessed while taking into account the effect of the population structure. The number13f latent factors (unobserved variables) must be specified for the analysis. Although thisnumber is related to the population structure, it is usually lower than the number ofpopulations (Frichot personal communication).
Arlequin is a comprehensive software for population genetics analyses (Excoffier and Lis-cher, 2010) that includes an outlier-based method to detect signatures of selection (Beau-mont and Nichols, 1996). This approach assumes an island model, where individuals aresampled in distinct populations that exchange migrants. Each locus is characterised bythe fixation index F ST (Wright, 1949) and its heterozygosity. Neutral coalescent simu-lations are used to estimate the distribution of F ST conditional on heterozygosity. Lociexhibiting extreme values of F ST are candidate targets of selection. Sam β ada is an open source software written in C++ available at http://lasig.epfl.ch/sambada (under the license GNU GPL 3). Compiled versions are provided for Win-dows, Linux and MacOS X. NextGen data are described at http://projects.ensembl.org/nextgen/ . Ugandancattle SNP data are available at ftp://ftp.ebi.ac.uk/pub/databases/nextgen/bos/variants/chip_array/ in PLINK format (files UGBT.bovineSNP50.UMD3_1.20140307.[ped/map].gz)with the following data policy ftp://ftp.ebi.ac.uk/pub/databases/nextgen/documentation/README_data_use_policy . Supplementary material, figure S1, and tables S1–S3 are available at Molecular Biologyand Evolution online ( ). Funding:
This research was funded by EU FP7 project NextGen (Grant KBBE-2009-1-1-03).The authors would like to thank Sergio Rey for his advice on assessing the significanceof LISA, Stephan Morgenthaler for fruitful discussions on assessing the significance ofmultivariate logistic models and Gilles Guillot for providing them with SGLMM fortesting purposes. 14 eferences
Abila, P. P., Slotman, M. A., Parmakelis, A., Dion, K. B., Robinson, A. S., Muwanika,V. B., Enyaru, J. C. K., Lokedi, L. M., Aksoy, S., and Caccone, A. 2008. Highlevels of genetic differentiation between ugandan
Glossina fuscipes fuscipes populationsseparated by lake kyoga.
PLOS Neglected Tropical Diseases , 2(5): e242.Ajmone Marsan, P., Garcia, J. F., Lenstra, J. A., and Consortium, G. 2010. On theorigin of cattle: How aurochs became cattle and colonized the world.
EvolutionaryAnthropology , 19(4): 148–157.Alexander, D. H., Novembre, J., and Lange, K. 2009. Fast model-based estimation ofancestry in unrelated individuals.
Genome Research , 19(9): 1655–1664.Anselin, L. 1995. Local Indicators of Spatial Association - LISA.
Geographical Analysis ,27(2): 93–115. GISDATA (Geographic Information Systems Data) Specialist Meetingon GIS (Geographic Information Systems) and Spatial Analysis, Amsterdam, Nether-lands, Dec 01-05, 1993.Arguez, A. and Vose, R. S. 2010. The definition of the standard WMO climate normal:The key to deriving alternative climate normals.
Bulletin of the American Meteoro-logical Society , 92(6): 699–704.Beaumont, M. A. and Nichols, R. A. 1996. Evaluating loci for use in the genetic analysisof population structure.
Proceedings Royal Society London B , 263: 1619–1626.Colli, L., Joost, S., Negrini, R., Nicoloso, L., Crepaldi, P., Ajmone-Marsan, P., andConsortium, E. 2014. Assessing the spatial dependence of adaptive loci in 43 Europeanand Western Asian goat breeds using AFLP markers.
PLOS ONE , 9(1).Coop, G., Witonsky, D., Di Rienzo, A., and Pritchard, J. K. 2010. Using environmentalcorrelations to identify loci underlying local adaptation.
Genetics , 185(4): 1411–1423.De Mita, S., Thuillet, A.-C., Gay, L., Ahmadi, N., Manel, S., Ronfort, J., and Vigouroux,Y. 2013. Detecting selection along environmental gradients: analysis of eight methodsand their effectiveness for outbreeding and selfing populations.
Molecular Ecology ,22(5): 1383–1399.Dobson, A. J. and Barnett, A. G. 2008.
An Introduction to Generalized Linear Models .Chapman & Hall, 3 edition.Excoffier, L. and Lischer, H. E. L. 2010. Arlequin suite ver 3.5: a new series of programsto perform population genetics analyses under Linux and Windows.
Molecular EcologyRessources , 10(3): 564–567.Farr, T. G., Rosen, P. A., Caro, E., Crippen, R., Duren, R., Hensley, S., Kobrick,M., Paller, M., Rodriguez, E., Roth, L., Seal, D., Shaffer, S., Shimada, J., Umland,15., Werner, M., Oskin, M., Burbank, D., and Alsdorf, D. 2007. The shuttle radartopography mission.
Reviews of Geophysics , 45(2).Foll, M. and Gaggiotti, O. 2008. A genome-scan method to identify selected loci appro-priate for both dominant and codominant markers: A Bayesian perspective.
Genetics ,180: 977–993.Fotheringham, A. S., Brunsdon, C., and Charlton, M. 2002.
Geographically WeightedRegression: the analysis of spatially varying relationships . John Wiley & Sons, Chich-ester, 1 edition.Frichot, E., Schoville, S. D., Bouchard, G., and François, O. 2013. Testing for associationsbetween loci and environmental gradients using latent factor mixed models.
MolecularBiology and Evolution , 30(7): 1687–1699.GDAL Development Team 2013.
GDAL - Geospatial Data Abstraction Library, Version1.10 . Open Source Geospatial Foundation.Guillot, G., Vitalis, R., le Rouzic, A., and Gautier, M. 2014. Detecting correlationbetween allele frequencies and environmental variables as a signature of selection. Afast computational approach for genome-wide studies.
Spatial Statistics , 8: 145–155.Günther, T. and Coop, G. 2013. Robust identification of local adaptation from allelefrequencies.
Genetics , 195(1): 205–220.Henry, P. and Russello, M. A. 2013. Adaptive divergence along environmental gradientsin a climate-change-sensitive mammal.
Ecology and Evolution , 3(11): 3906–3917.Hijmans, R., Cameron, S., Parra, J., Jones, P., and Jarvis, A. 2005. Very high reso-lution interpolated climate surfaces for global land areas.
International Journal OfClimatology , 25(15): 1965–1978.Jones, M., Forester, B., Teufel, A., Adams, R., Anstett, D., Goodrich, B., Joost, S., andManel, S. 2013. Integrating spatially explicit approaches to detect adaptive loci in alandscape genomics context.
Evolution , 67: 3455–3468.Joost, S., Bonin, A., Bruford, M. W., Després, L., Conord, C., Erhardt, G., and Taberlet,P. 2007. A spatial analysis method (SAM) to detect candidate loci for selection:towards a landscape genomics approach to adaptation.
Molecular Ecology , 16(18):3955–3969.Joost, S., Kalbermatten, M., and Bonin, A. 2008. Spatial Analysis Method (SAM): asoftware tool combining molecular and environmental data to identify candidate locifor selection.
Molecular Ecology Ressources , 8: 957–960.Joost, S., Vuilleumier, S., Jensen, J. D., Schoville, S., Leempoel, K., Stucki, S., Widmer,I., Melodelima, C., Rolland, J., and Manel, S. 2013. Uncovering the genetic basis ofadaptive change: on the intersection of landscape genomics and theoretical populationgenetics.
Molecular Ecology , 22(14): 3659–3665.16v, F.-H., Agha, S., Kantanen, J., Colli, L., Stucki, S., Kijas, J. W., Joost, S., Li, M.-H.,and Ajmone Marsan, P. 2014. Adaptations to climate-mediated selective pressures insheep.
Molecular Biology and Evolution , advanced access.Ministry of Agriculture, Animal Industry and Fisheries, Uganda, Uganda Bureau ofStatistics, Food and Agriculture Organization of the United Nations, InternationalLivestock Research Institute, and World Resources Institute 2010.
Mapping a betterfuture: Spatial Analysis and Pro-Poor Livestock Strategies in Uganda , pages 30–37.World Resources Institute, Washington, DC and Kampala.Mitton, J. B., Linhart, Y. B., Hamrick, J. L., and Beckman, J. S. 1977. Observationson genetic structure and mating system of ponderosa pine in Colorado front range.
Theoretical and Applied Genetics , 51(1): 5–13.Moran, P. A. P. 1950. Notes on continuous stochastic phenomena.
Biometrika , 37(1/2):pp. 17–23.Ndumu, D. B., Baumung, R., Hanotte, O., Wurzinger, M., Okeyo, M. A., Jianlin, H.,Kibogo, H., and Soelkner, J. 2008. Genetic and morphological characterisation ofthe Ankole Longhorn cattle in the African Great Lakes region.
Genetics SelectionEvolution , 40(5): 467–490.Pemstein, D., Quinn, K. M., and Martin, A. D. 2011. The Scythe statistical library: Anopen source C++ library for statistical computation.
Journal of Statistical Software ,42(12): 1–26.Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller,J., Sklar, P., de Bakker, P. I., Daly, M. J., and Sham, P. C. 2007. PLINK: A tool setfor whole-genome association and population-based linkage analyses.
The AmericanJournal of Human Genetics , 81(3): 559 – 575.Storey, J. D. and Tibshirani, R. 2003. Statistical significance for genomewide studies.
Proceedings of the National Academy of Sciences of the United States of America ,100(16): 9440–9445.Vitalis, R., Dawson, K., Boursot, P., and Belkhir, K. 2003. DetSel 1.0: A computerprogram to detect markers responding to selection.
Journal of Heredity , 94(5): 429–431.Vitti, J. J., Grossman, S. R., and Sabeti, P. C. 2013. Detecting Natural Selection inGenomic Data. In Bassler, BL and Lichten, M and Schupbach, G, editor,
AnnualReview of Genetics, Vol 47 , volume 47 of
Annual Review of Genetics , pages 97–120.Annual Reviews, 4139 EL CAMINO WAY, PO BOX 10139, PALO ALTO, CA 94303-0897 USA.Wright, S. 1949. The genetical structure of populations.
Annals of Eugenics , 15(1):323–354. 17 igh performance computation of landscape genomicmodels integrating local indices of spatial association
Supplementary material
Several indices are available for measuring the global spatial autocorrelation in a dataset. Sam β adauses the Moran’s I (Moran, 1950), defined as follows: I = nS P ni =1 P nj =1 w ij ( y i − ¯ y )( y j − ¯ y ) P nk =1 ( y k − ¯ y ) = nS P ni =1 P nj =1 w ij z i z j P nk =1 z k (1)with n number of sampling points; w ij weight of point j in the neighbourhood of i, defined by the spatial kernel; S sum of all weights (cid:16) S = P ni =1 P nj =1 w ij (cid:17) ; y i , y j values for points i and j;¯ y mean value; z i , z j deviations from the mean.Please note that the weight matrix is usually normalized per line (cid:16)P nj =1 w ij = 1 (cid:17) , so that S is equal to n . Thus equation 1 can be rewritten as: I = P ni =1 P nj =1 w ij z i z j P nk =1 z k (2)where w ij is the normalized weight of point j in the neighbourhood of i.Local indices of spatial association (LISA, Anselin, 1995) measure the local association be-tween the value of a point and the neighbouring points. Sam β ada computes a local variant ofthe Moran’s I for each point i : I i = n − P nk =1 z k z i n X j =1 w ij z j (3)LISAs are defined in such a way that their sum over all points is proportional to a globalmeasure of spatial autocorrelation (here the Moran’s I , Anselin, 1995): I = 1 n − n X i =1 I i (4)1 a r X i v : . [ q - b i o . P E ] N ov Environmental variables
Variable Description Data source Used forunivariateanalysis Used forbivariateanalysisalt_SRTM Altitude [m] SRTM3 Xaspect Orientation of the relief [ ◦ ] Derived fromSRTM3 X XBIO2 Mean Diurnal Range WorldClim X X(Mean of monthly (max temp -min temp))BIO3 Isothermality (BIO2/BIO7) (*100) X XBIO7 Temperature Annual Range(max temp - min temp) XBIO9 Mean Temperature of DriestQuarter XBIO12 Annual Precipitation X XBIO15 Precipitation Seasonality(Coefficient of Variation) X XBIO18 Precipitation of WarmestQuarter X Xlatitude Latitude Samplingmeasurements X Xlongitude Longitude X Xprec2 Precipitation in February WorldClim Xprec3 Precipitation in March Xprec4 Precipitation in April X Xprec5 Precipitation in May X Xprec6 Precipitation in June Xprec7 Precipitation in July Xprec9 Precipitation in September Xprec10 Precipitation in October X Xprec11 Precipitation in November X Xslope Slope of the relief [%] Derived fromSRTM3 X Xtmin10 Minimal temperature in October WorldClim X Xtmax10 Maximal temperature in October XAnkole Coefficient of ancestry to thepopulation Ankole Analysis withAdmixture X Number of variables
23 15Table S1: Environmental variables used to detect selection signatures with correlative ap-proaches. Univariate analyses were performed with Sambada, BayEnv and LFMM and bivariateanalyses with Sambada 2
Population structure
Figure S1: Population structure computed with Admixture (Alexander et al. , 2009). Individualsare gathered together by populations, labeled horizontally. The assignation is based on thehighest membership coefficient Q max of the sample. Inside each population, individuals areranked by increasing (or decreasing) value of Q max . β ada 1.2 2.9Sam β ada biv. 8.7 18.4BayEnv 41.3 62.,2LFMM 3.2 16.0LFMM (mono) 6.1 58.1Table S2: Comparison of computation times among methods. The datasets used in this case in-clude chromosome X. The first column refers to the data used in the article (SNPs and individualcall rates=5%, MAF=1%, including chr. X). The second column refers to an unpublished datasetof 102 samples of Ugandan cattle that were chosen among the 917 samples to be genotyped witha high-density SNP chip (BovineHD, Illumina Inc; SNPs and ind. call rates=5%, MAF=5%).Durations are expressed in hours. “LFMM (mono)” shows the duration if using a single thread.Analyses were run on a desktop computer with an 8-core CPU at 3.1 GHz and 8 GB of RAM.No computation time is available for Arlequin. 3 Comparison of detections
Loci Chr. Pos. [Mbp] S a m β a d a B a y E n v L F MM A r l e q u i n Detections1 ARS-BFGL-NGS-113888 5 48.32 1 1 0 0 22 Hapmap41074-BTA-73520 5 48.35 1 1 0 0 23 Hapmap41762-BTA-117570 5 18.94 1 1 0 0 24 ARS-BFGL-NGS-46098 20 2.95 1 1 0 0 25 Hapmap41813-BTA-27442 5 49.04 1 1 0 0 26 BTA-73516-no-rs 5 48.75 1 1 0 0 2
11 Hapmap50523-BTA-98407 5 46.74 1 1 0 0 212 BTB-01400776 20 2.70 1 1 0 0 213 Hapmap23956-BTA-36867 15 47.20 1 1 0 0 214 ARS-BFGL-NGS-10586 2 128.64 1 1 0 0 215 ARS-BFGL-NGS-43694 5 49.65 1 1 0 0 216 BTA-122374-no-rs 14 16.44 1 1 0 0 217 BTB-01356178 20 2.49 1 1 0 0 2
18 ARS-BFGL-NGS-94862 11 103.53 1 1 1 0 3
19 BTA-108359-no-rs 14 16.31 1 0 0 0 120 ARS-BFGL-NGS-15960 5 28.02 1 1 0 0 221 ARS-BFGL-NGS-116294 2 128.58 1 1 0 0 222 INRA-566 13 57.94 1 0 1 0 2
23 BTA-49720-no-rs 5 69.66 1 1 1 0 3
24 ARS-BFGL-NGS-56387 13 24.36 1 1 0 0 225 BTA-28185-no-rs 26 22.78 1 0 0 0 1Table S3: List of SNPs detected by Sam β ada corresponding to the models with the highest G scores. Loci are identified by their name, their chromosome and their position in million basepairs. The following columns show which method detected them and the last one counts thesedetections. Loci in bold type are the commons discoveries of Sam β ada, LFMM and BayEnv.Local indices of spatial autocorrelation were computed for SNPs on lines 1 and 7. References
Alexander, D. H., Novembre, J., and Lange, K. 2009. Fast model-based estimation of ancestryin unrelated individuals.
Genome Research , 19(9): 1655–1664.Anselin, L. 1995. Local Indicators of Spatial Association - LISA.
Geographical Analysis , 27(2):93–115. GISDATA (Geographic Information Systems Data) Specialist Meeting on GIS (Ge-ographic Information Systems) and Spatial Analysis, Amsterdam, Netherlands, Dec 01-05,1993. 4oran, P. A. P. 1950. Notes on continuous stochastic phenomena.