The GenoChip: A New Tool for Genetic Anthropology
Eran Elhaik, Elliott Greenspan, Sean Staats, Thomas Krahn, Chris Tyler-Smith, Yali Xue, Sergio Tofanelli, Paolo Francalacci, Francesco Cucca, Luca Pagani, Li Jin, Hui Li, Theodore G. Schurr, Bennett Greenspan, R. Spencer Wells, Genographic Consortium
Affiliations: Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, 615 N. Wolfe Street, Baltimore, MD 21205, USA; Family Tree DNA, Houston, TX 77008, USA; The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK; Department of Biology, University of Pisa, Italy; Department of Natural and Environmental Science, Evolutionary Genetics Lab, University of Sassari, Italy; National Research Council, Monserrato, Italy; Division of Biological Anthropology, University of Cambridge, UK; Fudan University, Shanghai, China; University of Pennsylvania, Philadelphia, PA; National Geographic Society, Washington DC, USA
*Please address all correspondence to Spencer Wells at [email protected]
Abstract
The Genographic Project is an international effort using genetic data to chart human migratory history. The project is non-profit and non-medical, and through its Legacy Fund supports locally led efforts to preserve indigenous and traditional cultures. While the first phase of the project was focused primarily on uniparentally-inherited markers on the Y-chromosome and mitochondrial DNA, the next is focusing on markers from across the entire genome to obtain a more complete understanding of human genetic variation. In this regard, genomic admixture is one of the most crucial tools that will help us to analyze the genetic makeup and shared history of human populations. Although many commercial arrays exist for genome-wide SNP genotyping, they were designed for medical genetic studies and contain medically related markers that are not appropriate for global population genetic studies. GenoChip, the Genographic Project’s new genotyping array, was designed to resolve these issues and enable higher-resolution research into outstanding questions in genetic anthropology. We developed novel methods to identify ancestry informative markers (AIMs) and genomic regions that may be enriched with alleles shared with ancestral hominins. Overall, we collected and ascertained AIMs from over 450 populations. Containing an unprecedented number of Y-chromosomal and mtDNA SNPs and over 130,000 SNPs from the autosomes and X-chromosome, the chip was carefully vetted to avoid inclusion of medically relevant markers. The GenoChip results were successfully validated by genotyping more than 500 samples from the 1000 Genomes Project and private collections. To demonstrate its capabilities, we compared the F_ST distributions of GenoChip SNPs to those of two commercial arrays for three continental populations. While all arrays yielded similarly shaped (inverse J) F_ST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean F_ST (0.10 and 0.13, respectively), attesting to its ability to discern subpopulations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology and promises to be the most powerful tool available for assessing population structure and migration history.
Introduction
Apportionment of human genetic variation has long established that all living humans are related via recent common ancestors who lived in sub-Saharan Africa about 200,000 years ago [1]. The world outside Africa was settled over the past 50,000-100,000 years [2,3] when the descendants of our African forebears spread out to populate other continents [4]. While genome-wide data depict a single major dispersal of modern humans from Africa, recent findings suggest at least two archaic admixture (interbreeding) events with extinct hominins, one with Neanderthals in Eurasia and the second with Denisovans in Southeast Asia [5,6]. The recurrent interbreeding, admixture, and migration events shaped modern populations into mosaics of ancient and recent haplotypes, varying only by the proportion of their ancestral genetic blocks, not in the building blocks themselves. Passed down the sex lines nearly unaltered from one generation to the next, these blocks carry the story of our most recent common ancestors' journeys. Although populations are highly similar, the subtle genetic differences between them are like breadcrumbs that allow us to trace the human past through mixing events and migrations. The advent of next-generation sequencing technology that enabled the rapid genotyping of hundreds of thousands of genetic markers revolutionized genetic anthropology and launched several human sequencing projects, in the process broadening our understanding of genetic diversity and migration history [e.g., 7,8,9,10]. The rapidly escalating interest in ancestral histories of foreign continents and geographical regions, and the greater antiquity of populations from Africa [11], India [12], the Americas [13], the Caucasus [14], and Oceania [15], have already yielded deeper splits among these regions and complex patterns of past migration and admixture events. Despite this tremendous progress, many unresolved problems hinder progress in the field. 
First, only a handful of the estimated 6,000 global population groups have been genotyped and studied, which limits the phylogeographic resolution of the findings. Second, the plethora of genetic markers obtained from different genotyping platforms has resurrected the “empty matrix” problem, whereby populations from different studies can barely be compared. Finally, genotyping costs remain prohibitively high and unjustified, as commercial genotyping platforms do not accommodate crucial ancestry informative markers (AIMs) [16]. Furthermore, these arrays are also enriched in trait- or disease-related markers, which prompts a host of psychological, social, legal, political, and ethical concerns at the individual, population, and global levels [17]. The first phase of the Genographic Project focused on reconstructing human migration through the analysis of uniparentally-inherited markers on the Y-chromosome and mitochondrial DNA. This phase yielded three dozen publications that described the population structure of modern populations [e.g., 14], traced migration routes of ancient populations [e.g., 18], improved the resolution of Y and mtDNA phylogenetic trees [e.g., 19,20,21], and attempted to identify genetic relatedness to historical figures [e.g., 22,23]. The project has not only been successful in inferring details of human migration histories, but also in attracting public participants interested in testing their own DNA to better understand their genetic ancestry. As of December 2012, over 500,000 public participants from more than 130 countries have purchased Genographic public participation kits, of which over 200,000 have submitted their results to an anonymous database to be used for research purposes. 
The success of the Genographic Project has also helped to spawn a broader interest in what has been dubbed “genetic genealogy.” Unlike other fields of science where the knowledge and excitement of new discoveries are often confined to members of the scientific community, genealogy, and particularly genetic genealogy, has a very large, devoted, supportive, and highly active public community. Since the 1970s, genealogy has become one of the most popular avocations, enjoyed by millions of people around the world seeking to trace their ancestry and learn about the geographical regions in which their ancestors lived [24]. While traditional genealogy requires written records, which only extend several generations into the past at most, genetic genealogy is easily available to everyone in the form of direct-to-consumer genetic ancestry tests and friendly computational interfaces that visualize and explain the results [25]. The increasing fascination with genetics has prompted entrepreneurs to offer self-test kits and provide information ranging from disease risk and life-style choices to genetic ancestry [26]. Some of the existing solutions have been criticized for making deceptive health-related claims and for providing limited and imprecise answers regarding ancestry [17]. The concerns about ancestry reporting were not unjustified, as these entrepreneurs adopted problematic commercial genotyping platforms for assessing genetic background. While the uniparental markers used in the first phase of the Genographic Project remain useful for studies of human migration history, many questions cannot be answered without assessing patterns of genetic variation in the rest of the genome. Legal and ethical obstacles often hampered autosomal-driven studies due to the difficulty in discerning markers that are informative for genetic anthropology from those having medical relevance. 
Since the launch of the Genographic Project and similar endeavors, a growing number of individuals and populations have expressed serious concerns about the sharing and potential exploitation of their medical information [17,27]. Unfortunately, all commercially available genotyping platforms were optimized for medical genetic work, making them unsuitable for population genetic studies due to privacy concerns and biased allele frequencies. Even the exceptional Affymetrix Human Origins array, which was also designed for population genetic studies [28], not only contains tens of thousands of coding and medically-related markers, but was ascertained based on data from only 12 populations, which may limit its applicability to the study of worldwide genetic diversity.
To reconcile the competing interests in investigating population genetic history and identifying variants contributing to health, we decided to develop a novel SNP array, which we call the GenoChip. Our goals were to design, manufacture, validate, and test a state-of-the-art SNP array dedicated solely to genetic anthropology and to genotype a large number of samples and populations to learn more about human history.
The GenoChip is an Illumina HD iSelect genotyping bead array designed with over 130,000 highly informative autosomal and X-chromosomal markers, ~12,000 Y-chromosomal markers, and ~3,300 mtDNA markers. As AIMs are invaluable tools in population genetics and genetic anthropology for discerning subpopulations, we focused our efforts on collecting the most informative ones [29]. The autosomal and X-chromosomal marker set, comprising over 75,000 AIMs, was ascertained from over 450 worldwide populations, making it the largest and most detailed AIMs panel ever constructed (Figure 1). Half of our AIMs were culled from the literature and private data sets and the rest were calculated using infocalc [30] and AIMsFinder. We applied these two methods to global panels comprising nearly 300 populations assembled from public and private data sets. Many of these populations are unique to our project and have never before been studied or searched for AIMs. We used infocalc to identify AIMs in population panels organized by the source of the data, whereas AIMsFinder was applied in a pairwise fashion over all populations. We excluded SNPs in high LD (r² > 0.4) in all populations, except for hunter-gatherers such as the Hadza and Sandawe of Tanzania [31] and Melanesian populations [10].
To determine the extent of gene flow from Neanderthals and Denisovans to modern humans, we collected from the literature [5,32,33] SNPs and haplotypes from genomic regions bearing evidence of interbreeding. In addition, we used a modified version of IsoPlotter [34] to identify new candidate regions for such gene flow (see Materials and Methods) [...] (r² > 0.8) with any of the other SNPs included in our dataset. Overall, we assembled nearly 170,000 SNPs dedicated to genetic anthropological and genealogical research without any known health, medical, or phenotypic relevance (Table S1).
We validated the results of the GenoChip by genotyping 168 worldwide samples from the 1000 Genomes Project and cross-validating the autosomal genotypes. The concordance rate per sample was over 99.5%. The marginal error rate was expected due to the low coverage of the 1000 Genomes Project data, particularly for rare alleles [7]. An additional 400 samples were genotyped to test the ability of the GenoChip to infer Y-chromosome and mtDNA haplogroups. The average success rates for the paternal and maternal haplogroups were 82% and 90%, respectively (Figure 3). It is very likely that the GenoChip can capture many of the remaining haplogroups that are missing from our reference set, and, for this reason, we are currently conducting a larger genotyping effort to validate them. Overall, we confirmed that the GenoChip produces highly accurate results and has broad coverage for SNPs defining Y-chromosome and mtDNA haplogroups.
To demonstrate that the GenoChip consists of highly informative SNPs suitable for population genetic studies, we compared its performance to that of the Illumina Human660W and Affymetrix Human Origins arrays. 
A comparison of the minor allele frequency (MAF) distributions between the three arrays revealed gross differences in allele frequencies (Figure 4, S1). Due to the high frequency of rare alleles in the HapMap dataset, none of the arrays resembled the shape of the HapMap’s MAF distribution. Nonetheless, the commercial arrays were enriched in rare autosomal SNPs compared to the GenoChip. Similar findings were observed for X-chromosome SNPs, although to a lesser extent, as the GenoChip consists of a higher fraction of extremely rare SNPs as well as common SNPs. The different MAF distributions correspond to the choices of SNP ascertainment made in each project. To assess the extent of genetic diversity that can be inferred among human subpopulations by the different arrays, we next compared their F_ST distributions [45,46,47]. F_ST measures the differentiation of a subpopulation relative to the total population, and is directly related to the variance in allele frequency between subpopulations, such that a high F_ST corresponds to a larger difference between subpopulations [48]. Elhaik [49] used 1 million markers that were genotyped in eight HapMap populations (YRI, LWK, MKK, CEU, TSI, CHB, CHD, and JPT) to carry out a two-level hierarchical F_ST analysis. He showed that the greatest proportion of genetic variation occurred within individuals residing in the same populations, with only a small amount (12%) of the total genetic variation being distributed between continental populations and an even smaller amount (1%) between intra-continental populations. An F_ST distribution for three continental populations employing 3 million HapMap SNPs yielded an even lower estimate (0.08) of the proportion of genetic variation distributed between continental populations due to the large number of rare alleles [49]. F_ST distributions for three continental populations were similarly calculated for the GenoChip and the two commercial arrays. 
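The per-SNP F_ST described here (the variance in allele frequency between subpopulations, scaled by P(1 − P) following Wright [45], as given in Materials and Methods) can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `wright_fst`, the use of NumPy, and the choice of sample sizes as weights for both the mean and the variance are assumptions.

```python
import numpy as np

def wright_fst(freqs, sizes):
    """Per-SNP MAF and F_ST = var(p) / (P * (1 - P)).

    freqs: (n_snps, n_pops) frequencies of one allele in each population.
    sizes: (n_pops,) sample sizes, used as weights (an assumption).
    Returns (maf, fst); fst is NaN for SNPs that are monomorphic overall.
    """
    freqs = np.asarray(freqs, dtype=float)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()

    p_bar = freqs @ w                       # weighted mean allele frequency
    flip = p_bar > 0.5                      # switch to the minor allele
    freqs = np.where(flip[:, None], 1.0 - freqs, freqs)
    p_bar = np.where(flip, 1.0 - p_bar, p_bar)   # now 0 <= P <= 0.5

    # weighted variance of the minor allele frequency among populations
    var = ((freqs - p_bar[:, None]) ** 2) @ w
    denom = p_bar * (1.0 - p_bar)

    fst = np.full_like(p_bar, np.nan)
    np.divide(var, denom, out=fst, where=denom > 0)
    return p_bar, fst

# Hypothetical frequencies for four SNPs in two equally sized populations
maf, fst = wright_fst([[0.0, 1.0], [0.3, 0.3], [0.9, 0.7], [0.0, 0.0]], [10, 10])
```

In this toy input, a SNP fixed for different alleles in the two populations reaches the maximum F_ST, while a SNP with identical frequencies in both yields zero, which matches the interpretation that high F_ST reflects larger differences between subpopulations.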
Although all F_ST distributions were similar in shape to the HapMap F_ST distribution, they differed in their means (Figure 5, S2). For autosomal and X-chromosomal SNPs, the mean F_ST of both the Illumina Human660W and Human Origins arrays was lower than that of the GenoChip due to the high fraction of rare uninformative SNPs in the two commercial arrays. These results may suggest a reduced ability to infer common SNPs that are likely old and correspond to ancient demographic and natural processes [50,51]. The Illumina Human660W array had the highest fraction of low-F_ST alleles, suggesting it is less suitable for population genetic studies compared to the GenoChip and Human Origins. As only half of the Human Origins SNPs could be tested, it is difficult to evaluate its performance. However, we speculate that the large number of rare SNPs, along with those not studied, which are also likely to be rare, may reflect the small number of populations used for its ascertainment and the number of alleles private to these populations. Because the MAF and F_ST were not used as filtering criteria for the GenoChip SNPs, we can conclude that its enrichment toward high-F_ST SNPs mirrors the success of our ascertainment process, emphasizing its potential for population genetic studies.
To summarize, we designed, developed, validated, and tested the GenoChip, the first genotyping chip completely dedicated to genetic anthropology. The GenoChip will help to clarify the genetic relationships between archaic hominins, such as Neanderthals and Denisovans, and modern humans, and provide a more detailed understanding of human migration history. We compared the GenoChip to two commercially available arrays and demonstrated the superior ability of the GenoChip to differentiate subpopulations within global data sets. For these reasons, we expect the expanded use of the GenoChip in genetic anthropology research will yield important new insights into the history of our species.
Acknowledgments
We are grateful to David Reich, Nick Patterson, Morten Rasmussen, Robert Hastings, and Dienekes Pontikos for sharing their data with us and for fruitful discussions. We also thank Alon Keinan and the Illumina development team for their feedback and support.
Supplementary Materials
Supplementary Note 1 – attached in this document
Supplementary Note 2 – attached in this document
Supplementary Table 1 – is available from http://eelhaik.aravindachakravartilab.org/files/TableS1.xlsx
Materials and Methods
Genotype Data Retrieval
Genotyped samples for nearly 300 worldwide populations were obtained from 16 public and private collections [7,8,9,10,12,15,35,52,53,54,55,56,57,58,59,60] and the FamilyTreeDNA collection. To study gene flow from apes, ancient hominins, and modern humans, we used the dataset of 257,000 high-quality autosomal SNPs assembled by Reich et al. [61].
Identifying AIMs
Ancestry informative markers (AIMs) were collected using different approaches. To begin with, 55,000 AIMs were harvested from the published literature [2,16,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76] and 12,000 AIMs were obtained from private collections. We also employed two methods, infocalc [30] and AIMsFinder, to find additional AIMs in our global multi-population panels. Because infocalc [77] does not estimate the minimal number of AIMs necessary to distinguish populations from each other, but rather ranks SNPs by their informativeness for inferring ancestry, we selected the top 1% of infocalc results when applied to our population panels. To find the minimal set of AIMs necessary to distinguish any two populations, we developed AIMsFinder.
AIMsFinder is a novel principal component (PC)-based approach that identifies the most informative set of markers that can distinguish between two populations. To circumvent biases caused by the comparison of an uneven number of populations [78,79], we implemented Elhaik’s [37] dual-population framework consisting of three “outgroup” populations that are available in large sample sizes and are the least admixed with each other - Mbuti and Biaka Pygmies (Africa), French Basques (Europe), and Han Chinese (Asia) - and two populations of interest, all of equal sample sizes. This framework minimizes the number of significant PCs to four or fewer (Tracy-Widom test, p < 0.01) and maximizes the portion of explained variance to over 20% for the first two PCs. Convex hulls were calculated using the Matlab “convhull” function and plotted around the cluster centroids. The relatedness between two populations of interest was estimated by the commensurate overlap of their clusters. AIMsFinder attempts to find a minimal set of markers that reduces this overlap.
Principal components were calculated by first forming a matrix consisting of m subjects (rows) and n SNPs (columns) and sorting each pair of bases alphabetically. Next, the genotype data were transformed to an integer matrix A of the same size, where each entry is encoded as 0, 1, 2, or empty based on the count of the left allele. In other words, if $B_1$ and $B_2$ were the bases to appear in the j-th SNP (in alphabetical order), one homozygote would be encoded as 0, $B_1B_2$ heterozygotes would be encoded as 1, and the other homozygote would be set to 2. Missing SNPs were removed from all populations. Using the approach described by Paschou et al. [80], a singular value decomposition (SVD) was applied on A in order to compute its singular vectors and values. The SVD returns m nonnegative singular values $s_i$, m pairwise orthonormal eigenvectors $u_i$, and n pairwise orthonormal eigenvectors $v_i$. The SVD of the m x n matrix A is given by $A = USV^{T}$ (with the $u_i$ and $v_i$ as the columns of $U$ and $V$ and the $s_i$ on the diagonal of $S$) and can be written as a sum of outer products:
$$A = \sum_{i=1}^{m} s_i u_i v_i^{T}. \qquad (1)$$
The left singular vectors ($u_i$) are the linear combinations of the columns (SNPs) of the matrix A and are denoted eigenSNPs. In the dual-population framework, the number of significant principal components k corresponding to the eigenSNPs is less than four and for simplicity was set to two. The columns (SNPs) that correlate with the top eigenSNPs can be identified by rearranging Eq. 1 such that the j-th column (SNP) of matrix A (denoted by $A_j$) can be expressed as
$$A_j \approx \sum_{i=1}^{k} (s_i u_i)\, v_{ij}, \qquad (2)$$
where $v_{ij}$ is the j-th element of the i-th right singular vector and $A_j$ is the linear combination of the top k left singular vectors and corresponding singular values. Finally, SNPs were scored by $p_j$:
$$p_j = \sum_{i=1}^{k} (v_{ij})^{2}. \qquad (3)$$
Paschou et al. [80] proposed to use all the inferred SNPs for further analysis, even though such a strategy would yield a very large number of SNPs. AIMsFinder’s strategy is to rank the SNPs by their score, iteratively select the top 50 SNPs, and test whether using only these SNPs will reduce the overlap between the two populations in question. If so, then the process is repeated with an additional 50 SNPs until either no further reduction is achieved or a threshold of 2,000 SNPs is reached. We carried out these pairwise calculations on all populations within the same continents for which genotype data were available. Less than 10% of these populations reached the threshold. Both infocalc and AIMsFinder were applied to the multi-population panels independently and collectively, and retrieved nearly 75,000 AIMs.
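The SVD-based scoring step described above (Paschou et al. [80]) can be sketched in Python. This is a minimal sketch under stated assumptions rather than the AIMsFinder implementation: the function names `pca_snp_scores` and `top_snps` are hypothetical, NumPy replaces the Matlab environment mentioned in the text, the genotype matrix is assumed to be already encoded as 0/1/2 with no missing entries, and the iterative convex-hull overlap test is not reproduced.

```python
import numpy as np

def pca_snp_scores(A, k=2):
    """Score each SNP (column of A) by its loading on the top k right singular vectors.

    A: (m subjects, n SNPs) genotype matrix encoded as 0/1/2, no missing values.
    Returns an array of n scores, p_j = sum_{i<=k} v_ij ** 2.
    """
    A = np.asarray(A, dtype=float)
    # Economy-size SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    # The SVD is applied to A directly, as in the text (no centering step added).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0)

def top_snps(scores, n_top=50):
    """Indices of the n_top highest-scoring SNPs (ties in arbitrary order)."""
    order = np.argsort(scores)[::-1]
    return order[:n_top]

# Toy panel: SNPs 0 and 1 separate two groups of subjects, SNP 2 is uninformative
A = [[2, 0, 1],
     [2, 0, 1],
     [0, 2, 1],
     [0, 2, 1]]
scores = pca_snp_scores(A, k=2)
candidates = top_snps(scores, n_top=2)
```

In the procedure described in the text, the ranked list would then be consumed in increments of 50 SNPs, stopping when the overlap between the two population clusters no longer shrinks or when 2,000 SNPs are reached.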
Identifying regions of potential gene flow from archaic hominins into modern humans
Several regions showing potential interbreeding between Neanderthals and Denisovans and modern humans were identified from existing datasets and the literature. First, we obtained ~160,000 SNPs from the UCSC table ntSssSNPs [5], in which at least four of six modern human genomes (human reference, San, Yoruba, Han, Papuan, and French) have the derived allele while all observed Neanderthal alleles have the ancestral form. Because these SNPs were highly clustered (e.g., 89,000 SNPs are within 10,000 bp of one another), we pruned this dataset by filtering SNPs in high LD (r² > 0.8) and retaining only common SNPs (MAF >20%). Overall, we collected ~26,000 such SNPs. An additional ~600 SNPs were collected from similar regions identified by Noonan et al. [32]. We further included 200 SNPs within 13 candidate regions for gene flow from Neanderthals to non-African modern humans [5] in which both Neanderthals and Denisovans share the ancestral allele with non-African populations [61]. An additional 200 common (MAF >20%) SNPs embedded within and around these haplotypes, as well as 125 SNPs from the X-chromosomal haplotype (B006) and its flanking regions [33], were also included.
New candidate regions enriched for alleles shared between Neanderthals and modern humans were detected in the following way. First, we identified all SNPs in which Neanderthals had the derived allele but chimpanzees and Denisovans carried the ancestral allele. Of these SNPs, we filtered out those in which Melanesians carried a lower frequency of the derived allele compared to all other populations. Next, we identified regions enriched for the Melanesian high frequency derived allele by modifying IsoPlotter, originally designed to identify compositionally homogeneous genomic domains [34]. In brief, if SNPs in the regions of interest are marked as 1’s and all other SNPs are marked as 0’s, then the algorithm finds clusters with a high frequency of 1’s in an unbiased manner [34]. We retained regions in which the derived alleles were of high frequency (>20%). The same analysis was carried out for the Melanesian low frequency derived allele SNPs. We repeated these analyses for alleles shared between modern humans, Denisovans, and chimpanzees. Overall, ~5,000 SNPs from candidate interbreeding regions were collected, giving us a total of ~30,000 SNPs dedicated to studying gene flow from archaic hominins to modern populations. In this regard, we emphasize that many SNPs are informative for more than one type of analysis, so the actual number of SNPs used for each analysis is higher.
SNP validation
To cross-validate the GenoChip’s autosomal genotypes, we genotyped 168 samples from 14 worldwide populations of the 1000 Genomes Project including: Americans of African ancestry (Southwest USA), Americans of Mexican ancestry (Los Angeles, USA), Americans with Northern and Western European ancestry (Utah, USA), British (England and Scotland), Finnish (Finland), Gujarati Indians (Houston, USA), Han Chinese (Beijing, China), Iberian (Spain), Italians (Tuscany, Italy), Japanese (Tokyo, Japan), Kinh (Ho Chi Minh City, Vietnam), Luhya (Webuye, Kenya), Peruvians (Lima, Peru), and Yoruba (Ibadan, Nigeria). We confirmed that at least one of the alleles matched those reported by the 1000 Genomes Project. The concordance rate between GenoChip and the 1000 Genomes Project genotypes was calculated as the proportion of genotypes that were identical between the two datasets.
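The concordance calculation described above can be illustrated with a minimal Python sketch. It is not the validation code used in the study: the function name `concordance_rate`, the representation of genotypes as two-letter strings, and the decisions to skip missing calls and to treat genotypes as unordered allele pairs are all assumptions made for illustration.

```python
def concordance_rate(chip, reference):
    """Proportion of identical genotype calls for one sample.

    chip, reference: equal-length sequences of genotypes such as "AG",
    aligned by SNP; None marks a missing call (skipped, an assumption).
    """
    compared = 0
    matches = 0
    for g_chip, g_ref in zip(chip, reference):
        if g_chip is None or g_ref is None:
            continue                      # site not called in both datasets
        compared += 1
        # compare as unordered allele pairs, so "AG" matches "GA"
        if sorted(g_chip) == sorted(g_ref):
            matches += 1
    return matches / compared if compared else float("nan")

# Hypothetical calls for one sample at five SNPs
chip_calls = ["AA", "AG", "GG", None, "CT"]
ref_calls = ["AA", "GA", "AG", "TT", "CT"]
rate = concordance_rate(chip_calls, ref_calls)
```

Applying the function per sample and then summarizing across samples would give per-sample rates comparable in form to the figure reported in the text.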
Comparing population genetic summary statistics between genotyping arrays
To compare the performances of the validated 130,329 autosomal and X-chromosomal SNPs of the GenoChip array to commercial arrays, we obtained the list of SNPs for the Illumina Human660W-Quad BeadChip (544,366 SNPs) from Illumina and Affymetrix Axiom Human Origins array (627,719 SNPs) available at ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz. Because of the lack of overlap between these genotyping arrays, we used subsets of data calculated for HapMap III populations. MAF and F ST estimates for African, European, and Asians, were obtained from the “continental” HapMap dataset, as described in Elhaik [49]. Briefly, genotype data of 602 unrelated individuals from eight populations (YRI, LWK, MKK, CEU, TSI, CHB, CHD, and JPT) were downloaded from the International HapMap Project web site (phase 3, second draft) [9], passed through rigorous filtering criteria, and finally merged into continental populations (African (288), European (144), and Asian (170)). The final continental data set consisted of SNPs genotyped in at least one population from each continent. We followed Wright’s [45] method to calculate F ST . For each SNP, we calculated the frequencies of both ancestral and derived alleles in each population. We then identified the allele with the smallest global frequency ( P ) when calculated as a weighted average over all populations such that ( 5.00 ≤≤ P ). The frequency of that allele was considered the minor allele frequency. Similarly, the variance of the minor allele frequency P σ was obtained and F ST was calculated as: )1( PPF P −= σ . (4) The MAF and F ST values of the continental dataset for autosomal (2,823,367) and X-chromosomal (86,449) SNPs were compared to those obtained from GenoChip (126,425 and 2,421 SNPs), Illumina Human660W (541,104 and 12,916 SNPs, respectively), and Affymetrix Axiom Human Origins Array (308,949 and 2,984 SNPs, respectively). 4 References F ST . Nat Rev Genet 10: 639-650. 49. 
Elhaik E (2012) Empirical Distributions of F ST from Large-Scale Human Polymorphism Data. PLoS ONE 7: e49837. 50. Watterson GA, Guess HA (1977) Is the most frequent allele the oldest? Theor Popul Biol 11: 141-160. 51. Kimura M, Ota T (1973) The age of a neutral mutant persisting in a finite population. Genetics 75: 199-212. 52. Xing J, Watkins WS, Witherspoon DJ, Zhang Y, Guthery SL, et al. (2009) Fine-scaled human genetic structure revealed by SNP microarrays. Genome Res 19: 815-825. 53. O'Dushlaine CT, Morris D, Moskvina V, Kirov G, Consortium IS, et al. (2010) Population structure and genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet. 7 54. Xing J, Watkins WS, Shlien A, Walker E, Huff CD, et al. (2010) Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping. Genomics 96: 199-210. 55. Hunter-Zinck H, Musharoff S, Salit J, Al-Ali KA, Chouchane L, et al. (2010) Population genetic structure of the people of Qatar. Am J Hum Genet 87: 17-25. 56. Henn BM, Gignoux CR, Jobin M, Granka JM, Macpherson JM, et al. (2011) Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc Natl Acad Sci U S A 108: 5154-5162. 57. Teo YY, Sim X, Ong RT, Tan AK, Chen J, et al. (2009) Singapore Genome Variation Project: a haplotype map of three Southeast Asian populations. Genome Res 19: 2154-2162. 58. Hatin WI, Nur-Shafawati AR, Zahri MK, Xu S, Jin L, et al. (2011) Population genetic structure of peninsular Malaysia Malay sub-ethnic groups. PLoS One 6: e18312. 59. Chaubey G, Metspalu M, Choi Y, Magi R, Romero IG, et al. (2011) Population genetic structure in Indian Austroasiatic speakers: the role of landscape barriers and sex-specific admixture. Mol Biol Evol 28: 1013-1024. 60. Yunusbayev B, Metspalu M, Järve M, Kutuev I, Rootsi S, et al. (2011) The Caucasus as an asymmetric semipermeable barrier to ancient human migrations. Mol Biol Evol. 61. 
Reich D, Green RE, Kircher M, Krause J, Patterson N, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468: 1053-1060. 62. McEvoy BP, Montgomery GW, McRae AF, Ripatti S, Perola M, et al. (2009) Geographical structure and differential natural selection among North European populations. Genome Res 19: 804-814. 63. Tian C, Kosoy R, Nassir R, Lee A, Villoslada P, et al. (2009) European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol Med 15: 371-383. 64. Nassir R, Kosoy R, Tian C, White PA, Butler LM, et al. (2009) An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels. BMC Genet 10: 39. 65. Kosoy R, Nassir R, Tian C, White PA, Butler LM, et al. (2009) Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Hum Mutat 30: 69-78. 66. Kidd JR, Friedlaender FR, Speed WC, Pakstis AJ, De La Vega FM, et al. (2011) Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples. Investig Genet 2: 1. 67. Yang N, Li H, Criswell LA, Gregersen PK, Alarcon-Riquelme ME, et al. (2005) Examination of ancestry and ethnic affiliation using highly informative diallelic DNA markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet 118: 382-392. 8 68. Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, et al. (2008) Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS One 3: e3862. 69. Halder I, Shriver M, Thomas M, Fernandez JR, Frudakis T (2008) A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Mutat 29: 648-658. 70. 
McEvoy BP, Lind JM, Wang ET, Moyzis RK, Visscher PM, et al. (2010) Whole-genome genetic diversity in a sample of Australians with deep Aboriginal ancestry. Am J Hum Genet 87: 297-305. 71. McEvoy BP, Zhao ZZ, Macgregor S, Bellis C, Lea RA, et al. (2010) European and Polynesian admixture in the Norfolk Island population. Heredity 105: 229-234. 72. Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, et al. (2007) A genomewide admixture map for Latino populations. Am J Hum Genet 80: 1024-1036. 73. Paschou P, Drineas P, Lewis J, Nievergelt CM, Nickerson DA, et al. (2008) Tracing sub-structure in the European American population with PCA-informative markers. PLoS Genet 4: e1000114. 74. Drineas P, Lewis J, Paschou P (2010) Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers. PLoS One 5: e11892. 75. Florez JC, Price AL, Campbell D, Riba L, Parra MV, et al. (2009) Strong association of socioeconomic status with genetic ancestry in Latinos: implications for admixture studies of type 2 diabetes. Diabetologia 52: 1528-1536. 76. Paschou P, Lewis J, Javed A, Drineas P (2010) Ancestry informative markers for fine-scale individual assignment to worldwide populations. J Med Genet 47: 835-847. 77. Rosenberg NA, Li LM, Ward R, Pritchard JK (2003) Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 73: 1402-1422. 78. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904-909. 79. McVean G (2009) A genealogical interpretation of principal components analysis. PLoS Genet 5: e1000686. 80. Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, et al. (2007) PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3: 1672-1686.
Figures
Figure 1. Ancestry informative markers from over 450 world populations were harvested from the literature (green) and from public and private collections (red), including over 30 Jewish populations (blue). Coordinates were obtained directly from the studies [10,12], from the approximate region reported by the authors, or from the country's capital city.
Figure 2. SNP density in the GenoChip. The SNP densities across the genome are color coded to indicate the number of polymorphic SNPs per 100 kb. Gaps in the assembly are shown in gray.
Figure 3. Success rate in validating SNPs defining Y-chromosomal and mtDNA haplogroups.
Figure 4. Minor allele frequency distributions for autosomal (a) and X-chromosomal (b) SNPs.
Figure 5. Distribution of locus-specific FST in three continental populations. FST values were obtained for (a) HapMap autosomal and (b) X-chromosomal SNPs. These values are compared to two subsets corresponding to Illumina Human660W and GenoChip SNPs. The histograms show bin distribution as indicated on the x-axis and the cumulative distribution (line).
Supplementary Figures
Figure S1. Minor allele frequency distributions for autosomal (a) and X-chromosomal (b) SNPs.
Figure S2. Distribution of locus-specific FST estimates in three continental populations. FST values were obtained for HapMap (a) autosomal and (b) X-chromosomal SNPs. These values are compared to two subsets corresponding to Human Origins and GenoChip SNPs. The histograms show bin distribution as indicated on the x-axis and the cumulative distribution (line).
Supplementary Text 1 - Y-chromosomal SNP inclusion
Figures
Figure S1. Sources of the Y-chromosomal SNPs included on the Illumina Bead Array. The total of 15,733 Y markers represents a merged collection of well-studied core markers (green) and a larger set of Y candidate SNPs (yellow). The pale yellow color indicates that no phylogenetic information has been reported by the authors.
References
1. Karafet, T.M., et al., New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Research, 2008. (5): p. 830-8.
2. Rozen, S., et al., Remarkably little variation in proteins encoded by the Y chromosome's single-copy genes, implying effective purifying selection. American Journal of Human Genetics, 2009. (6): p. 923-8.
3. Cruciani, F., et al., A revised root for the human Y chromosomal phylogenetic tree: the origin of patrilineal diversity in Africa. American Journal of Human Genetics, 2011. (6): p. 814-8.
4. Sanger, F., S. Nicklen, and A.R. Coulson, DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 1977. (12): p. 5463-7.
5. Roche NimbleGen Inc., NimbleGen Sequence Capture Custom Designs: Guide to Submitting Your Target Sequence, Version 1.0.
NimbleGen Arrays User's Guide: 454 Optimized Sequence Capture Array Delivery, Version 1.1.
Roche XL+ Sequencing Method Manual for the GS FLX+, Version May 2011.
A map of human genome variation from population-scale sequencing. Nature, 2010. (7319): p. 1061-1073.
Supplementary Text 2 - Mitochondrial DNA SNP inclusion
To infer high-resolution mitochondrial DNA (mtDNA) haplogroups, we aimed to include highly informative mtDNA single nucleotide polymorphisms (SNPs) ascertained from a large number of samples. To achieve this goal, we combined the GenBank and the private FamilyTreeDNA databases to assemble a dataset of all 6,409 known mtDNA SNPs observed in 260,782 worldwide samples. These samples include all 1,277 haplogroups known at the time, a number that was later revised to 3,552 (Behar et al. 2012). Of these SNPs, 5,589 are located in coding regions and the remainder appear in the two main hypervariable regions (HVRs). All samples were genotyped for the HVR-1 region, 57,227 samples were also genotyped for the HVR-2 region, and 20,000 were genotyped for both HVRs and the coding region.
The next step in the process was designing probes for the chip. The Illumina probe design process requires a minimum of 50 mutation-free base pairs (bp) immediately preceding or following the SNP of interest. Similar to the design of PCR primers, a probe has to be unique to the location harboring this SNP. While this process is relatively simple when working with autosomes, the mtDNA presents a thorny challenge due to its immense variability and mutation density. For example, of the total 1,118 positions in the control region, some 73% are mutable. This means that each position preceded by 50 SNPs would require 10 probes to allow detection of these variants. Because capturing all mtDNA SNPs would require an impractical number of probes, we developed an alternative approach to survey the mitochondrial genome for sequence variation (Figure S2).
First, because of the low whole genome amplification hybridization temperature, high confidence levels of probe binding can be achieved with only 20 bases. Therefore, the number of mutation-free DNA bases adjacent to the SNP of interest was reduced from 50 to 20 bp.
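The effect of shortening the required mutation-free flank can be illustrated with a minimal sketch. This is not the program used for the GenoChip design. The function name `clean_flanks`, the toy variant positions, and the treatment of the circular mtDNA as a linear sequence are all assumptions made for illustration only.

```python
MTDNA_LENGTH = 16569  # length of the human mitochondrial reference sequence


def clean_flanks(snp_pos, variable_positions, flank, seq_len=MTDNA_LENGTH):
    """Report which sides of a SNP offer `flank` consecutive invariant bases.

    Positions are 1-based. The circular mtDNA is treated as linear here,
    which is a simplifying assumption of this sketch.
    """
    # The SNP of interest itself does not count against its own flanks.
    variable = set(variable_positions) - {snp_pos}
    upstream = range(snp_pos - flank, snp_pos)            # bases preceding the SNP
    downstream = range(snp_pos + 1, snp_pos + flank + 1)  # bases following the SNP
    return {
        "upstream": upstream.start >= 1 and variable.isdisjoint(upstream),
        "downstream": downstream.stop - 1 <= seq_len
        and variable.isdisjoint(downstream),
    }


# Hypothetical variable positions surrounding a SNP at position 150.
known_variants = {100, 125, 175}
print(clean_flanks(150, known_variants, flank=50))  # neither side is clean
print(clean_flanks(150, known_variants, flank=20))  # both sides are clean
```

With these toy positions, a 50 bp requirement leaves no usable side, whereas a 20 bp requirement allows a probe to be placed on either side of the SNP.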
Second, for each SNP, we calculated the frequency of mutations in our dataset in order to exclude the rarest SNPs. Accordingly, we included only SNPs with a frequency higher than 10⁻⁵ in the coding regions, 3×10⁻⁴ in the HVR-2 region, and 7×10⁻⁵ in the HVR-1 region.
Next, we developed a program to automate the probe design based on the Revised Cambridge Reference Sequence (rCRS) (Andrews et al. 1999). As probes can be designed from either side of a SNP, we increased our chances of success by designing a separate set of probes from each direction of each SNP, unless this was impossible due to the existence of indels or poly-C regions. This layer of redundancy increased the success rate of the GenoChip and allowed us to identify SNPs and haplogroups commonly absent from commercial platforms. For example, allele A in position 8,860 (haplogroup H2a2), residing in a highly mutable region, is a definite marker for Western Europeans. We were able to capture this SNP with nearly 600 probes. Another example involves haplogroup B, which, being defined by a 9-bp deletion starting at position 8,281, is very difficult to capture. To overcome this problem, we designed a probe that identifies a transition event at the next base following the deleted region. Overall, ~32,000 probes were designed for the GenoChip to identify ~3,800 SNPs that capture over 90% of all 3,552 known haplogroups (Behar et al. 2012) (Figure 2).
Figures Legend
Figure S2. An illustration of the SNP inclusion process.
References
Andrews, R. M., I. Kubacka, P. F. Chinnery, R. N. Lightowlers, D. M. Turnbull, and N. Howell. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genetics. :147.
Behar, D. M., M. van Oven, S. Rosset, M. Metspalu, E. L. Loogvali, N. M. Silva, T. Kivisild, A. Torroni, and R. Villems. 2012. A "Copernican" reassessment of the human mitochondrial DNA tree from its root. American Journal of Human Genetics. 90