Fuli Yu
Baylor College of Medicine
Nature | 2010
David Altshuler; Richard A. Gibbs; Leena Peltonen; Emmanouil T. Dermitzakis; Stephen F. Schaffner; Fuli Yu; Penelope E. Bonnen; P. I. de Bakker; Panos Deloukas; Stacey Gabriel; R. Gwilliam; Sarah Hunt; Michael Inouye; Xiaoming Jia; Aarno Palotie; Melissa Parkin; Pamela Whittaker; Kyle Chang; Alicia Hawes; Lora Lewis; Yanru Ren; David A. Wheeler; Donna M. Muzny; C. Barnes; Katayoon Darvishi; Joshua M. Korn; K. Kristiansson; Cin-Ty A. Lee; S. A. McCarroll; James Nemesh
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called ‘HapMap 3’, includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.
Nature | 2015
Peter H. Sudmant; Tobias Rausch; Eugene J. Gardner; Robert E. Handsaker; Alexej Abyzov; John Huddleston; Y. Zhang; Kai Ye; Goo Jun; Markus Hsi-Yang Fritz; Miriam K. Konkel; Ankit Malhotra; Adrian M. Stütz; Xinghua Shi; Francesco Paolo Casale; Jieming Chen; Fereydoun Hormozdiari; Gargi Dayama; Ken Chen; Maika Malig; Mark Chaisson; Klaudia Walter; Sascha Meiers; Seva Kashin; Erik Garrison; Adam Auton; Hugo Y. K. Lam; Xinmeng Jasmine Mu; Can Alkan; Danny Antaki
Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.
Proceedings of the National Academy of Sciences of the United States of America | 2011
Simon Gravel; Brenna M. Henn; Ryan N. Gutenkunst; Amit Indap; Gabor T. Marth; Andrew G. Clark; Fuli Yu; Richard A. Gibbs; Carlos Bustamante
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
American Journal of Human Genetics | 2007
Alkes L. Price; Nick Patterson; Fuli Yu; D. R. Cox; Alicja Waliszewska; Gavin J. McDonald; Arti Tandon; Christine Schirmer; Julie Neubauer; Gabriel Bedoya; Constanza Duque; Alberto Villegas; Maria Cátira Bortolini; Francisco M. Salzano; Carla Gallo; Guido Mazzotti; Marcela K. Tello-Ruiz; Laura Riba; Carlos A. Aguilar-Salinas; Samuel Canizales-Quinteros; Marta Menjivar; William Klitz; Brian E. Henderson; Christopher A. Haiman; Cheryl A. Winkler; Teresa Tusié-Luna; Andres Ruiz-Linares; David Reich
Admixture mapping is an economical and powerful approach for localizing disease genes in populations of recently mixed ancestry and has proven successful in African Americans. The method holds equal promise for Latinos, who typically inherit a mix of European, Native American, and African ancestry. However, admixture mapping in Latinos has not been practical because of the lack of a map of ancestry-informative markers validated in Native American and other populations. To address this, we screened multiple databases, containing millions of markers, to identify 4,186 markers that were putatively informative for determining the ancestry of chromosomal segments in Latino populations. We experimentally validated each of these markers in at least 232 new Latino, European, Native American, and African samples, and we selected a subset of 1,649 markers to form an admixture map. An advantage of our strategy is that we focused our map on markers distinguishing Native American from other ancestries and restricted it to markers with very similar frequencies in Europeans and Africans, which decreased the number of markers needed and minimized the possibility of false disease associations. We evaluated the effectiveness of our map for localizing disease genes in four Latino populations from both North and South America.
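The marker-selection criterion described above (large frequency difference between Native American and other ancestries, similar frequencies in Europeans and Africans) can be illustrated with a minimal sketch. The function name, the population labels `NAT`, `EUR`, `AFR`, and both thresholds are assumptions made for illustration; the abstract does not state the cutoffs or the procedure the authors used.

```python
def select_ancestry_markers(freqs, min_native_gap=0.5, max_eur_afr_gap=0.1):
    """Return marker IDs that separate Native American ancestry from the rest.

    freqs maps a marker ID to allele frequencies per population, e.g.
    {"m1": {"NAT": 0.9, "EUR": 0.1, "AFR": 0.15}}.
    Both thresholds are hypothetical, not values taken from the paper.
    """
    selected = []
    for marker, f in freqs.items():
        # Europeans and Africans should look alike at this marker.
        eur_afr_gap = abs(f["EUR"] - f["AFR"])
        # Native American frequency should differ from both other groups,
        # so take the smaller of the two differences.
        native_gap = min(abs(f["NAT"] - f["EUR"]), abs(f["NAT"] - f["AFR"]))
        if native_gap >= min_native_gap and eur_afr_gap <= max_eur_afr_gap:
            selected.append(marker)
    return selected


example = {
    "m1": {"NAT": 0.9, "EUR": 0.1, "AFR": 0.15},  # informative for NAT
    "m2": {"NAT": 0.9, "EUR": 0.1, "AFR": 0.6},   # EUR and AFR differ too much
    "m3": {"NAT": 0.3, "EUR": 0.2, "AFR": 0.2},   # NAT barely differs
}
chosen = select_ancestry_markers(example)
print(chosen)
```

Requiring similar European and African frequencies means a marker mainly tracks a single contrast (Native American versus other), which is the property the abstract credits with reducing the number of markers needed.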
Science | 2013
Ekta Khurana; Yao Fu; Vincenza Colonna; Xinmeng Jasmine Mu; Hyun Min Kang; Tuuli Lappalainen; Andrea Sboner; Lucas Lochovsky; Jieming Chen; Arif Harmanci; Jishnu Das; Alexej Abyzov; Suganthi Balasubramanian; Kathryn Beal; Dimple Chakravarty; Daniel Challis; Yuan Chen; Declan Clarke; Laura Clarke; Fiona Cunningham; Uday S. Evani; Paul Flicek; Robert Fragoza; Erik Garrison; Richard A. Gibbs; Zeynep H. Gümüş; Javier Herrero; Naoki Kitabayashi; Yong Kong; Kasper Lage
Introduction: Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in noncoding regions, is a current challenge. This is especially the case for somatic variants in cancer genomes, a large proportion of which are noncoding.
Figure: Prioritization of candidate noncoding cancer drivers based on patterns of selection. (Step 1) Filter somatic variants to exclude 1000 Genomes polymorphisms; (2) retain variants in noncoding annotations; (3) retain those in “sensitive” regions; (4) prioritize those disrupting a transcription-factor binding motif and (5) residing near the center of a biological network; (6) prioritize ones in annotation blocks mutated in multiple cancer samples.
Methods: We investigated patterns of selection in DNA elements from the ENCODE project using the full spectrum of variants from 1092 individuals in the 1000 Genomes Project (Phase 1), including single-nucleotide variants (SNVs), short insertions and deletions (indels), and structural variants (SVs). Although we analyzed broad functional annotations, such as all transcription-factor binding sites, we focused more on highly specific categories such as distal binding sites of factor ZNF274. The greater statistical power of the Phase 1 data set compared with earlier ones allowed us to differentiate the selective constraints on these categories. We also used connectivity information between elements from protein-protein-interaction and regulatory networks. We integrated all the information on selection to develop a workflow (FunSeq) to prioritize personal-genome variants on the basis of their deleterious impact. As a proof of principle, we experimentally validated and characterized a few candidate variants.
Results: We identified a specific subgroup of noncoding categories with almost as much selective constraint as coding genes: “ultrasensitive” regions. We also uncovered a number of clear patterns of selection. Elements more consistently active across tissues and both maternal and paternal alleles (in terms of allele-specific activity) are under stronger selection. Variants disruptive because of mechanistic effects on transcription-factor binding (i.e., “motif-breakers”) are selected against. Higher network connectivity (i.e., for hubs) is associated with higher constraint. Additionally, many hub promoters and regulatory elements show evidence of recent positive selection. Overall, indels and SVs follow the same pattern as SNVs; however, there are notable exceptions. For instance, enhancers are enriched for SVs formed by nonallelic homologous recombination. We integrated these patterns of selection into the FunSeq prioritization workflow and applied it to cancer variants, because they present a strong contrast to inherited polymorphisms. In particular, application to ~90 cancer genomes (breast, prostate and medulloblastoma) reveals nearly a hundred candidate noncoding drivers.
Discussion: Our approach can be readily used to prioritize variants in cancer and is immediately applicable in a precision-medicine context. It can be further improved by incorporation of larger-scale population sequencing, better annotations, and expression data from large cohorts.
Identifying Important Identifiers: Each of us has millions of sequence variations in our genomes. Signatures of purifying or negative selection should help identify which of those variations is functionally important. Khurana et al. (1235587) used sequence polymorphisms from 1092 humans across 14 populations to identify patterns of selection, especially in noncoding regulatory regions. Noncoding regions under very strong negative selection included binding sites of some chromatin and general transcription factors (TFs) and core motifs of some important TF families. Positive selection in TF binding sites tended to occur in network hub promoters. Many recurrent somatic cancer variants occurred in noncoding regulatory regions and thus might indicate mutations that drive cancer.
Regions under strong selection in the human genome identify noncoding regulatory elements with possible roles in disease.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
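The six prioritization steps listed in the figure caption for this entry can be read as three hard filters followed by three ranking criteria. The sketch below illustrates that reading only; it is not the FunSeq implementation. The `SomaticVariant` class, its boolean fields, and the equal-weight score are all hypothetical stand-ins for annotations that a real workflow would compute from 1000 Genomes, ENCODE, and network data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SomaticVariant:
    # All fields are hypothetical, precomputed annotations.
    name: str
    is_known_polymorphism: bool      # step 1: present in 1000 Genomes
    in_noncoding_annotation: bool    # step 2
    in_sensitive_region: bool        # step 3
    breaks_tf_motif: bool            # step 4
    near_network_center: bool        # step 5
    recurrent_across_samples: bool   # step 6


def prioritize(variants: List[SomaticVariant]) -> List[SomaticVariant]:
    # Steps 1-3 act as filters: drop polymorphisms, keep noncoding
    # variants that fall in "sensitive" regions.
    kept = [
        v for v in variants
        if not v.is_known_polymorphism
        and v.in_noncoding_annotation
        and v.in_sensitive_region
    ]

    # Steps 4-6 rank the survivors. Equal weighting is an assumption.
    def score(v: SomaticVariant) -> int:
        return (int(v.breaks_tf_motif)
                + int(v.near_network_center)
                + int(v.recurrent_across_samples))

    return sorted(kept, key=score, reverse=True)


candidates = [
    SomaticVariant("a", True, True, True, True, True, True),
    SomaticVariant("b", False, True, True, True, False, False),
    SomaticVariant("c", False, True, True, True, True, True),
    SomaticVariant("d", False, True, False, True, True, True),
]
ranked = [v.name for v in prioritize(candidates)]
print(ranked)
```

Variant "a" is removed as a known polymorphism and "d" for lying outside a sensitive region, so only "b" and "c" are ranked.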
Genome Biology | 2011
Gabor T. Marth; Fuli Yu; Amit Indap; Kiran Garimella; Simon Gravel; Wen Fung Leong; Chris Tyler-Smith; Matthew N. Bainbridge; Thomas W. Blackwell; Xiangqun Zheng-Bradley; Yuan Chen; Danny Challis; Laura Clarke; Edward V. Ball; Kristian Cibulskis; David Neil Cooper; Bob Fulton; Chris Hartl; Dan Koboldt; Donna M. Muzny; Richard Smith; Carrie Sougnez; Chip Stewart; Alistair Ward; Jin Yu; Yali Xue; David Altshuler; Carlos Bustamante; Andrew G. Clark; Mark J. Daly
Background: Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency. Results: The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants. Conclusions: This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
Genome Research | 2010
Yufeng Shen; Zhengzheng Wan; Cristian Coarfa; Rafal Drabek; Lei Chen; Elizabeth A. Ostrowski; Yue Liu; George M. Weinstock; David A. Wheeler; Richard A. Gibbs; Fuli Yu
Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and is crucial for further genetic analysis based on the discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants (occurring at a low frequency) from sequencing errors (often occurring at frequencies orders of magnitude higher). Therefore, knowledge of the error probabilities of base calls is essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables in a logistic regression model learned from training data sets. Subsequently, it estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the results from the logistic regression model for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate of lower than 10%, with an approximately 5% or lower false-negative rate.
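The Bayesian step described above can be sketched generically. This is not the Atlas-SNP2 formula, which the abstract does not give. The sketch assumes a logistic model that outputs the probability that a substitution is an error under the class balance of its training set, and re-weights that output with the prior error probability and prior SNP rate. All function names, parameters, and numeric values are hypothetical.

```python
def posterior_error_probability(p_error_model, train_error_fraction,
                                prior_error, prior_snp):
    """Combine a classifier output with prior rates (generic Bayes sketch).

    p_error_model: model's P(error | context), in (0, 1), as learned on
        training data where a fraction `train_error_fraction` were errors.
    prior_error, prior_snp: assumed per-base prior rates of sequencing
        errors and true SNPs (hypothetical inputs).
    """
    # Recover the likelihood ratio P(x | error) / P(x | SNP) by dividing
    # out the class balance the model saw during training.
    model_odds = p_error_model / (1.0 - p_error_model)
    train_odds = train_error_fraction / (1.0 - train_error_fraction)
    likelihood_ratio = model_odds / train_odds

    # Apply the real-world prior odds of error versus SNP.
    posterior_odds = likelihood_ratio * (prior_error / prior_snp)
    return posterior_odds / (1.0 + posterior_odds)


def posterior_snp_probability(p_error_model, train_error_fraction,
                              prior_error, prior_snp):
    # A substitution is treated as either an error or a true SNP.
    return 1.0 - posterior_error_probability(
        p_error_model, train_error_fraction, prior_error, prior_snp)


# An uninformative model output (0.5 on balanced training data) leaves
# only the priors: errors assumed ten times more common than SNPs.
neutral = posterior_error_probability(0.5, 0.5, prior_error=0.01, prior_snp=0.001)
# Strong model evidence against "error" pulls the posterior down.
confident = posterior_error_probability(0.01, 0.5, prior_error=0.01, prior_snp=0.001)
print(round(neutral, 4), round(confident, 4))
```

The two calls show why the priors matter: when errors are far more frequent than true variants, a substitution needs strong context evidence before its posterior SNP probability becomes high.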
BMC Bioinformatics | 2014
Jeffrey G. Reid; Andrew Carroll; Narayanan Veeraraghavan; Mahmoud Dahdouli; Andreas Sundquist; Adam C English; Matthew N. Bainbridge; Simon White; William Salerno; Christian Buhay; Fuli Yu; Donna M. Muzny; Richard Daly; Geoff Duyk; Richard A. Gibbs; Eric Boerwinkle
Background: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. Results: To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. Conclusions: By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.
Nature Genetics | 2013
Alanna C. Morrison; Arend Voorman; Andrew D. Johnson; Xiaoming Liu; Jin Yu; Alexander H. Li; Donna M. Muzny; Fuli Yu; Kenneth Rice; Chengsong Zhu; Joshua C. Bis; Gerardo Heiss; Christopher J. O’Donnell; Bruce M. Psaty; L. Adrienne Cupples; Richard A. Gibbs; Eric Boerwinkle
We describe initial steps for interrogating whole-genome sequence data to characterize the genetic architecture of a complex trait, levels of high-density lipoprotein cholesterol (HDL-C). We report whole-genome sequencing and analysis of 962 individuals from the Cohorts for Heart and Aging Research in Genetic Epidemiology (CHARGE) studies. From this analysis, we estimate that common variation contributes more to heritability of HDL-C levels than rare variation, and screening for mendelian variants for dyslipidemia identified individuals with extreme HDL-C levels. Whole-genome sequencing analyses highlight the value of regulatory and non-protein-coding regions of the genome in addition to protein-coding regions.
Diabetologia | 2009
Jose C. Florez; Alkes L. Price; Desmond D. Campbell; Laura Riba; María Victoria Parra; Fuli Yu; Constanza Duque; Richa Saxena; Natalia Gallego; Marcela K. Tello-Ruiz; Liliana Franco; Maribel Rodríguez-Torres; Alberto Villegas; Gabriel Bedoya; Carlos A. Aguilar-Salinas; María Teresa Tusié-Luna; Andres Ruiz-Linares; David Reich
Aims/hypothesis: Type 2 diabetes is more prevalent in US American minority populations of African or Native American descent than it is in European Americans. However, the proportion of this epidemiological difference that can be ascribed to genetic or environmental factors is unknown. To determine whether genetic ancestry is correlated with diabetes risk in Latinos, we estimated the proportion of European ancestry in case–control samples from Mexico and Colombia in whom socioeconomic status had been carefully ascertained. Methods: We genotyped 67 ancestry-informative markers in 499 participants with type 2 diabetes and 197 controls from Medellín (Colombia), as well as in 163 participants with type 2 diabetes and 72 controls from central Mexico. Each participant was assigned a socioeconomic status scale via various measures. Results: Although European ancestry was associated with lower diabetes risk in Mexicans (OR [95% CI] 0.06 [0.02–0.21], p = 2.0 × 10⁻⁵) and Colombians (OR 0.26 [0.08–0.78], p = 0.02), adjustment for socioeconomic status eliminated the association in the Colombian sample (OR 0.64 [0.19–2.12], p = 0.46) and significantly attenuated it in the Mexican sample (OR 0.17 [0.04–0.71], p = 0.02). Adjustment for BMI did not change the results. Conclusions/interpretation: The proportion of non-European ancestry is associated with both type 2 diabetes and lower socioeconomic status in admixed Latino populations from North and South America. We conclude that ancestry-directed search for genetic markers associated with type 2 diabetes in Latinos may benefit from information involving social factors, as these factors have a quantitatively important effect on type 2 diabetes risk relative to ancestry effects.