Yun S. Song

University of California, Berkeley

Publication

Featured research published by Yun S. Song.

Nature Reviews Genetics | 2011

Genotype and SNP calling from next-generation sequencing data

Rasmus Nielsen; Joshua S. Paul; Anders Albrechtsen; Yun S. Song

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.

Genetics | 2012

Genomic Variation in Natural Populations of Drosophila melanogaster

Charles H. Langley; Kristian A. Stevens; Charis Cardeno; Yuh Chwen G. Lee; Daniel R. Schrider; John E. Pool; Sasha A. Langley; Charlyn Suarez; Russell Corbett-Detig; Bryan Kolaczkowski; Shu Fang; Phillip M. Nista; Alisha K. Holloway; Andrew D. Kern; Colin N. Dewey; Yun S. Song; Matthew W. Hahn; David J. Begun

This report of independent genome sequences of two natural populations of Drosophila melanogaster (37 from North America and 6 from Africa) provides unique insight into forces shaping genomic polymorphism and divergence. Evidence of interactions between natural selection and genetic linkage is abundant not only in centromere- and telomere-proximal regions, but also throughout the euchromatic arms. Linkage disequilibrium, which decays within 1 kbp, exhibits a strong bias toward coupling of the more frequent alleles and provides a high-resolution map of recombination rate. The juxtaposition of population genetics statistics in small genomic windows with gene structures and chromatin states yields a rich, high-resolution annotation, including the following: (1) 5′- and 3′-UTRs are enriched for regions of reduced polymorphism relative to lineage-specific divergence; (2) exons overlap with windows of excess relative polymorphism; (3) epigenetic marks associated with active transcription initiation sites overlap with regions of reduced relative polymorphism and relatively reduced estimates of the rate of recombination; (4) the rate of adaptive nonsynonymous fixation increases with the rate of crossing over per base pair; and (5) both duplications and deletions are enriched near origins of replication and their density correlates negatively with the rate of crossing over. Available demographic models of X and autosome descent cannot account for the increased divergence on the X and loss of diversity associated with the out-of-Africa migration. Comparison of the variation among these genomes to variation among genomes from D. simulans suggests that many targets of directional selection are shared between these species.

Nature | 2016

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Swapan Mallick; Heng Li; Mark Lipson; Iain Mathieson; Melissa Gymrek; Fernando Racimo; Mengyao Zhao; Niru Chennagiri; Arti Tandon; Pontus Skoglund; Iosif Lazaridis; Sriram Sankararaman; Qiaomei Fu; Nadin Rohland; Gabriel Renaud; Yaniv Erlich; Thomas Willems; Carla Gallo; Jeffrey P. Spence; Yun S. Song; Giovanni Poletti; Francois Balloux; George van Driem; Peter de Knijff; Irene Gallego Romero; Aashish R. Jha; Doron M. Behar; Claudio M. Bravi; Cristian Capelli; Tor Hervig

Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.

Science | 2015

Genomic evidence for the Pleistocene and recent population history of Native Americans

Maanasa Raghavan; Matthias Steinrücken; Kelley Harris; Stephan Schiffels; Simon Rasmussen; Michael DeGiorgio; Anders Albrechtsen; Cristina Valdiosera; María C. Ávila-Arcos; Anna-Sapfo Malaspinas; Anders Eriksson; Ida Moltke; Mait Metspalu; Julian R. Homburger; Jeffrey D. Wall; Omar E. Cornejo; J. Víctor Moreno-Mayar; Thorfinn Sand Korneliussen; Tracey Pierre; Morten Rasmussen; Paula F. Campos; Peter de Barros Damgaard; Morten E. Allentoft; John Lindo; Ene Metspalu; Ricardo Rodríguez-Varela; Josefina Mansilla; Celeste Henrickson; Andaine Seguin-Orlando; Helena Malmström

Genetic history of Native Americans

Several theories have been put forth as to the origin and timing of when Native American ancestors entered the Americas. To clarify this controversy, Raghavan et al. examined the genomic variation among ancient and modern individuals from Asia and the Americas. There is no evidence for multiple waves of entry or recurrent gene flow with Asians in northern populations. The earliest migrations occurred no earlier than 23,000 years ago from Siberian ancestors. Amerindians and Athabascans originated from a single population, splitting approximately 13,000 years ago.

Science, this issue 10.1126/science.aab3884

Genetic variation within ancient and extant Native American populations informs on their migration into the Americas.

INTRODUCTION

The consensus view on the peopling of the Americas is that ancestors of modern Native Americans entered the Americas from Siberia via the Bering Land Bridge and that this occurred at least ~14.6 thousand years ago (ka). However, the number and timing of migrations into the Americas remain controversial, with conflicting interpretations based on anatomical and genetic evidence.

RATIONALE

In this study, we address four major unresolved issues regarding the Pleistocene and recent population history of Native Americans: (i) the timing of their divergence from their ancestral group, (ii) the number of migrations into the Americas, (iii) whether there was ~15,000 years of isolation of ancestral Native Americans in Beringia (Beringian Incubation Model), and (iv) whether there was post-Pleistocene survival of relict populations in the Americas related to Australo-Melanesians, as suggested by apparent differences in cranial morphologies between some early (“Paleoamerican”) remains and those of more recent Native Americans.

We generated 31 high-coverage modern genomes from the Americas, Siberia, and Oceania; 23 ancient genomic sequences from the Americas dating between ~0.2 and 6 ka; and SNP chip genotype data from 79 present-day individuals belonging to 28 populations from the Americas and Siberia. The above data sets were analyzed together with published modern and ancient genomic data from worldwide populations, after masking some present-day Native Americans for recent European admixture.

RESULTS

Using three different methods, we determined the divergence time for all Native Americans (Athabascans and Amerindians) from their Siberian ancestors to be ~20 ka, and no earlier than ~23 ka. Furthermore, we dated the divergence between Athabascans (northern Native American branch, together with northern North American Amerindians) and southern North Americans and South and Central Americans (southern Native American branch) to be ~13 ka. Similar divergence times from East Asian populations and a divergence time between the two branches that is close in age to the earliest well-established archaeological sites in the Americas suggest that the split between the branches occurred within the Americas. We additionally found that several sequenced Holocene individuals from the Americas are related to present-day populations from the same geographical regions, implying genetic continuity of ancient and modern populations in some parts of the Americas over at least the past 8500 years. Moreover, our results suggest that there has been gene flow between some Native Americans from both North and South America and groups related to East Asians and Australo-Melanesians, the latter possibly through an East Asian route that might have included ancestors of modern Aleutian Islanders.

Last, using both genomic and morphometric analyses, we found that historical Native American groups such as the Pericúes and Fuego-Patagonians were not “relicts” of Paleoamericans, and hence, our results do not support an early migration of populations directly related to Australo-Melanesians into the Americas.

CONCLUSION

Our results provide an upper bound of ~23 ka on the initial divergence of ancestral Native Americans from their East Asian ancestors, followed by a short isolation period of no more than ~8000 years, and subsequent entrance and spread across the Americas. The data presented are consistent with a single-migration model for all Native Americans, with later gene flow from sources related to East Asians and, indirectly, Australo-Melanesians. The single wave diversified ~13 ka, likely within the Americas, giving rise to the northern and southern branches of present-day Native Americans.

Population history of present-day Native Americans. The ancestors of all Native Americans entered the Americas as a single migration wave from Siberia (purple) no earlier than ~23 ka, separate from the Inuit (green), and diversified into “northern” and “southern” Native American branches ~13 ka. There is evidence of post-divergence gene flow between some Native Americans and groups related to East Asians/Inuit and Australo-Melanesians (yellow).

How and when the Americas were populated remains contentious. Using ancient and modern genome-wide data, we found that the ancestors of all present-day Native Americans, including Athabascans and Amerindians, entered the Americas as a single migration wave from Siberia no earlier than 23 thousand years ago (ka) and after no more than an 8000-year isolation period in Beringia. After their arrival to the Americas, ancestral Native Americans diversified into two basal genetic branches around 13 ka, one that is now dispersed across North and South America and the other restricted to North America. Subsequent gene flow resulted in some Native Americans sharing ancestry with present-day East Asians (including Siberians) and, more distantly, Australo-Melanesians. Putative “Paleoamerican” relict populations, including the historical Mexican Pericúes and South American Fuego-Patagonians, are not directly related to modern Australo-Melanesians as suggested by the Paleoamerican Model.

Genetics | 2005

The Hitchhiking Effect on Linkage Disequilibrium Between Linked Neutral Loci

Wolfgang Stephan; Yun S. Song; Charles H. Langley

We analyzed a three-locus model of genetic hitchhiking with one locus experiencing positive directional selection and two partially linked neutral loci. Following the original hitchhiking approach by Maynard Smith and Haigh, our analysis is purely deterministic. In the first half of the selected phase after a favored mutation has entered the population, hitchhiking may lead to a strong increase of linkage disequilibrium (LD) between the two neutral sites if both are <0.1s away from the selected site (where s is the selection coefficient). In the second half of the selected phase, the main effect of hitchhiking is to destroy LD. This occurs very quickly (before the end of the selected phase) when the selected site is between both neutral loci. This pattern cannot be attributed to the well-known variation-reducing effect of hitchhiking but is a consequence of secondary hitchhiking effects on the recombinants created in the selected phase. When the selected site is outside the neutral loci (which are, say, <0.1s apart), however, a fast decay of LD is observed only if the selected site is in the immediate neighborhood of one of the neutral sites (i.e., if the recombination rate r between the selected site and one of the neutral sites satisfies r ≪ 0.1s). If the selected site is far away from the neutral sites (say, r > 0.3s), the decay rate of LD approaches that of neutrality. Averaging over a uniform distribution of initial gamete frequencies shows that the expected LD at the end of the hitchhiking phase is driven toward zero, while the variance is increased when the selected site is well outside the two neutral sites. When the direction of LD is polarized with respect to the more common allele at each neutral site, hitchhiking creates more positive than negative linkage disequilibrium. Thus, hitchhiking may have a distinctively patterned LD-reducing effect, in particular near the target of selection.
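
As a small illustration of the quantity discussed in this abstract, the sketch below computes the standard two-locus LD coefficient D = f(AB) - p(A)p(B) from four haplotype frequencies, and a version polarized toward the more common allele at each site. The function names and the tie-breaking rule at frequency 0.5 are assumptions made for this example; it does not reproduce the three-locus hitchhiking model of the paper.

```python
def ld_coefficient(f_AB, f_Ab, f_aB, f_ab):
    """Standard LD coefficient D = f(AB) - p(A) * p(B) for two biallelic sites."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab  # frequency of allele A at the first site
    p_B = f_AB + f_aB  # frequency of allele B at the second site
    return f_AB - p_A * p_B


def polarized_ld(f_AB, f_Ab, f_aB, f_ab):
    """D measured between the more common alleles at the two sites.

    Swapping the reference allele at one site flips the sign of D, so the
    polarized value is D times +1 or -1 for each site. Ties at frequency
    0.5 keep A (or B) as the reference allele; that choice is arbitrary.
    """
    d = ld_coefficient(f_AB, f_Ab, f_aB, f_ab)
    p_A = f_AB + f_Ab
    p_B = f_AB + f_aB
    sign_A = 1.0 if p_A >= 0.5 else -1.0
    sign_B = 1.0 if p_B >= 0.5 else -1.0
    return d * sign_A * sign_B
```

With this convention, a positive polarized value means the two more common alleles occur together more often than expected under independence.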

Genetics | 2013

Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach

Sara Sheehan; Kelley Harris; Yun S. Song

Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a coalescent-based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.

Genome Research | 2011

ECHO: A reference-free short-read error correction algorithm

Wei-Chun Kao; Andrew H. Chan; Yun S. Song

Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short-reads, without the need of a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters of which optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve the accuracy of previous error-correction methods by several folds to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.

PLOS Genetics | 2012

Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster

Andrew H. Chan; Paul A. Jenkins; Yun S. Song

Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine-scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features—including recombination rates, diversity, divergence, GC content, gene content, and sequence quality—is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity.

Journal of Computational Biology | 2005

Constructing Minimal Ancestral Recombination Graphs

Yun S. Song; Jotun Hein

By viewing the ancestral recombination graph as defining a sequence of trees, we show how possible evolutionary histories consistent with given data can be constructed using the minimum number of recombination events. In contrast to previously known methods, which yield only estimated lower bounds, our method of detecting recombination always gives the minimum number of recombination events if the right kind of rooted trees are used in our algorithm. A new lower bound can be defined if rooted trees with fewer constraints are used. As well as studying how often it actually is equal to the minimum, we test how this new lower bound performs in comparison to some other lower bounds. Our study indicates that the new lower bound is an improvement on earlier bounds. Also, using simulated data, we investigate how well our method can recover the actual site-specific evolutionary relationships. In the presence of recombination, using a single tree to describe the evolution of the entire locus clearly leads to lower average recovery percentages than does our method. Our study shows that recovering the actual local tree topologies can be done more accurately than estimating the actual number of recombination events.
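
For context on the lower bounds that this abstract compares against, the sketch below implements one classic example, the Hudson-Kaplan bound based on the four-gamete test. It is not the algorithm of this paper. It assumes biallelic sites coded as the characters 0 and 1 and the infinite-sites model, and the function names are chosen here for illustration only.

```python
from itertools import combinations


def has_four_gametes(haplotypes, i, j):
    """True if sites i and j show all four gametes 00, 01, 10 and 11."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def hudson_kaplan_bound(haplotypes):
    """Lower bound on the number of recombination events.

    Each pair of sites with four gametes requires at least one
    recombination between them (under infinite sites). The bound is the
    largest number of such site intervals that do not overlap, found
    greedily by always taking the interval that ends first.
    """
    n_sites = len(haplotypes[0])
    intervals = [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if has_four_gametes(haplotypes, i, j)
    ]
    intervals.sort(key=lambda pair: pair[1])
    count = 0
    last_end = 0
    for start, end in intervals:
        # Intervals that only share an endpoint site cover disjoint regions.
        if start >= last_end:
            count += 1
            last_end = end
    return count
```

Bounds of this kind can fall short of the true minimum, which is the gap that the abstract addresses.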

Genome Research | 2009

BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing

Wei-Chun Kao; Kristian A. Stevens; Yun S. Song

Extracting sequence information from raw images of fluorescence is the foundation underlying several high-throughput sequencing platforms. Some of the main challenges associated with this technology include reducing the error rate, assigning accurate base-specific quality scores, and reducing the cost of sequencing by increasing the throughput per run. To demonstrate how computational advancement can help to meet these challenges, a novel model-based base-calling algorithm, BayesCall, is introduced for the Illumina sequencing platform. Being founded on the tools of statistical learning, BayesCall is flexible enough to incorporate various features of the sequencing process. In particular, it can easily incorporate time-dependent parameters and model residual effects. This new approach significantly improves the accuracy over Illumina's base-caller Bustard, particularly in the later cycles of a sequencing run. For 76-cycle data on a standard viral sample, phiX174, BayesCall improves Bustard's average per-base error rate by approximately 51%. The probability of observing each base can be readily computed in BayesCall, and this probability can be transformed into a useful base-specific quality score with a high discrimination ability. A detailed study of BayesCall's performance is presented here.
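
One common way to turn a per-base probability into a quality score is the Phred scale, Q = -10 log10(P(error)). The sketch below shows that conversion under the assumption that the score is Phred-scaled; the abstract does not state the exact transformation BayesCall uses, and the function name and the cap of 60 are hypothetical choices for this example.

```python
import math


def phred_quality(p_correct, cap=60):
    """Phred-scaled quality for a base call with P(correct) = p_correct.

    Assumes the conventional definition Q = -10 * log10(1 - p_correct).
    The cap keeps the score finite when the error probability is zero.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    p_error = 1.0 - p_correct
    if p_error <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_error))))
```

Under this convention, a call that is 99% likely to be correct gets a score of 20, and each additional 10 points corresponds to a tenfold drop in error probability.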