Anand Bhaskar
University of California, Berkeley
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Anand Bhaskar.
PLOS Genetics | 2010
Ivan Liachko; Anand Bhaskar; Chanmi Lee; Shau Chee Claire Chung; Bik Kwoon Tye; Uri Keich
Eukaryotic chromosomes initiate DNA synthesis from multiple replication origins. The machinery that initiates DNA synthesis is highly conserved, but the sites where the replication initiation proteins bind have diverged significantly. Functional comparative genomics is an obvious approach to study the evolution of replication origins. However, to date, the Saccharomyces cerevisiae replication origin map is the only genome map available. Using an iterative approach that combines computational prediction and functional validation, we have generated a high-resolution genome-wide map of DNA replication origins in Kluyveromyces lactis. Unlike other yeasts or metazoans, K. lactis autonomously replicating sequences (KlARSs) contain a 50 bp consensus motif suggestive of a dimeric structure. This motif is necessary and largely sufficient for initiation and was used to dependably identify 145 of the up to 156 non-repetitive intergenic ARSs projected for the K. lactis genome. Though similar in genome sizes, K. lactis has half as many ARSs as its distant relative S. cerevisiae. Comparative genomic analysis shows that ARSs in K. lactis and S. cerevisiae preferentially localize to non-syntenic intergenic regions, linking ARSs with loci of accelerated evolutionary change.
Genome Research | 2015
Anand Bhaskar; Y. X. Rachel Wang; Yun S. Song
With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.
Proceedings of the National Academy of Sciences of the United States of America | 2014
Anand Bhaskar; Andrew G. Clark; Yun S. Song
Significance Sample sizes in population genomic studies are rapidly increasing to the point where assumptions underlying analytical tools may be violated. This theoretical work examines the accuracy of a widely used probabilistic model, called the coalescent, for describing the ancestry of a sample of individuals. A method for performing exact computation of various genealogical quantities is developed here, and it is shown that the coalescent prediction of rare variants can be noticeably inaccurate when the sample is very large. A hybrid algorithm, which combines discrete- and continuous-time models, is proposed to balance the trade-off between accuracy and computational efficiency. Study sample sizes in human genetics are growing rapidly, and in due course it will become routine to analyze samples with hundreds of thousands, if not millions, of individuals. In addition to posing computational challenges, such large sample sizes call for carefully reexamining the theoretical foundation underlying commonly used analytical tools. Here, we study the accuracy of the coalescent, a central model for studying the ancestry of a sample of individuals. The coalescent arises as a limit of a large class of random mating models, and it is an accurate approximation to the original model provided that the population size is sufficiently larger than the sample size. We develop a method for performing exact computation in the discrete-time Wright–Fisher (DTWF) model and compare several key genealogical quantities of interest with the coalescent predictions. For recently inferred demographic scenarios, we find that there are a significant number of multiple- and simultaneous-merger events under the DTWF model, which are absent in the coalescent by construction. Furthermore, for large sample sizes, there are noticeable differences in the expected number of rare variants between the coalescent and the DTWF model. To balance the trade-off between accuracy and computational efficiency, we propose a hybrid algorithm that uses the DTWF model for the recent past and the coalescent for the more distant past. Our results demonstrate that the hybrid method with only a handful of generations of the DTWF model leads to a frequency spectrum that is quite close to the prediction of the full DTWF model.
The Annals of Applied Statistics | 2014
Matthias Steinrücken; Anand Bhaskar; Yun S. Song
The increased availability of time series genetic variation data from experimental evolution studies and ancient DNA samples has created new opportunities to identify genomic regions under selective pressure and to estimate their associated fitness parameters. However, it is a challenging problem to compute the likelihood of non-neutral models for the population allele frequency dynamics, given the observed temporal DNA data. Here, we develop a novel spectral algorithm to analytically and efficiently integrate over all possible frequency trajectories between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the population allele frequency space when numerically approximating requisite integrals. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and also apply it to analyze ancient DNA data from genetic loci associated with coat coloration in horses. In contrast to previous studies, our exploration of the full fitness parameter space reveals that a heterozygote-advantage form of balancing selection may have been acting on these loci.
Advances in Applied Probability | 2012
Anand Bhaskar; John A. Kamm; Yun S. Song
Many applications in genetic analyses utilize sampling distributions, which describe the probability of observing a sample of DNA sequences randomly drawn from a population. In the one-locus case with special models of mutation, such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for more general models of mutation that are of biological interest. In this paper, models with finitely-many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulae for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulae derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.
Advances in Applied Probability | 2012
Anand Bhaskar; Yun S. Song
Obtaining a closed-form sampling distribution for the coalescent with recombination is a challenging problem. In the case of two loci, a new framework based on an asymptotic series has recently been developed to derive closed-form results when the recombination rate is moderate to large. In this paper, an arbitrary number of loci is considered and combinatorial approaches are employed to find closed-form expressions for the first couple of terms in an asymptotic expansion of the multi-locus sampling distribution. These expressions are universal in the sense that their functional form in terms of the marginal one-locus distributions applies to all finite- and infinite-alleles models of mutation.
Bioinformatics | 2009
Anand Bhaskar; Yun S. Song
Motivation: A fundamental problem in population genetics, which being also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11–13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the USA, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than five loci has hitherto remained an open problem. Results: In this article, we adopt a flexible graphical framework to compute multi-locus MPs analytically. We consider two standard models of random mating, namely the Wright–Fisher (WF) and Moran models. We succeed in computing haplotypic MPs for up to 10 loci in the WF model, and up to 13 loci in the Moran model. For a finite population and a large number of loci, we show that the MPs predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true MPs computed using our graphical framework are not. Furthermore, we show that the WF and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, we demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the WF model. Availability: A C++ implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/∼yss/software.html. Contact: [email protected]
PLOS Genetics | 2016
Arbel Harpak; Anand Bhaskar; Jonathan K. Pritchard
The site frequency spectrum (SFS) has long been used to study demographic history and natural selection. Here, we extend this summary by examining the SFS conditional on the alleles found at the same site in other species. We refer to this extension as the “phylogenetically-conditioned SFS” or cSFS. Using recent large-sample data from the Exome Aggregation Consortium (ExAC), combined with primate genome sequences, we find that human variants that occurred independently in closely related primate lineages are at higher frequencies in humans than variants with parallel substitutions in more distant primates. We show that this effect is largely due to sites with elevated mutation rates causing significant departures from the widely-used infinite sites mutation model. Our analysis also suggests substantial variation in mutation rates even among mutations involving the same nucleotide changes. In summary, we show that variable mutation rates are key determinants of the SFS in humans.
international conference on management of data | 2006
Anand Bhaskar; Chavdar Botev; Muthiah M. Muthaia Chettiar; Lin Guo; Jayavel Shanmugasundaram; Feng Shao; Fan Yang
The XQuery 1.0 and XPath 2.0 Full-text (XQFT) language has been developed by the W3C to extend XQuery and XPath with full-text search capabilities. XQFT allows users to specify a mix of structured and complex full-text predicates, and also allows users to score/rank such queries. The power and flexibility of XQFT gives rise to two interesting questions. First, is it possible to efficiently integrate a full-function XML query language with sophisticated full-text search? Second, is it possible to score and rank arbitrary XQuery and XQFT queries? In this demonstration, we present evidence that it is indeed possible to achieve the above goals. We demonstrate the Quark open-source data management system and show how we can seamlessly and efficiently integrate structured and unstructured search over XML data. In particular, we demonstrate (a) techniques for efficiently evaluating keyword search over virtual XML views, and (b) a framework for scoring both structured and full-text predicates.
American Journal of Human Genetics | 2017
Diego Calderon; Anand Bhaskar; David Knowles; David Golan; Towfique Raj; Audrey Qiuyan Fu; Jonathan K. Pritchard
Previous studies have prioritized trait-relevant cell types by looking for an enrichment of genome-wide association study (GWAS) signal within functional regions. However, these studies are limited in cell resolution by the lack of functional annotations from difficult-to-characterize or rare cell populations. Measurement of single-cell gene expression has become a popular method for characterizing novel cell types, and yet limited work has linked single-cell RNA sequencing (RNA-seq) to phenotypes of interest. To address this deficiency, we present RolyPoly, a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and gene expression data. RolyPoly is designed to use expression data from either bulk tissue or single-cell RNA-seq. In this study, we demonstrated RolyPolys accuracy through simulation and validated previously known tissue-trait associations. We discovered a significant association between microglia and late-onset Alzheimer disease and an association between schizophrenia and oligodendrocytes and replicating fetal cortical cells. Additionally, RolyPoly computes a trait-relevance score for each gene to reflect the importance of expression specific to a cell type. We found that differentially expressed genes in the prefrontal cortex of individuals with Alzheimer disease were significantly enriched with genes ranked highly by RolyPoly gene scores. Overall, our method represents a powerful framework for understanding the effect of common variants on cell types contributing to complex traits.