Publication


Featured research published by Yael Baran.


Bioinformatics | 2012

Fast and accurate inference of local ancestry in Latino populations

Yael Baran; Bogdan Pasaniuc; Sriram Sankararaman; Dara G. Torgerson; Christopher R. Gignoux; Celeste Eng; William Rodriguez-Cintron; Rocio Chapela; Jean G. Ford; Pedro C. Avila; Jose R. Rodriguez-Santana; Esteban G. Burchard; Eran Halperin

Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).

Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.
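To make the idea of HMM-based local ancestry inference concrete, here is a minimal sketch of a textbook Viterbi decoder that labels each site of one haplotype with its most likely ancestral population. It is not LAMP-LD or LAMP-HAP: the window-based framework, the haplotype-diversity model and the trio constraint described in the abstract are all omitted. The function name, the per-population allele frequencies and the single switch probability are assumptions made for illustration only.

```python
import numpy as np

def viterbi_local_ancestry(haplotype, allele_freqs, switch_prob):
    """Most likely ancestry label per site under a simple K-state HMM (K >= 2).

    haplotype    : sequence of 0/1 alleles, length L
    allele_freqs : array of shape (K, L), frequency of allele 1 in each
                   hypothetical ancestral population at each site
    switch_prob  : probability of changing ancestry between adjacent sites
    """
    hap = np.asarray(haplotype)
    freqs = np.asarray(allele_freqs, dtype=float)
    K, L = freqs.shape

    # Log-probability of the observed allele at each site under each ancestry.
    emit = np.log(np.where(hap == 1, freqs, 1.0 - freqs))

    # Transitions: stay with probability 1 - switch_prob, otherwise move
    # uniformly to one of the other ancestries.
    trans = np.full((K, K), switch_prob / (K - 1))
    np.fill_diagonal(trans, 1.0 - switch_prob)
    log_trans = np.log(trans)

    score = np.log(np.full(K, 1.0 / K)) + emit[:, 0]
    back = np.zeros((K, L), dtype=int)
    for t in range(1, L):
        cand = score[:, None] + log_trans      # cand[i, j]: move from i to j
        back[:, t] = np.argmax(cand, axis=0)   # best predecessor of each j
        score = cand[back[:, t], np.arange(K)] + emit[:, t]

    # Trace the best path backwards from the best final state.
    path = np.zeros(L, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(L - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path.tolist()

# Toy example: allele 1 is common in population 0 and rare in population 1.
freqs = [[0.9] * 8, [0.1] * 8]
print(viterbi_local_ancestry([1, 1, 1, 1, 0, 0, 0, 0], freqs, switch_prob=0.05))
```

In this toy input the decoder places a single ancestry switch in the middle of the haplotype, because one switch is far cheaper than explaining four unlikely alleles in a row.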


Genome Research | 2015

The landscape of genomic imprinting across diverse adult human tissues

Yael Baran; Meena Subramaniam; Anne Biton; Taru Tukiainen; Emily K. Tsang; Manuel A. Rivas; Matti Pirinen; Maria Gutierrez-Arcelus; Kevin S. Smith; Kim R. Kukurba; Rui Zhang; Celeste Eng; Dara G. Torgerson; Cydney Urbanek; Jin Billy Li; Jose R. Rodriguez-Santana; Esteban G. Burchard; Max A. Seibold; Daniel G. MacArthur; Stephen B. Montgomery; Noah Zaitlen; Tuuli Lappalainen

Genomic imprinting is an important regulatory mechanism that silences one of the parental copies of a gene. To systematically characterize this phenomenon, we analyze tissue specificity of imprinting from allelic expression data in 1582 primary tissue samples from 178 individuals from the Genotype-Tissue Expression (GTEx) project. We characterize imprinting in 42 genes, including both novel and previously identified genes. Tissue specificity of imprinting is widespread, and gender-specific effects are revealed in a small number of genes in muscle with stronger imprinting in males. IGF2 shows maternal expression in the brain instead of the canonical paternal expression elsewhere. Imprinting appears to have only a subtle impact on tissue-specific expression levels, with genes lacking a systematic expression difference between tissues with imprinted and biallelic expression. In summary, our systematic characterization of imprinting in adult tissues highlights variation in imprinting between genes, individuals, and tissues.


Nature Methods | 2016

Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies

Elior Rahmani; Noah Zaitlen; Yael Baran; Celeste Eng; Donglei Hu; Joshua M. Galanter; Sam S. Oh; Esteban G. Burchard; Eleazar Eskin; James Zou; Eran Halperin

In epigenome-wide association studies (EWAS), different methylation profiles of distinct cell types may lead to false discoveries. We introduce ReFACTor, a method based on principal component analysis (PCA) and designed for the correction of cell type heterogeneity in EWAS. ReFACTor does not require knowledge of cell counts, and it provides improved estimates of cell type composition, resulting in improved power and control for false positives in EWAS. Corresponding software is available at http://www.cs.tau.ac.il/~heran/cozygene/software/refactor.html.
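The general principle behind PCA-based correction can be sketched with plain PCA on simulated data: when cell-type proportions vary across samples, the leading principal component of the methylation matrix tends to track that proportion and could then be used as a covariate. This is a generic illustration and not the ReFACTor procedure; the simulation parameters, function names and the two-cell-type setup are all assumptions, and the toy methylation values are not constrained to the range 0 to 1.

```python
import numpy as np

def top_components(methylation, k):
    """Return the top-k principal component scores (samples x k)."""
    X = np.asarray(methylation, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each site
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

def simulate(n_samples=200, n_sites=500, seed=0):
    """Toy data: one varying cell-type proportion drives site-level shifts."""
    rng = np.random.default_rng(seed)
    proportion = rng.uniform(0.2, 0.8, size=n_samples)    # hypothetical cell fraction
    baseline = rng.uniform(0.2, 0.8, size=n_sites)        # per-site mean level
    cell_effect = rng.normal(0.0, 0.2, size=n_sites)      # per-site cell-type difference
    noise = rng.normal(0.0, 0.02, size=(n_samples, n_sites))
    methylation = baseline + np.outer(proportion, cell_effect) + noise
    return methylation, proportion

methylation, proportion = simulate()
pc1 = top_components(methylation, k=1)[:, 0]
corr = abs(np.corrcoef(pc1, proportion)[0, 1])
print(f"|correlation| between PC1 and simulated cell proportion: {corr:.2f}")
```

In an association analysis, components like `pc1` would be added as covariates in the per-site regression so that differences driven by cell composition are not mistaken for phenotype effects.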


Epigenetics & Chromatin | 2015

Characterization of whole-genome autosomal differences of DNA methylation between men and women

Paula Singmann; Doron Shem-Tov; Simone Wahl; Harald Grallert; Giovanni Fiorito; So-Youn Shin; Katharina Schramm; Petra Wolf; Sonja Kunze; Yael Baran; Simonetta Guarrera; Paolo Vineis; Vittorio Krogh; Salvatore Panico; Rosario Tumino; Anja Kretschmer; Christian Gieger; Annette Peters; Holger Prokisch; Caroline L Relton; Giuseppe Matullo; Thomas Illig; Melanie Waldenberger; Eran Halperin

Background: Disease risk and incidence between males and females reveal differences, and sex is an important component of any investigation of the determinants of phenotypes or disease etiology. Further striking differences between men and women are known, for instance, at the metabolic level. The extent to which men and women vary at the level of the epigenome, however, is not well documented. DNA methylation is the best known epigenetic mechanism to date.

Results: In order to shed light on epigenetic differences, we compared autosomal DNA methylation levels between men and women in blood in a large prospective European cohort of 1799 subjects, and replicated our findings in three independent European cohorts. We identified and validated 1184 CpG sites to be differentially methylated between men and women and observed that these CpG sites were distributed across all autosomes. We showed that some of the differentially methylated loci also exhibit differential gene expression between men and women. Finally, we found that the differentially methylated loci are enriched among imprinted genes, and that their genomic location in the genome is concentrated in CpG island shores.

Conclusion: Our epigenome-wide association study indicates that differences between men and women are so substantial that they should be considered in design and analyses of future studies.


Human Biology | 2013

No Evidence from Genome-wide Data of a Khazar Origin for the Ashkenazi Jews

Doron M. Behar; Mait Metspalu; Yael Baran; Naama M. Kopelman; Bayazit Yunusbayev; Ariella Gladstein; Shay Tzur; Hovhannes Sahakyan; Ardeshir Bahmanimehr; Levon Yepiskoposyan; Kristiina Tambets; Elza Khusnutdinova; Alena Kushniarevich; Oleg Balanovsky; Elena Balanovsky; Lejla Kovačević; Damir Marjanović; Evelin Mihailov; Anastasia Kouvatsi; Costas Triantaphyllidis; Roy King; Ornella Semino; Antonio Torroni; Michael F. Hammer; Ene Metspalu; Karl Skorecki; Saharon Rosset; Eran Halperin; Richard Villems; Noah A. Rosenberg

The origin and history of the Ashkenazi Jewish population have long been of great interest, and advances in high-throughput genetic analysis have recently provided a new approach for investigating these topics. We and others have argued on the basis of genome-wide data that the Ashkenazi Jewish population derives its ancestry from a combination of sources tracing to both Europe and the Middle East. It has been claimed, however, through a reanalysis of some of our data, that a large part of the ancestry of the Ashkenazi population originates with the Khazars, a Turkic-speaking group that lived to the north of the Caucasus region ∼1,000 years ago. Because the Khazar population has left no obvious modern descendants that could enable a clear test for a contribution to Ashkenazi Jewish ancestry, the Khazar hypothesis has been difficult to examine using genetics. Furthermore, because only limited genetic data have been available from the Caucasus region, and because these data have been concentrated in populations that are genetically close to populations from the Middle East, the attribution of any signal of Ashkenazi-Caucasus genetic similarity to Khazar ancestry rather than shared ancestral Middle Eastern ancestry has been problematic. Here, through integration of genotypes from newly collected samples with data from several of our past studies, we have assembled the largest data set available to date for assessment of Ashkenazi Jewish genetic origins. This data set contains genome-wide single-nucleotide polymorphisms in 1,774 samples from 106 Jewish and non-Jewish populations that span the possible regions of potential Ashkenazi ancestry: Europe, the Middle East, and the region historically associated with the Khazar Khaganate. The data set includes 261 samples from 15 populations from the Caucasus region and the region directly to its north, samples that have not previously been included alongside Ashkenazi Jewish samples in genomic studies.
Employing a variety of standard techniques for the analysis of population-genetic structure, we found that Ashkenazi Jews share the greatest genetic ancestry with other Jewish populations and, among non-Jewish populations, with groups from Europe and the Middle East. No particular similarity of Ashkenazi Jews to populations from the Caucasus is evident, particularly populations that most closely represent the Khazar region. Thus, analysis of Ashkenazi Jews together with a large sample from the region of the Khazar Khaganate corroborates the earlier results that Ashkenazi Jews derive their ancestry primarily from populations of the Middle East and Europe, that they possess considerable shared ancestry with other Jewish populations, and that there is no indication of a significant genetic contribution either from within or from north of the Caucasus region.


American Journal of Human Genetics | 2013

Enhanced localization of genetic samples through linkage-disequilibrium correction

Yael Baran; Inés Quintela; Angel Carracedo; Bogdan Pasaniuc; Eran Halperin

Characterizing the spatial patterns of genetic diversity in human populations has a wide range of applications, from detecting genetic mutations associated with disease to inferring human history. Current approaches, including the widely used principal-component analysis, are not suited for the analysis of linked markers, and local and long-range linkage disequilibrium (LD) can dramatically reduce the accuracy of spatial localization when unaccounted for. To overcome this, we have introduced an approach that performs spatial localization of individuals on the basis of their genetic data and explicitly models LD among markers by using a multivariate normal distribution. By leveraging external reference panels, we derive closed-form solutions to the optimization procedure to achieve a computationally efficient method that can handle large data sets. We validate the method on empirical data from a large sample of European individuals from the POPRES data set, as well as on a large sample of individuals of Spanish ancestry. First, we show that by modeling LD, we achieve accuracy superior to that of existing methods. Importantly, whereas other methods show decreased performance when dense marker panels are used in the inference, our approach improves in accuracy as more markers become available. Second, we show that accurate localization of genetic data can be achieved with only a part of the genome, and this could potentially enable the spatial localization of admixed samples that have a fraction of their genome originating from a given continent. Finally, we demonstrate that our approach is resistant to distortions resulting from long-range LD regions; such distortions can dramatically bias the results when unaccounted for.


PLOS Computational Biology | 2012

Joint Analysis of Multiple Metagenomic Samples

Yael Baran; Eran Halperin

The availability of metagenomic sequencing data, generated by sequencing DNA pooled from multiple microbes living jointly, has increased sharply in the last few years with developments in sequencing technology. Characterizing the contents of metagenomic samples is a challenging task, which has been extensively attempted by both supervised and unsupervised techniques, each with its own limitations. Common to practically all the methods is the processing of single samples only; when multiple samples are sequenced, each is analyzed separately and the results are combined. In this paper we propose to perform a combined analysis of a set of samples in order to obtain a better characterization of each of the samples, and provide two applications of this principle. First, we use an unsupervised probabilistic mixture model to infer hidden components shared across metagenomic samples. We incorporate the model in a novel framework for studying association of microbial sequence elements with phenotypes, analogous to the genome-wide association studies performed on human genomes: We demonstrate that stratification may result in false discoveries of such associations, and that the components inferred by the model can be used to correct for this stratification. Second, we propose a novel read clustering (also termed “binning”) algorithm which operates on multiple samples simultaneously, leveraging on the assumption that the different samples contain the same microbial species, possibly in different proportions. We show that integrating information across multiple samples yields more precise binning on each of the samples. Moreover, for both applications we demonstrate that given a fixed depth of coverage, the average per-sample performance generally increases with the number of sequenced samples as long as the per-sample coverage is high enough.


Nature Methods | 2017

Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation

Elior Rahmani; Noah Zaitlen; Yael Baran; Celeste Eng; Donglei Hu; Joshua M. Galanter; Sam S. Oh; Esteban G. Burchard; Eleazar Eskin; James Zou; Eran Halperin

Rahmani et al. reply: Zheng et al. [1] discuss potential pitfalls in our evaluation of ReFACTor [2], a reference-free method to account for cell-type heterogeneity. Below, we reproduce their analysis and demonstrate that conclusions cannot be drawn on the basis of their results owing to conceptual and technical flaws in their analysis. We show with our reanalysis and further evidence from experiments on a total of 10 data sets that ReFACTor has improved performance over alternative methods, including the reference-based method of Houseman et al. [3]. Zheng et al. [1] claim that more evidence needs to be provided to determine whether ReFACTor is applicable to tissues other than blood. They generated a “gold standard” set of “true positives” and “true negatives” for breast cancer differentially methylated CpGs (DMCs) and compared ReFACTor to Surrogate Variable Analysis (SVA) [4] using EWAS data. There are multiple problems with this analysis. First, the list of ‘true positives’ is unreliable owing to the fact that only two control individuals were used for its construction (Supplementary Note 1). We show through a simple permutation analysis that using only two controls is likely to result in tens of thousands of spurious ‘true positives’ (Supplementary Fig. 1). Therefore, benchmarking on these ‘true positives’ is an invalid approach. Second, Zheng et al. [1] report improved sensitivity for SVA; however, they do not report that a simple unadjusted analysis using a standard Bonferroni significance level achieves considerably better sensitivity and greater specificity than SVA (Supplementary Table 1). Thus, the metric used to evaluate performance is also invalid, as the naive method that does not adjust for cell-type heterogeneity outperforms a method that does. A detailed description of this experiment as well as additional flaws in their analysis is given in Supplementary Note 1.
The focus of Zheng et al. [1] on the potential loss of power in the case of many true positives is of interest. Because a reliable gold standard is currently not available, we examined this scenario by splitting a large set of breast cancer samples (n = 305) [5] into two groups on the basis of the reference-based cell-composition estimates provided by Zheng et al. [1]. One group was labeled as controls, and differential methylation effects were added to all samples in the other group in more than 20,000 sites. The results (Supplementary Note 2 and Supplementary Tables 2–4) show that ReFACTor and SVA obtain similar sensitivity, but ReFACTor captures the cell composition substantially better than SVA and thus adjusts well for false positives, whereas SVA suffers from thousands of false positives. In contrast to the argument of Zheng et al. [1], when ReFACTor is correctly applied (Zheng et al. [1] did not follow our guidelines), the ReFACTor components are dominated by information about cell-type composition rather than disease status (Supplementary Note 1 and Supplementary Fig. 2). Zheng et al. [1] next consider our original experiment in which FACS cell counts were available [2]. They argue that successively adding components may cause overfitting. However, our point in that section of Rahmani et al. [2] was to evaluate the relative performance of different methods as a function of model dimension, and thus there is no issue of overfitting (Supplementary Note 3). They evaluated ReFACTor by measuring the correlation between each cell type and ReFACTor components selected via likelihood ratio test (LRT) and observed that ReFACTor only slightly improves upon the reference-based approach. However, LRT depends on sample size, hence we re-evaluated ReFACTor using LRT with all 560 samples in the data set (as opposed to a subset).
Our analysis revealed more significant components, which leads to a substantial improvement, far outperforming the reference-based approach (Supplementary Note 3 and Supplementary Fig. 3). Finally, Zheng et al. [1] try to demonstrate the advantage of the reference-based method [3] using a very small data set with known cell composition (n = 18) [6]. However, in their analysis, Zheng et al. [1] did not correct for known batch effects, and we found that adjusting for batch information produces similar performance for ReFACTor and the reference-based method (Supplementary Fig. 4). Furthermore, such a small sample size cannot provide statistically significant evidence for the improvement of any method. Specifically, using multiple subsampled FACS data sets of 18 samples, we observed that the performance of both methods was highly variable (Supplementary Fig. 5). Moreover, Zheng et al. [1] relied on a method for determining the dimension of the data (RMT) [7]. We found that the number of dimensions estimated by RMT is linearly determined by the sample size (R² > 0.95), making it inapplicable (Supplementary Fig. 6 and Supplementary Note 3). Given that firm conclusions cannot be drawn based on small data sets, we further evaluated the performance of ReFACTor and the reference-based method using five large whole-blood data sets (minimum n = 312). We divided the samples in each data set into two groups on the basis of cell-composition distribution (Supplementary Note 3). Then, we conducted an EWAS on the assignment into groups as the phenotype. In this scenario, the assignment into groups is expected to be correlated with the true underlying cell composition, and an insufficient correction will lead to spurious associations. We found that ReFACTor consistently outperformed the reference-based method; particularly, the


Journal of Computational Biology | 2015

A Note on the Relations Between Spatio-Genetic Models

Yael Baran; Eran Halperin

Modeling human genetic variation along the continuous geographic space is a new research direction that has been stirring interest in the community during the past few years. Multiple recent works suggested different probabilistic models for the relation between geography and genetic sequence, and applied them to geographic localization, detection of selection, and correction of confounding in Genome-Wide Association Studies (GWAS). Prior to these developments, continuous representations of genetic structure were produced almost exclusively using dimensionality reduction techniques, mostly principal component analysis (PCA). Although fast and effective in some tasks, PCA suffers from multiple disadvantages, primarily stemming from the lack of an explicit underlying genetic model. We begin this note by explaining the implicit spatio-genetic model that underlies PCA. Our presentation provides insights into some of the recently proposed spatial models; particularly, we show that two of these models can be formulated as modifications of PCA, each removing one of PCA's limitations in the context of genetic analysis. We build on one of the models to derive a nonsupervised procedure for the inference of spatial structure, and empirically demonstrate that it outperforms PCA in spatial inference. We then go on to review a few additional recent works in this unifying perspective.


BMC Bioinformatics | 2015

PIGS: improved estimates of identity-by-descent probabilities by probabilistic IBD graph sampling

Danny S. Park; Yael Baran; Farhad Hormozdiari; Celeste Eng; Dara G. Torgerson; Esteban G. Burchard; Noah Zaitlen

Identifying segments in the genome of different individuals that are identical-by-descent (IBD) is a fundamental element of genetics. IBD data is used for numerous applications including demographic inference, heritability estimation, and mapping disease loci. Simultaneous detection of IBD over multiple haplotypes has proven to be computationally difficult. To overcome this, many state-of-the-art methods estimate the probability of IBD between each pair of haplotypes separately. While computationally efficient, these methods fail to leverage the clique structure of IBD, resulting in less powerful IBD identification, especially for small IBD segments. We develop a hybrid approach (PIGS), which combines the computational efficiency of pairwise methods with the power of multiway methods. It leverages the IBD graph structure to compute the probability of IBD conditional on all pairwise estimates simultaneously. We show via extensive simulations and analysis of real data that our method produces a substantial increase in the number of identified small IBD segments.
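The clique structure mentioned above follows from transitivity: at a given locus, if haplotype A is IBD with B and B is IBD with C, then A and C must also be IBD. The sketch below illustrates only that constraint, using a deterministic union-find closure over hard pairwise calls at a single locus. It is not the PIGS algorithm, which works with probabilities and graph sampling; the function name and input format are hypothetical.

```python
from itertools import combinations

def complete_ibd_cliques(haplotypes, pairwise_calls):
    """Return every haplotype pair implied to be IBD at one locus.

    haplotypes     : iterable of haplotype identifiers
    pairwise_calls : iterable of (a, b) pairs called IBD by a pairwise method
    """
    parent = {h: h for h in haplotypes}

    def find(h):
        # Follow parent links to the set representative, compressing the path.
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    # Merge the sets of every pair that was called IBD.
    for a, b in pairwise_calls:
        parent[find(a)] = find(b)

    # Group haplotypes by representative; each group is one IBD clique.
    groups = {}
    for h in haplotypes:
        groups.setdefault(find(h), []).append(h)

    implied = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            implied.add((a, b))
    return implied

calls = [("h1", "h2"), ("h2", "h3")]
print(sorted(complete_ibd_cliques(["h1", "h2", "h3", "h4"], calls)))
```

Here the pair ("h1", "h3") was never called directly but is recovered from the two calls that share "h2", which is the kind of information a purely pairwise method leaves unused.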

Collaboration


Top Co-Authors

Eran Halperin (University of California)

Celeste Eng (University of California)

Noah Zaitlen (University of California)

Donglei Hu (University of California)

Eleazar Eskin (University of California)