Publication


Featured research published by Christoph Lippert.


Nature Methods | 2011

FaST linear mixed models for genome-wide association studies

Christoph Lippert; Jennifer Listgarten; Ying Liu; Carl M. Kadie; Robert I. Davidson; David Heckerman

We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (http://mscompbio.codeplex.com/).
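
As a rough illustration of what the spectral transformation buys, the sketch below evaluates an LMM log likelihood after the phenotype and a single covariate have been rotated by the eigenvectors of the genetic similarity matrix. In that basis the covariance is diagonal, so each evaluation is a single pass over individuals. It assumes the standard maximum-likelihood form of the LMM with a full-rank similarity matrix and one fixed effect; the function names, the variance ratio `delta` and the grid search are illustrative assumptions, not the authors' code.

```python
import math


def rotated_lmm_log_likelihood(delta, eigvals, y_rot, x_rot):
    """Maximum-likelihood log likelihood of an LMM in the rotated basis.

    Assumed model: Cov(y) = sigma_g^2 * (K + delta * I), with
    K = U diag(eigvals) U^T, y_rot = U^T y and x_rot = U^T x for a
    single covariate x (for example an intercept).
    """
    n = len(y_rot)
    # After rotation the covariance is diagonal with entries eigvals[i] + delta.
    weights = [1.0 / (s + delta) for s in eigvals]
    # Closed-form weighted least squares for the single fixed effect.
    xx = sum(w * x * x for w, x in zip(weights, x_rot))
    xy = sum(w * x * y for w, x, y in zip(weights, x_rot, y_rot))
    beta = xy / xx
    # Closed-form ML estimate of the genetic variance for this delta.
    rss = sum(w * (y - beta * x) ** 2 for w, x, y in zip(weights, x_rot, y_rot))
    sigma_g2 = rss / n
    log_det = sum(math.log(s + delta) for s in eigvals)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + n + n * math.log(sigma_g2))


def best_delta(eigvals, y_rot, x_rot, grid):
    """Pick the delta on a user-supplied grid that maximizes the log likelihood."""
    return max(grid, key=lambda d: rotated_lmm_log_likelihood(d, eigvals, y_rot, x_rot))
```

The eigendecomposition itself is done once up front and is not shown here; the point of the sketch is only that every subsequent likelihood evaluation costs time linear in the number of individuals.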


Nature Methods | 2012

Improved linear mixed models for genome-wide association studies

Jennifer Listgarten; Christoph Lippert; Carl M. Kadie; Robert I. Davidson; Eleazar Eskin; David Heckerman

[...] to determine these similarities [1]. Here, however, we show theoretically and experimentally that carefully selecting a small number of SNPs systematically increases power (that is, it jointly reduces false positives and false negatives), improves calibration (lessens inflation or deflation of the test statistic) and reduces computational cost.
Our approach is motivated by two considerations. First, an LMM with no fixed effects using genetic similarities constructed from a set of SNPs is mathematically equivalent to a linear regression of the SNPs on the phenotype (with weights integrated over independent normal distributions having the same variance, in particular the genetic variance) [3]. That is, an LMM using a given set of SNPs for genetic similarity is equivalent to (Bayesian) linear regression using those SNPs as covariates to correct for confounding. In theory, this equivalence holds only for certain forms of genetic similarity matrices, such as the realized relationship matrix [2,3]. In practice, however, the realized relationship matrix and other measures of similarity, such as identity by state [1], yield very similar measures of association (Supplementary Note 1), and thus our demonstration is quite general.
Second, regardless of the form of regression used for GWAS, the significance of SNP-phenotype association should be determined by conditioning on exactly those SNPs that are associated with the phenotype. These SNPs include causal SNPs, or those nearby that tag causal SNPs, and SNPs that are associated by way of confounding (for example, because of population structure). By conditioning on causal or tagging SNPs, we reduce the noise in the assessment of the association [4]. By conditioning on SNPs associated because of confounding, we control for such confounding [5]. Moreover, if a SNP is unrelated to the phenotype, it should not be in the conditioning set.
In the particular case in which we use Bayesian linear regression for GWAS, the inclusion of unrelated SNPs in the genetic similarity matrix decreases the relative influence of each SNP on the phenotype (because all SNP weights share the same prior distribution whose variance, the genetic variance in the LMM view, is estimated from the data). The decrease in influence leads to incomplete correction for confounding and hence inflated test statistics and reduced power. We refer to this phenomenon as ‘dilution’.
To identify SNPs that satisfy these principles, we developed a simple heuristic that yields improved power and calibration. First, we order SNPs by their linear-regression P values from lowest to highest. Then we construct genetic similarity matrices with an increasing number of SNPs as previously ordered until we find the first minimum in λGC (the genomic control factor). In practice, the number of SNPs selected is typically smaller than the number of individuals analyzed, a condition that can be exploited by an existing algorithm, FaST-LMM, to yield large computational savings [2].
The equivalence between the LMM and Bayesian linear regression also implies that, when a given SNP is being tested, that SNP should be excluded from the computation of genetic similarity to avoid using it as a covariate. Including the SNP would make the log likelihood of the null model higher than it should be and lead to deflation of the test statistic and loss of power. We call this phenomenon ‘proximal contamination’. In addition to the SNP being tested, we also exclude those SNPs in close proximity (for example, within 2 centimorgans), as linkage disequilibrium will lead to a similar deflation and loss of power. A naive algorithm for excluding these from the similarity matrix is computationally expensive, so we developed a speedup (Supplementary Note 2). Together, the linear-regression scan to select SNPs for inclusion in the matrix and [...]
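
The selection heuristic described above can be sketched as follows. This is a minimal illustration, not the published implementation: it reads "first minimum" as the first point after which the genomic control factor rises, computes that factor from P values as the median 1-df chi-square statistic divided by its null median, and treats the association scan itself as a hypothetical callback `lambda_for_count` supplied by the caller.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_control_factor(p_values):
    """Median implied 1-df chi-square statistic divided by its null median.

    P values must be greater than 0 and at most 1.
    """
    # A two-sided P value p corresponds to a squared normal quantile at p / 2.
    stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(stats) / _CHI2_1DF_MEDIAN


def rank_snps_by_p_value(p_values):
    """Indices of SNPs ordered from lowest to highest linear-regression P value."""
    return sorted(range(len(p_values)), key=lambda i: p_values[i])


def select_snp_count(candidate_counts, lambda_for_count):
    """Return (count, lambda_GC) at the first minimum over increasing SNP counts.

    candidate_counts: increasing numbers of top-ranked SNPs to try.
    lambda_for_count: hypothetical callback that runs the scan with a similarity
        matrix built from that many top-ranked SNPs and returns lambda_GC.
    """
    best_count = candidate_counts[0]
    best_lambda = lambda_for_count(best_count)
    for count in candidate_counts[1:]:
        current = lambda_for_count(count)
        if current > best_lambda:
            break  # lambda_GC started rising, so the previous count is kept
        best_count, best_lambda = count, current
    return best_count, best_lambda
```

Stopping at the first rise keeps the number of full scans small, which matters because each call to the callback stands for a genome-wide analysis.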


Nature Methods | 2014

Epigenome-wide association studies without the need for cell-type composition

James Zou; Christoph Lippert; David Heckerman; Martin J. Aryee; Jennifer Listgarten

In epigenome-wide association studies, cell-type composition often differs between cases and controls, yielding associations that simply tag cell type rather than reveal fundamental biology. Current solutions require actual or estimated cell-type composition—information not easily obtainable for many samples of interest. We propose a method, FaST-LMM-EWASher, that automatically corrects for cell-type composition without the need for explicit knowledge of it, and then validate our method by comparison with the state-of-the-art approach. Corresponding software is available from http://www.microsoft.com/science/.


eLife | 2013

A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control

István Bartha; Jonathan M. Carlson; Chanson J. Brumme; Paul J. McLaren; Zabrina L. Brumme; M. John; David W. Haas; Javier Martinez-Picado; Judith Dalmau; Cecilio López-Galíndez; Concepción Casado; Andri Rauch; Huldrych F. Günthard; Enos Bernasconi; Pietro Vernazza; Thomas Klimkait; Sabine Yerly; Stephen J. O’Brien; Jennifer Listgarten; Nico Pfeifer; Christoph Lippert; Nicolo Fusi; Zoltán Kutalik; Todd M. Allen; Viktor Müller; P. Richard Harrigan; David Heckerman; Amalio Telenti; Jacques Fellay

HIV-1 sequence diversity is affected by selection pressures arising from host genomic factors. Using paired human and viral data from 1071 individuals, we ran >3000 genome-wide scans, testing for associations between host DNA polymorphisms, HIV-1 sequence variation and plasma viral load (VL), while considering human and viral population structure. We observed significant human SNP associations to a total of 48 HIV-1 amino acid variants (p < 2.4 × 10^−12). All associated SNPs mapped to the HLA class I region. Clinical relevance of host and pathogen variation was assessed using VL results. We identified two critical advantages to the use of viral variation for identifying host factors: (1) association signals are much stronger for HIV-1 sequence variants than VL, reflecting the ‘intermediate phenotype’ nature of viral variation; (2) association testing can be run without any clinical data. The proposed genome-to-genome approach highlights sites of genomic conflict and is a strategy generally applicable to studies of host–pathogen interaction. DOI: http://dx.doi.org/10.7554/eLife.01123.001


Bioinformatics | 2013

A Lasso multi-marker mixed model for association mapping with population structure correction

Barbara Rakitsch; Christoph Lippert; Oliver Stegle; Karsten M. Borgwardt

Motivation: Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In traits with simple Mendelian architectures, single polymorphic loci explain a significant fraction of the phenotypic variability. However, many traits of interest seem to be subject to multifactorial control by groups of genetic loci. Accurate detection of such multivariate associations is non-trivial and often compromised by limited statistical power. At the same time, confounding influences, such as population structure, cause spurious association signals that result in false-positive findings. Results: We propose LMM-Lasso, a linear mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters; it effectively controls for population structure and scales to genome-wide datasets. LMM-Lasso simultaneously discovers likely causal variants and allows for multi-marker-based phenotype prediction from genotype. We demonstrate the practical use of LMM-Lasso in genome-wide association studies in Arabidopsis thaliana and linkage mapping in mouse, where our method achieves significantly more accurate phenotype prediction for 91% of the considered phenotypes. At the same time, our model dissects the phenotypic variability into components that result from individual single nucleotide polymorphism effects and population structure. Enrichment of known candidate genes suggests that the individual associations retrieved by LMM-Lasso are likely to be genuine. Availability: Code available under http://webdav.tuebingen.mpg.de/u/karsten/Forschung/research.html. Supplementary information: Supplementary data are available at Bioinformatics online.


Bioinformatics | 2013

A powerful and efficient set test for genetic markers that handles confounders

Jennifer Listgarten; Christoph Lippert; Eun Yong Kang; Jing Xiang; Carl M. Kadie; David Heckerman

Motivation: Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. Results: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15,000-individual Crohn’s disease case–control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. Availability: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com. Supplementary information: Supplementary data are available at Bioinformatics online.
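
Written out, a model of the kind described in this abstract might take the following shape. This is a sketch in assumed notation: the symbols $K_c$, $K_s$, $\sigma_c^2$, $\sigma_s^2$ and $\sigma_e^2$ do not appear in the abstract and are introduced here only for illustration. One similarity matrix is built from the variants in the tested set, another captures confounders, and the test asks whether the set-specific variance is zero.

```latex
% y: phenotype vector; X: fixed-effect covariates; I: identity matrix.
% K_c: similarity capturing confounders; K_s: similarity from the tested set.
\[
  y \sim \mathcal{N}\!\left(X\beta,\; \sigma_c^2 K_c + \sigma_s^2 K_s + \sigma_e^2 I\right),
  \qquad H_0 : \sigma_s^2 = 0 .
\]
% Likelihood ratio statistic: alternative versus null maximized log likelihood.
\[
  \Lambda = 2\left[\,\max_{\beta,\,\sigma_c^2,\,\sigma_s^2,\,\sigma_e^2} \log L
            \;-\; \max_{\beta,\,\sigma_c^2,\,\sigma_e^2} \log L\,\Big|_{\sigma_s^2 = 0}\right].
\]
```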


Scientific Reports | 2013

An exhaustive epistatic SNP association analysis on expanded Wellcome Trust data.

Christoph Lippert; Jennifer Listgarten; Robert I. Davidson; Jeff Baxter; Hoifung Poon; Carl M. Kadie; David Heckerman

We present an approach for genome-wide association analysis with improved power on the Wellcome Trust data consisting of seven common phenotypes and shared controls. We achieved improved power by expanding the control set to include other disease cohorts, multiple races, and closely related individuals. Within this setting, we conducted exhaustive univariate and epistatic interaction association analyses. Use of the expanded control set identified more known associations with Crohn's disease and potential new biology, including several plausible epistatic interactions in several diseases. Our work suggests that carefully combining data from large repositories could reveal many new biological insights through increased power. As a community resource, all results have been made available through an interactive web server.


Nucleic Acids Research | 2013

Patterns of methylation heritability in a genome-wide analysis of four brain regions

Gerald Quon; Christoph Lippert; David Heckerman; Jennifer Listgarten

DNA methylation has been implicated in a number of diseases and other phenotypes. It is, therefore, of interest to identify and understand the genetic determinants of methylation and epigenomic variation. We investigated the extent to which genetic variation in cis-DNA sequence explains variation in CpG dinucleotide methylation in publicly available data for four brain regions from unrelated individuals, finding that 3–4% of CpG loci assayed were heritable, with a mean estimated narrow-sense heritability of 30% over the heritable loci. Over all loci, the mean estimated heritability was 3%, as compared with a recent twin-based study reporting 18%. Heritable loci were enriched for open chromatin regions and binding sites of CTCF, an influential regulator of transcription and chromatin architecture. Additionally, heritable loci were proximal to genes enriched in several known pathways, suggesting a possible functional role for these loci. Our estimates of heritability are conservative, and we suspect that the number of identified heritable loci will increase as the methylome is assayed across a broader range of cell types and the density of the tested loci is increased. Finally, we show that the number of heritable loci depends on the window size parameter commonly used to identify candidate cis-acting single-nucleotide polymorphism variants.


Nature Methods | 2015

Efficient set tests for the genetic analysis of correlated traits

Francesco Paolo Casale; Barbara Rakitsch; Christoph Lippert; Oliver Stegle

Set tests are a powerful approach for genome-wide association testing between groups of genetic variants and quantitative traits. We describe mtSet (http://github.com/PMBio/limix), a mixed-model approach that enables joint analysis across multiple correlated traits while accounting for population structure and relatedness. mtSet effectively combines the benefits of set tests with multi-trait modeling and is computationally efficient, enabling genetic analysis of large cohorts (up to 500,000 individuals) and multiple traits.


Nature Genetics | 2013

FaST-LMM-Select for addressing confounding from spatial structure and rare variants

Jennifer Listgarten; Christoph Lippert; David Heckerman

To the Editor: A recent report by Mathieson and McVean [1] showed that confounding in genome-wide association studies (GWAS) resulting from spatially structured populations in conjunction with rare variants could not be corrected by currently available statistical genetics methods. In particular, when simulating that the non-genetic cause of disease arose from a sharply defined spatial region, genomic control [2], principal-component analysis (PCA) [3] and linear mixed models (LMMs) [4,5] all failed to correct for stratification, resulting in systematically inflated test statistics [1]. Although several research avenues were proposed as possible solutions to the problem [1], none has so far been shown to work. Additionally, it was speculated that any method that could correct for such confounding would require fine-grained geographic information [1]. As it turns out, our recently published LMM algorithm, FaST-LMM-Select [5–7], which was not specifically designed to address the particular problem of confounding due to sharply defined spatial structure and rare variants but rather to tackle general types of confounding, does address this problem. Furthermore, it does so without the need for any geographic information. In fact, our approach [5–7] yields non-inflated test statistics and maintains maximal power to detect (spatially unstructured) causal SNPs using only SNP and phenotype data.
Specifically, to examine inflation, we used the simulated data from Figure 3b of Mathieson and McVean [1], comprising 10 synthetic data sets, each with 1,000 SNPs for 800 individuals. Population structure was simulated using a lattice grid, and non-genetic risk was sampled from sharply localized geographic risk on the lattice. When compared to genomic control, linear regression, a traditional LMM [4], PCA and rare-variant versions of it, FaST-LMM-Select [5–7] was the only method that did not lead to inflated test statistics on these data (Fig. 1a and ref. 1). To examine which method had the greatest power, we augmented this simulated data set with further simulation of 1,000 rare causal SNPs generated independently from binomial distributions with minor allele frequencies (MAFs) drawn uniformly between 0.1% and 4% (MAF of 4% was the cutoff used in practice by Mathieson and McVean). Next, a realized relationship matrix (RRM) [8] was constructed from the 1,000 causal SNPs. Finally, the genetic signal was sampled from a zero-mean Gaussian distribution with covariance set to the RRM. Using these data, our approach showed higher power to detect causal variants in these simulations compared to linear regression or a traditional LMM [4] (Fig. 1b).
The reason for our method’s relative benefits in addressing this specific problem of confounding due to spatial structure and rare variants, despite its general-purpose motivation, is quite simple. Our approach [5–7] is inspired by the fact that the SNPs used to estimate similarity for the LMM can equivalently be viewed as random covariates in linear regression [8]. Given this equivalence, it becomes clear that one should use particular SNPs to estimate genetic similarity precisely when they provide information about the phenotype (either because of confounding or because they are causal) [6], rather than using all available SNPs, as has been the traditional LMM practice [4,9]. Following this equivalence to its logical conclusion yielded our new approach [6,7], which can be succinctly [...]
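
The power simulation described in this letter (rare causal SNPs with MAFs drawn uniformly between 0.1% and 4%, a realized relationship matrix built from them, and a Gaussian genetic signal with that covariance) can be sketched in plain Python. The details below are assumptions rather than the authors' procedure: SNPs are standardized to zero mean and unit variance, SNPs that happen to be monomorphic in the sample are dropped, and the signal is drawn as a random linear combination of the standardized SNPs, which has the RRM as its covariance without factorizing the matrix.

```python
import math
import random


def simulate_rare_genotypes(n_individuals, n_snps, rng, maf_low=0.001, maf_high=0.04):
    """Return SNP columns of minor-allele counts (0, 1 or 2), one list per SNP."""
    columns = []
    for _ in range(n_snps):
        maf = rng.uniform(maf_low, maf_high)
        # Binomial(2, maf): two independent allele draws per individual.
        columns.append([int(rng.random() < maf) + int(rng.random() < maf)
                        for _ in range(n_individuals)])
    return columns


def standardize(column):
    """Center and scale one SNP column; return None if it is monomorphic."""
    n = len(column)
    mean = sum(column) / n
    var = sum((g - mean) ** 2 for g in column) / n
    if var == 0.0:
        return None
    sd = math.sqrt(var)
    return [(g - mean) / sd for g in column]


def standardized_polymorphic(columns):
    """Standardize all SNPs, dropping those with no variation in the sample."""
    z = [s for s in (standardize(c) for c in columns) if s is not None]
    if not z:
        raise ValueError("all simulated SNPs are monomorphic")
    return z


def realized_relationship_matrix(z_columns):
    """RRM = Z Z^T / m for standardized SNP columns (small examples only)."""
    m, n = len(z_columns), len(z_columns[0])
    k = [[0.0] * n for _ in range(n)]
    for col in z_columns:
        for i in range(n):
            for j in range(n):
                k[i][j] += col[i] * col[j] / m
    return k


def sample_genetic_signal(z_columns, rng):
    """Draw g = Z w / sqrt(m) with w ~ N(0, I), so that Cov(g) equals the RRM."""
    m, n = len(z_columns), len(z_columns[0])
    g = [0.0] * n
    for col in z_columns:
        w = rng.gauss(0.0, 1.0)
        for i in range(n):
            g[i] += col[i] * w
    return [value / math.sqrt(m) for value in g]
```

At the sizes quoted in the letter (800 individuals, 1,000 causal SNPs), the explicit matrix loop would be slow in pure Python; `sample_genetic_signal` does not need the matrix and scales with the number of genotype entries.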

Collaboration


Dive into Christoph Lippert's collaborations.

Top Co-Authors

Oliver Stegle

European Bioinformatics Institute

Amalio Telenti

J. Craig Venter Institute
