Alexander Schönhuth | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Alexander Schönhuth is active.

Explore More

Publication

Featured researches published by Alexander Schönhuth.

Nature Genetics | 2014

Whole-genome sequence variation, population structure and demographic history of the Dutch population

Laurent C. Francioli; Androniki Menelaou; Sara L. Pulit; Freerk van Dijk; Pier Francesco Palamara; Clara C. Elbers; Pieter B. T. Neerincx; Kai Ye; Victor Guryev; Wigard P. Kloosterman; Patrick Deelen; Abdel Abdellaoui; Elisabeth M. van Leeuwen; Mannis van Oven; Martijn Vermaat; Mingkun Li; Jeroen F. J. Laros; Lennart C. Karssen; Alexandros Kanterakis; Najaf Amin; Jouke-Jan Hottenga; Eric-Wubbo Lameijer; Mathijs Kattenberg; Martijn Dijkstra; Heorhiy Byelas; Jessica van Setten; Barbera D. C. van Schaik; Jan Bot; Isaac J. Nijman; Ivo Renkens

Whole-genome sequencing enables complete characterization of genetic variation, but geographic clustering of rare alleles demands many diverse populations be studied. Here we describe the Genome of the Netherlands (GoNL) Project, in which we sequenced the whole genomes of 250 Dutch parent-offspring families and constructed a haplotype map of 20.4 million single-nucleotide variants and 1.2 million insertions and deletions. The intermediate coverage (∼13×) and trio design enabled extensive characterization of structural variation, including midsize events (30–500 bp) previously poorly catalogued and de novo mutations. We demonstrate that the quality of the haplotypes boosts imputation accuracy in independent samples, especially for lower frequency alleles. Population genetic analyses demonstrate fine-scale structure across the country and support multiple ancient migrations, consistent with historical changes in sea level and flooding. The GoNL Project illustrates how single-population whole-genome sequencing can provide detailed characterization of genetic variation and may guide the design of future population studies.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2005

Analyzing Gene Expression Time-Courses

Alexander Schliep; Ivan G. Costa; Christine Steinhoff; Alexander Schönhuth

Measuring gene expression over time can provide important insights into basic cellular processes. Identifying groups of genes with similar expression time-courses is a crucial first step in the analysis. As biologically relevant groups frequently overlap, due to genes having several distinct roles in those cellular processes, this is a difficult problem for classical clustering methods. We use a mixture model to circumvent this principal problem, with hidden Markov models (HMMs) as effective and flexible components. We show that the ensuing estimation problem can be addressed with additional labeled data partially supervised learning of mixtures - through a modification of the expectation-maximization (EM) algorithm. Good starting points for the mixture estimation are obtained through a modification to Bayesian model merging, which allows us to learn a collection of initial HMMs. We infer groups from mixtures with a simple information-theoretic decoding heuristic, which quantifies the level of ambiguity in group assignment. The effectiveness is shown with high-quality annotation data. As the HMMs we propose capture asynchronous behavior by design, the groups we find are also asynchronous. Synchronous subgroups are obtained from a novel algorithm based on Viterbi paths. We show the suitability of our HMM mixture approach on biological and simulated data and through the favorable comparison with previous approaches. A software implementing the method is freely available under the GPL from http://ghmm.org/gql.

Genome Research | 2015

Characteristics of de novo structural changes in the human genome

Wigard P. Kloosterman; Laurent C. Francioli; Tobias Marschall; Jayne Y. Hehir-Kwa; Abdel Abdellaoui; Eric-Wubbo Lameijer; Matthijs Moed; Vyacheslav Koval; Ivo Renkens; Markus J. van Roosmalen; Pascal P. Arp; Lennart C. Karssen; Bradley P. Coe; Robert E. Handsaker; E. Suchiman; Edwin Cuppen; Djie Tjwan Thung; Mitch McVey; Michael C. Wendl; Cornelia M. van Duijn; Morris A. Swertz; Gert-Jan B. van Ommen; P. Eline Slagboom; Dorret I. Boomsma; Alexander Schönhuth; Evan E. Eichler; Victor Guryev

Small insertions and deletions (indels) and large structural variations (SVs) are major contributors to human genetic diversity and disease. However, mutation rates and characteristics of de novo indels and SVs in the general population have remained largely unexplored. We report 332 validated de novo structural changes identified in whole genomes of 250 families, including complex indels, retrotransposon insertions, and interchromosomal events. These data indicate a mutation rate of 2.94 indels (1-20 bp) and 0.16 SVs (>20 bp) per generation. De novo structural changes affect on average 4.1 kbp of genomic sequence and 29 coding bases per generation, which is 91 and 52 times more nucleotides than de novo substitutions, respectively. This contrasts with the equal genomic footprint of inherited SVs and substitutions. An excess of structural changes originated on paternal haplotypes. Additionally, we observed a nonuniform distribution of de novo SVs across offspring. These results reveal the importance of different mutational mechanisms to changes in human genome structure across generations.

BMC Bioinformatics | 2013

Discovering motifs that induce sequencing errors

Manuel Allhoff; Alexander Schönhuth; Marcel Martin; Ivan G. Costa; Sven Rahmann; Tobias Marschall

BackgroundElevated sequencing error rates are the most predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, due to relating errors with individual genomic positions rather than motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors.ResultsHere, for the first time, we describe a statistically rigorous framework for the discovery of motifs that induce sequencing errors. We apply our method to several datasets from Illumina GA IIx, HiSeq 2000, and MiSeq sequencers. We confirm previously known error-causing sequence contexts and report new more specific ones.ConclusionsChecking for error-inducing motifs should be included into SNP calling pipelines to avoid false positives. To facilitate filtering of sets of putative SNPs, we provide tracks of error-prone genomic positions (in BED format).Availabilityhttp://discovering-cse.googlecode.com

Bioinformatics | 2012

CLEVER: clique-enumerating variant finder

Tobias Marschall; Ivan G. Costa; Stefan Canzar; Markus Bauer; Gunnar W. Klau; Alexander Schliep; Alexander Schönhuth

MOTIVATIONnNext-generation sequencing techniques have facilitated a large-scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals.nnnRESULTSnHere, we present a novel internal segment size based approach, which organizes all, including concordant, reads into a read alignment graph, where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance, in particular, for deletions or insertions (indels) of length 20-100 nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular, for insert size based approaches. In this size range, we even outperform split-read aligners. We achieve competitive results also on biological data, where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners.nnnAVAILABILITYnCLEVER is open source (GPL) and available from http://[email protected] or [email protected] INFORMATIONnSupplementary data are available at Bioinformatics online.

Bioinformatics | 2010

Inferring cancer subnetwork markers using density-constrained biclustering

Phuong Dao; Recep Colak; Raheleh Salari; Flavia Moser; Elai Davicioni; Alexander Schönhuth; Martin Ester

Motivation: Recent genomic studies have confirmed that cancer is of utmost phenotypical complexity, varying greatly in terms of subtypes and evolutionary stages. When classifying cancer tissue samples, subnetwork marker approaches have proven to be superior over single gene marker approaches, most importantly in cross-platform evaluation schemes. However, prior subnetwork-based approaches do not explicitly address the great phenotypical complexity of cancer. Results: We explicitly address this and employ density-constrained biclustering to compute subnetwork markers, which reflect pathways being dysregulated in many, but not necessarily all samples under consideration. In breast cancer we achieve substantial improvements over all cross-platform applicable approaches when predicting TP53 mutation status in a well-established non-cross-platform setting. In colon cancer, we raise prediction accuracy in the most difficult instances from 87% to 93% for cancer versus non−cancer and from 83% to (astonishing) 92%, for with versus without liver metastasis, in well-established cross-platform evaluation schemes. Availability: Software is available on request. Contact: [email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Bioinformatics | 2009

Constrained mixture estimation for analysis and robust classification of clinical time series

Ivan G. Costa; Alexander Schönhuth; Christoph Hafemeister; Alexander Schliep

Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-β (IFNβ) treatment response in Multiple Sclerosis (MS) patients which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard. For them the treatment should be timely ceased to mitigate the side effects. Results: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFNβ treatment. The advantages of our approach are that it takes the temporal nature of the data into account and its robustness with respect to noise, missing data and mislabeled samples. Moreover, mixture estimation enables to explore the presence of response sub-groups of patients on the transcriptional level. We clearly outperformed all prior approaches in terms of prediction accuracy, raising it, for the first time, >90%. Additionally, we were able to identify potentially mislabeled samples and to sub-divide the good responders into two sub-groups that exhibited different transcriptional response programs. This is supported by recent findings on MS pathology and therefore may raise interesting clinical follow-up questions. Availability: The method is implemented in the GQL framework and is available at http://www.ghmm.org/gql. Datasets are available at http://www.cin.ufpe.br/∼igcf/MSConst Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

research in computational molecular biology | 2014

Viral Quasispecies Assembly via Maximal Clique Enumeration

Armin Töpfer; Tobias Marschall; Rowena A. Bull; Fabio Luciani; Alexander Schönhuth; Niko Beerenwinkel

Genetic variability of virus populations within individual hosts is a key determinant of pathogenesis, virulence, and treatment outcome. It is of clinical importance to identify and quantify the intra-host ensemble of viral haplotypes, called viral quasispecies. Ultra-deep next-generation sequencing NGS of mixed samples is currently the only efficient way to probe genetic diversity of virus populations in greater detail. Major challenges with this bulk sequencing approach are i to distinguish genetic diversity from sequencing errors, ii to assemble an unknown number of different, unknown, haplotype sequences over a genomic region larger than the average read length, iii to estimate their frequency distribution, and iv to detect structural variants, such as large insertions and deletions indels that are due to erroneous replication or alternative splicing. Even though NGS is currently introduced in clinical diagnostics, the de-facto standard procedure to assess the quasispecies structure is still single-nucleotide variant SNV calling. Viral phenotypes cannot be predicted solely from individual SNVs, as epistatic interactions are abundant in RNA viruses. Therefore, reconstruction of long-range viral haplotypes has the potential to be adopted, as data is already available.

intelligent systems in molecular biology | 2004

Robust inference of groups in gene expression time-courses using mixtures of HMMs

Alexander Schliep; Christine Steinhoff; Alexander Schönhuth

MOTIVATIONnGenetic regulation of cellular processes is frequently investigated using large-scale gene expression experiments to observe changes in expression over time. This temporal data poses a challenge to classical distance-based clustering methods due to its horizontal dependencies along the time-axis. We propose to use hidden Markov models (HMMs) to explicitly model these time-dependencies. The HMMs are used in a mixture approach that we show to be superior over clustering. Furthermore, mixtures are a more realistic model of the biological reality, as an unambiguous partitioning of genes into clusters of unique functional assignment is impossible. Use of the mixture increases robustness with respect to noise and allows an inference of groups at varying level of assignment ambiguity. A simple approach, partially supervised learning, allows to benefit from prior biological knowledge during the training. Our method allows simultaneous analysis of cyclic and non-cyclic genes and copes well with noise and missing values.nnnRESULTSnWe demonstrate biological relevance by detection of phase-specific groupings in HeLa time-course data. A benchmark using simulated data, derived using assumptions independent of those in our method, shows very favorable results compared to the baseline supplied by k-means and two prior approaches implementing model-based clustering. The results stress the benefits of incorporating prior knowledge, whenever available.nnnAVAILABILITYnA software package implementing our method is freely available under the GNU general public license (GPL) at http://ghmm.org/gql

Briefings in Bioinformatics | 2016

Computational pan-genomics: status, promises and challenges

Tobias Marschall; Manja Marz; Thomas Abeel; Louis J. Dijkstra; Bas E. Dutilh; Ali Ghaffaari; Paul J. Kersey; Wigard P. Kloosterman; Veli Mäkinen; Adam M. Novak; Benedict Paten; David Porubsky; Eric Rivals; Can Alkan; Jasmijn A. Baaijens; Paul I. W. de Bakker; Valentina Boeva; Raoul J. P. Bonnal; Francesca Chiaromonte; Rayan Chikhi; Francesca D. Ciccarelli; Robin Cijvat; Erwin Datema; Cornelia M. van Duijn; Evan E. Eichler; Corinna Ernst; Eleazar Eskin; Erik Garrison; Mohammed El-Kebir; Gunnar W. Klau

Abstract Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

Explore More