Publication


Featured research published by Tobias Marschall.


Nature Genetics | 2014

Whole-genome sequence variation, population structure and demographic history of the Dutch population

Laurent C. Francioli; Androniki Menelaou; Sara L. Pulit; Freerk van Dijk; Pier Francesco Palamara; Clara C. Elbers; Pieter B. T. Neerincx; Kai Ye; Victor Guryev; Wigard P. Kloosterman; Patrick Deelen; Abdel Abdellaoui; Elisabeth M. van Leeuwen; Mannis van Oven; Martijn Vermaat; Mingkun Li; Jeroen F. J. Laros; Lennart C. Karssen; Alexandros Kanterakis; Najaf Amin; Jouke-Jan Hottenga; Eric-Wubbo Lameijer; Mathijs Kattenberg; Martijn Dijkstra; Heorhiy Byelas; Jessica van Setten; Barbera D. C. van Schaik; Jan Bot; Isaac J. Nijman; Ivo Renkens

Whole-genome sequencing enables complete characterization of genetic variation, but geographic clustering of rare alleles demands many diverse populations be studied. Here we describe the Genome of the Netherlands (GoNL) Project, in which we sequenced the whole genomes of 250 Dutch parent-offspring families and constructed a haplotype map of 20.4 million single-nucleotide variants and 1.2 million insertions and deletions. The intermediate coverage (∼13×) and trio design enabled extensive characterization of structural variation, including midsize events (30–500 bp) previously poorly catalogued and de novo mutations. We demonstrate that the quality of the haplotypes boosts imputation accuracy in independent samples, especially for lower frequency alleles. Population genetic analyses demonstrate fine-scale structure across the country and support multiple ancient migrations, consistent with historical changes in sea level and flooding. The GoNL Project illustrates how single-population whole-genome sequencing can provide detailed characterization of genetic variation and may guide the design of future population studies.


Nucleic Acids Research | 2010

Deep sequencing reveals differential expression of microRNAs in favorable versus unfavorable neuroblastoma

Johannes H. Schulte; Tobias Marschall; Marcel Martin; Philipp Rosenstiel; Pieter Mestdagh; Stefanie Schlierf; Theresa Thor; Jo Vandesompele; Angelika Eggert; Stefan Schreiber; Sven Rahmann; Alexander Schramm

Small non-coding RNAs, in particular microRNAs (miRNAs), regulate fine-tuning of gene expression and can act as oncogenes or tumor suppressor genes. Differential miRNA expression has been reported to be of functional relevance for tumor biology. Using next-generation sequencing, the unbiased and absolute quantification of the small RNA transcriptome is now feasible. Neuroblastoma (NB) is an embryonal tumor with highly variable clinical course. We analyzed the small RNA transcriptomes of five favorable and five unfavorable NBs using SOLiD next-generation sequencing, generating a total of >188 000 000 reads. MiRNA expression profiles obtained by deep sequencing correlated well with real-time PCR data. Cluster analysis differentiated between favorable and unfavorable NBs, and the miRNA transcriptomes of these two groups were significantly different. Oncogenic miRNAs of the miR17-92 cluster and the miR-181 family were overexpressed in unfavorable NBs. In contrast, the putative tumor suppressive microRNAs, miR-542-5p and miR-628, were expressed in favorable NBs and virtually absent in unfavorable NBs. In-depth sequence analysis revealed extensive post-transcriptional miRNA editing. Of 13 identified novel miRNAs, three were further analyzed, and expression could be confirmed in a cohort of 70 NBs.


Genome Research | 2015

Characteristics of de novo structural changes in the human genome

Wigard P. Kloosterman; Laurent C. Francioli; Tobias Marschall; Jayne Y. Hehir-Kwa; Abdel Abdellaoui; Eric-Wubbo Lameijer; Matthijs Moed; Vyacheslav Koval; Ivo Renkens; Markus J. van Roosmalen; Pascal P. Arp; Lennart C. Karssen; Bradley P. Coe; Robert E. Handsaker; E. Suchiman; Edwin Cuppen; Djie Tjwan Thung; Mitch McVey; Michael C. Wendl; Cornelia M. van Duijn; Morris A. Swertz; Gert-Jan B. van Ommen; P. Eline Slagboom; Dorret I. Boomsma; Alexander Schönhuth; Evan E. Eichler; Victor Guryev

Small insertions and deletions (indels) and large structural variations (SVs) are major contributors to human genetic diversity and disease. However, mutation rates and characteristics of de novo indels and SVs in the general population have remained largely unexplored. We report 332 validated de novo structural changes identified in whole genomes of 250 families, including complex indels, retrotransposon insertions, and interchromosomal events. These data indicate a mutation rate of 2.94 indels (1-20 bp) and 0.16 SVs (>20 bp) per generation. De novo structural changes affect on average 4.1 kbp of genomic sequence and 29 coding bases per generation, which is 91 and 52 times more nucleotides than de novo substitutions, respectively. This contrasts with the equal genomic footprint of inherited SVs and substitutions. An excess of structural changes originated on paternal haplotypes. Additionally, we observed a nonuniform distribution of de novo SVs across offspring. These results reveal the importance of different mutational mechanisms to changes in human genome structure across generations.


Bioinformatics | 2009

Efficient exact motif discovery

Tobias Marschall; Sven Rahmann

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. Availability and Implementation: The method has been implemented in Java. 
It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/
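
The two scores described in the abstract above rest on two raw statistics: the total number of motif occurrences, and the number of sequences with at least one occurrence. The following is a minimal sketch of how those two counts could be obtained for an IUPAC pattern. It does not reproduce the paper's compound Poisson p-value computation or its search algorithm; the function names `iupac_to_regex` and `occurrence_statistics` and the example sequences are assumptions made for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regex (hypothetical helper)."""
    # The lookahead makes every match zero-width, so overlapping
    # occurrences are all reported by finditer.
    return re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")


def occurrence_statistics(motif, sequences):
    """Return (total occurrences, number of sequences with >= 1 occurrence)."""
    pattern = iupac_to_regex(motif)
    per_sequence = [sum(1 for _ in pattern.finditer(s)) for s in sequences]
    total = sum(per_sequence)
    with_hit = sum(1 for n in per_sequence if n > 0)
    return total, with_hit


if __name__ == "__main__":
    # Illustrative input only, not data from the paper.
    print(occurrence_statistics("RCGT", ["ACGTACGT", "TTTT", "AAGT"]))
```

Either count would then be compared against its null distribution to obtain a p-value; that step is where the compound Poisson approximation mentioned in the abstract comes in.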


BMC Bioinformatics | 2013

Discovering motifs that induce sequencing errors

Manuel Allhoff; Alexander Schönhuth; Marcel Martin; Ivan G. Costa; Sven Rahmann; Tobias Marschall

Background: Elevated sequencing error rates are the most predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, due to relating errors with individual genomic positions rather than motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors. Results: Here, for the first time, we describe a statistically rigorous framework for the discovery of motifs that induce sequencing errors. We apply our method to several datasets from Illumina GA IIx, HiSeq 2000, and MiSeq sequencers. We confirm previously known error-causing sequence contexts and report new, more specific ones. Conclusions: Checking for error-inducing motifs should be included in SNP calling pipelines to avoid false positives. To facilitate filtering of sets of putative SNPs, we provide tracks of error-prone genomic positions (in BED format). Availability: http://discovering-cse.googlecode.com


Bioinformatics | 2012

CLEVER: clique-enumerating variant finder

Tobias Marschall; Ivan G. Costa; Stefan Canzar; Markus Bauer; Gunnar W. Klau; Alexander Schliep; Alexander Schönhuth

Motivation: Next-generation sequencing techniques have facilitated a large-scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals. Results: Here, we present a novel internal segment size based approach, which organizes all, including concordant, reads into a read alignment graph, where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance, in particular, for deletions or insertions (indels) of length 20-100 nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular, for insert size based approaches. In this size range, we even outperform split-read aligners. We achieve competitive results also on biological data, where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners. Availability: CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com. Supplementary information: Supplementary data are available at Bioinformatics online.
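
To make the notion of max-cliques in a read alignment graph concrete, here is a minimal sketch that enumerates all maximal cliques of a small undirected graph. It uses the textbook Bron-Kerbosch recursion without pivoting, which is a swapped-in generic technique and not the enumeration algorithm introduced in the paper. The graph, the node names, and the function name `maximal_cliques` are assumptions for illustration; an edge is taken to mean that two alignments are compatible.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique of a graph given as {node: set(neighbours)}.

    Generic Bron-Kerbosch recursion without pivoting (not CLEVER's algorithm).
    """
    def expand(clique, candidates, excluded):
        # No node can extend the clique and none was excluded: it is maximal.
        if not candidates and not excluded:
            yield sorted(clique)
            return
        for node in list(candidates):
            yield from expand(clique | {node},
                              candidates & adjacency[node],
                              excluded & adjacency[node])
            # Mark the node as processed so the same clique is not reported twice.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(adjacency), set())


if __name__ == "__main__":
    # Hypothetical compatibility graph over four alignments a, b, c, d.
    graph = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c"},
    }
    for clique in sorted(maximal_cliques(graph)):
        print(clique)
```

In the setting of the abstract, each reported clique would be a maximal contradiction-free group of alignments that is then evaluated statistically for evidence of an insertion or deletion.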


Research in Computational Molecular Biology | 2014

Viral Quasispecies Assembly via Maximal Clique Enumeration

Armin Töpfer; Tobias Marschall; Rowena A. Bull; Fabio Luciani; Alexander Schönhuth; Niko Beerenwinkel

Genetic variability of virus populations within individual hosts is a key determinant of pathogenesis, virulence, and treatment outcome. It is of clinical importance to identify and quantify the intra-host ensemble of viral haplotypes, called viral quasispecies. Ultra-deep next-generation sequencing (NGS) of mixed samples is currently the only efficient way to probe genetic diversity of virus populations in greater detail. Major challenges with this bulk sequencing approach are (i) to distinguish genetic diversity from sequencing errors, (ii) to assemble an unknown number of different, unknown, haplotype sequences over a genomic region larger than the average read length, (iii) to estimate their frequency distribution, and (iv) to detect structural variants, such as large insertions and deletions (indels) that are due to erroneous replication or alternative splicing. Even though NGS is currently introduced in clinical diagnostics, the de-facto standard procedure to assess the quasispecies structure is still single-nucleotide variant (SNV) calling. Viral phenotypes cannot be predicted solely from individual SNVs, as epistatic interactions are abundant in RNA viruses. Therefore, reconstruction of long-range viral haplotypes has the potential to be adopted, as data is already available.


Briefings in Bioinformatics | 2016

Computational pan-genomics: status, promises and challenges

Tobias Marschall; Manja Marz; Thomas Abeel; Louis J. Dijkstra; Bas E. Dutilh; Ali Ghaffaari; Paul J. Kersey; Wigard P. Kloosterman; Veli Mäkinen; Adam M. Novak; Benedict Paten; David Porubsky; Eric Rivals; Can Alkan; Jasmijn A. Baaijens; Paul I. W. de Bakker; Valentina Boeva; Raoul J. P. Bonnal; Francesca Chiaromonte; Rayan Chikhi; Francesca D. Ciccarelli; Robin Cijvat; Erwin Datema; Cornelia M. van Duijn; Evan E. Eichler; Corinna Ernst; Eleazar Eskin; Erik Garrison; Mohammed El-Kebir; Gunnar W. Klau

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.


Journal of Computational Biology | 2015

WhatsHap: weighted haplotype assembly for future-generation sequencing reads

Murray Patterson; Tobias Marschall; Nadia Pisanti; Leo van Iersel; Leen Stougie; Gunnar W. Klau; Alexander Schönhuth

The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
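
The weighted minimum error correction (wMEC) objective named in the abstract above can be illustrated on a toy instance. The sketch below solves it by brute force over all bipartitions of the reads, which is exponential in the number of reads and is not the fixed-parameter tractable algorithm of the paper. The read encoding (a dict from SNP position to an `(allele, weight)` pair), the function names, and the example reads are assumptions for illustration.

```python
from itertools import product


def side_cost(reads):
    """Cheapest total weight of allele flips that makes all given reads agree."""
    weight = {}  # position -> [total weight of allele 0, total weight of allele 1]
    for read in reads:
        for pos, (allele, w) in read.items():
            weight.setdefault(pos, [0, 0])[allele] += w
    # At each position, flip whichever allele carries less total weight.
    return sum(min(w0, w1) for w0, w1 in weight.values())


def weighted_mec(reads):
    """Return (cost, assignment) minimising wMEC by brute force.

    Assumes at least one read; assignment[i] is the haplotype (0 or 1) of read i.
    """
    best = None
    # Fix the first read on side 0 to skip mirror-image bipartitions.
    for rest in product((0, 1), repeat=len(reads) - 1):
        assignment = (0,) + rest
        cost = sum(
            side_cost([r for r, s in zip(reads, assignment) if s == side])
            for side in (0, 1)
        )
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    # Hypothetical reads over three SNP positions; the low weight on the last
    # entry models a base call with low confidence.
    reads = [
        {0: (0, 10), 1: (0, 10)},
        {0: (0, 10), 1: (0, 10), 2: (1, 10)},
        {1: (1, 10), 2: (0, 10)},
        {1: (1, 10), 2: (1, 3)},
    ]
    print(weighted_mec(reads))
```

In this toy instance the cheapest correction flips only the low-weight allele, which shows how the weights steer the solution toward correcting the least trusted base calls.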


Combinatorial Pattern Matching | 2008

Probabilistic Arithmetic Automata and Their Application to Pattern Matching Statistics

Tobias Marschall; Sven Rahmann

We present probabilistic arithmetic automata (PAAs), which can be used to model chains of operations whose operands depend on chance. We provide two different algorithms to exactly calculate the distribution of the results obtained by such probabilistic calculations. Although we introduce PAAs and the corresponding algorithms in a generic manner, our main concern is their application to pattern matching statistics, i.e. we study the distributions of the number of occurrences of a pattern under a given text model. Such calculations play an important role in computational biology as they give access to the significance of pattern occurrences. To assess the practicability of our method, we apply it to the Prosite database of amino acid motifs and to the Jaspar database of transcription factor binding sites. Regarding the latter, we additionally show that our framework makes it possible to take binding affinities predicted from a physical model into account.
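
The pattern matching application described above can be sketched for the simplest case: a single string pattern and an i.i.d. text model. The code below tracks a joint distribution over (automaton state, occurrence count) and updates it one text character at a time, which is one way to read the idea of an automaton whose transitions trigger arithmetic operations (here, adding 1 on each match). It is a simplified illustration under these assumptions, not the general PAA framework or either of the paper's two algorithms; the function names are hypothetical.

```python
from collections import defaultdict


def build_automaton(pattern, alphabet):
    """delta[state][ch]: length of the longest pattern prefix that is a suffix
    of the text read so far, after reading ch in the given state."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, len(candidate))
            while k > 0 and pattern[:k] != candidate[-k:]:
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def occurrence_distribution(pattern, char_probs, text_length):
    """Exact distribution of the number of (overlapping) pattern occurrences
    in a random text of the given length with i.i.d. characters."""
    delta = build_automaton(pattern, list(char_probs))
    m = len(pattern)
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(text_length):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for ch, q in char_probs.items():
                new_state = delta[state][ch]
                # Reaching state m means an occurrence ends at this position.
                new_count = count + (1 if new_state == m else 0)
                nxt[(new_state, new_count)] += p * q
        dist = nxt
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


if __name__ == "__main__":
    # Illustrative two-letter alphabet with uniform character probabilities.
    print(occurrence_distribution("AA", {"A": 0.5, "B": 0.5}, 3))
```

The tail of such a distribution gives the probability of seeing at least as many occurrences as observed, which is how these calculations give access to the significance of pattern occurrences.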

Collaboration


Top Co-Authors

Sven Rahmann

University of Duisburg-Essen

Marcel Martin

Technical University of Dortmund

Kathrin Trappe

Free University of Berlin