Roland Wittler | Researchain

Publication

Featured research published by Roland Wittler.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2009

A Unified Approach for Reconstructing Ancient Gene Clusters

Jens Stoye; Roland Wittler

The order of genes in genomes provides extensive information. In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to identify gene clusters—groups of genes that are colocated in a set of genomes. We introduce a unified approach to model gene clusters and define the problem of labeling the inner nodes of a given phylogenetic tree with sets of gene clusters. Our optimization criterion in this context combines two properties: parsimony, i.e., the number of gains and losses of gene clusters has to be minimal, and consistency, i.e., for each ancestral node, there must exist at least one potential gene order that contains all the reconstructed clusters. We present and evaluate an exact algorithm to solve this problem. Despite its exponential worst-case time complexity, our method is suitable even for large-scale data. We show the effectiveness and efficiency on both simulated and real data.
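The parsimony half of the criterion above can be illustrated for a single gene cluster treated as a present/absent character. The following is a minimal sketch using Fitch's bottom-up pass on a binary tree; it is not the authors' algorithm, which also enforces consistency across all clusters at each ancestral node. The tree encoding and the names `fitch`, `tree`, and `leaf_states` are assumptions made for illustration.

```python
def fitch(tree, leaf_states, root):
    """Bottom-up Fitch pass for one binary character.

    tree: dict mapping each inner node to its (left, right) children.
    leaf_states: dict mapping each leaf to 0 (cluster absent) or 1 (present).
    Returns the candidate state sets per node and the minimum number
    of gain/loss events needed on the tree.
    """
    sets = {}
    changes = 0

    def up(node):
        nonlocal changes
        if node in leaf_states:
            sets[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        up(left)
        up(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common
        else:
            # Children disagree: one gain or loss is unavoidable here.
            sets[node] = sets[left] | sets[right]
            changes += 1

    up(root)
    return sets, changes


# Hypothetical toy tree: ((A, B), C), cluster present in A and B only.
tree = {"root": ("x", "C"), "x": ("A", "B")}
leaf_states = {"A": 1, "B": 1, "C": 0}
sets, changes = fitch(tree, leaf_states, "root")
```

In this toy example the cluster is reconstructed as present at the ancestor of A and B, and one gain or loss suffices for the whole tree.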

Algorithms for Molecular Biology | 2016

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Guillaume Holley; Roland Wittler; Jens Stoye

Background: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices.
Results: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom Filter Trie was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.
Conclusion: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure.
Availability: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
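To make the notion of colored k-mers concrete, here is a minimal sketch that maps every k-mer to the set of genomes (colors) it occurs in. It uses a plain dictionary rather than the Bloom Filter Trie, ignores reverse complements, and the function name `colored_kmers` and the toy sequences are assumptions for illustration only.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) containing it.

    genomes: dict mapping a genome name to its sequence string.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


# Hypothetical toy pan-genome with two very similar sequences.
genomes = {"g1": "ACGTA", "g2": "CGTAC"}
kmers = colored_kmers(genomes, 3)
```

Here the k-mers CGT and GTA carry both colors, while ACG and TAC are each specific to one genome. This redundancy between similar genomes is what a compressed index can exploit.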

Research in Computational Molecular Biology | 2010

Consistency of sequence-based gene clusters

Roland Wittler; Jens Stoye

In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to specify gene clusters--groups of genes that are co-located in a set of genomes. Several approaches have been proposed to reconstruct putative ancestral gene clusters based on the gene order of contemporary species. One prevalent and natural reconstruction criterion is consistency: For a set of reconstructed gene clusters, there should exist a gene order that comprises all given clusters. For permutation-based gene cluster models, efficient methods exist to verify this condition. In this article, we discuss the consistency problem for different gene cluster models on sequences with restricted gene multiplicities. Our results range from linear-time algorithms for the simple model of adjacencies to NP-completeness proofs for more complex models like common intervals.
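For the simplest setting, the consistency check can be sketched in a few lines. The example below assumes the permutation case (each gene occurs exactly once), unsigned genes, and a linear gene order; it does not cover the sequence-based models with gene multiplicities that the article studies. Under these assumptions a set of adjacencies is consistent exactly when the adjacency graph is a collection of paths, that is, no gene has more than two neighbors and no cycle is formed. The function name is hypothetical.

```python
def adjacencies_consistent(adjacencies):
    """Check whether some linear gene order contains all given adjacencies.

    adjacencies: iterable of gene pairs (a, b), unsigned, each gene unique.
    Uses union-find to detect cycles, so the running time is near-linear.
    """
    parent, degree = {}, {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in {frozenset(p) for p in adjacencies}:
        if len(pair) != 2:
            return False  # a gene cannot be adjacent to itself here
        a, b = pair
        for gene in (a, b):
            degree[gene] = degree.get(gene, 0) + 1
            if degree[gene] > 2:
                return False  # more than two neighbors in a linear order
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this adjacency would close a cycle
        parent[root_a] = root_b
    return True
```

For instance, the adjacencies (1, 2) and (2, 3) are consistent, whereas adding (2, 4) gives gene 2 three neighbors, and adding (3, 1) closes a cycle.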

Models and Algorithms for Genome Evolution | 2013

The Potential of Family-Free Genome Comparison

Marília D. V. Braga; Cedric Chauve; Daniel Doerr; Katharina Jahn; Jens Stoye; Annelyse Thévenin; Roland Wittler

Many methods in computational comparative genomics require gene family assignments as a prerequisite. While the biological concept of gene families is well established, their computational prediction remains unreliable. This paper continues a new line of research in which family assignments are not presumed. We study the potential of several family-free approaches in detecting conserved structures, genome rearrangements and in reconstructing ancestral gene orders.

Workshop on Algorithms in Bioinformatics | 2015

Bloom Filter Trie – A Data Structure for Pan-Genome Storage

Guillaume Holley; Roland Wittler; Jens Stoye

High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de-Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie. The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Experimental results prove better performance compared to another state-of-the-art data structure.
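As background for the name, a Bloom filter is a compact probabilistic set: membership queries never miss an inserted item but may occasionally report an item that was never added. The sketch below is a generic textbook Bloom filter, not the Bloom Filter Trie itself; the class layout, the SHA-256 based hashing, and the parameter values are assumptions chosen for readability rather than speed.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter over strings (illustrative, not the BFT)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if all positions are set; false positives are possible.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical usage: remember which k-mers have been seen.
seen = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in ("ACG", "CGT", "GTA"):
    seen.add(kmer)
```

A query such as `"CGT" in seen` is guaranteed to return True after insertion. A query for a k-mer that was never added usually returns False, with a false-positive rate that depends on the filter size and the number of hash functions.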

Combinatorial Pattern Matching | 2011

Tractability results for the consecutive-ones property with multiplicity

Cedric Chauve; Ján Maňuch; Murray Patterson; Roland Wittler

A binary matrix has the Consecutive-Ones Property (C1P) if its columns can be ordered in such a way that all 1s in each row are consecutive. We consider here a variant of the C1P where columns can appear multiple times in the ordering. Although the general problem of deciding the C1P with multiplicity is NP-complete, we present here a case of interest in comparative genomics that is tractable.
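The definition of the standard C1P (without multiplicity) can be checked directly on small matrices by trying every column order. The following is a brute-force sketch for illustration only: it takes factorial time, whereas linear-time tests based on PQ-trees are known for the standard problem. It does not handle the variant with repeated columns discussed in the paper, and the function names are hypothetical.

```python
from itertools import permutations


def rows_consecutive(matrix, order):
    """True if, under the given column order, the 1s of every row are contiguous."""
    for row in matrix:
        ones = [i for i, col in enumerate(order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def has_c1p(matrix):
    """Brute-force test of the Consecutive-Ones Property (small inputs only)."""
    n_cols = len(matrix[0]) if matrix else 0
    return any(
        rows_consecutive(matrix, order)
        for order in permutations(range(n_cols))
    )


# Columns can be ordered as 0, 2, 1 so that both rows are contiguous.
ok = has_c1p([[1, 0, 1],
              [0, 1, 1]])

# Every pair of columns must be adjacent, which no linear order allows.
bad = has_c1p([[1, 1, 0],
               [0, 1, 1],
               [1, 0, 1]])
```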

BMC Genomics | 2013

Unraveling overlapping deletions by agglomerative clustering

Roland Wittler

Background: Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping allow to simultaneously analyze data from several samples in order to, e.g., distinguish tumor from patient specific variations. However, it has been shown that, especially in this setting, there is a need to explicitly take overlapping deletions into consideration. Existing tools have only minor capabilities to call overlapping deletions, unable to unravel complex signals to obtain consistent predictions.
Result: We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions on the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings to clusters based on a similarity score that takes both the putative location and size of a deletion into account.
Conclusion: We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side-effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows, compared to other methods which can output overlapping clusters, high accuracy in separating overlapping from single deletions.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2017

The SCJ Small Parsimony Problem for Weighted Gene Adjacencies

Nina Luhmann; Manuel Lafond; Annelyse Thévenin; Aïda Ouangraoua; Roland Wittler; Cedric Chauve

Reconstructing ancestral gene orders in a given phylogeny is a classical problem in comparative genomics. Most existing methods compare conserved features in extant genomes in the phylogeny to define potential ancestral gene adjacencies, and either try to reconstruct all ancestral genomes under a global evolutionary parsimony criterion, or, focusing on a single ancestral genome, use a scaffolding approach to select a subset of ancestral gene adjacencies, generally aiming at reducing the fragmentation of the reconstructed ancestral genome. In this paper, we describe an exact algorithm for the Small Parsimony Problem that combines both approaches. We consider that gene adjacencies at internal nodes of the species phylogeny are weighted, and we introduce an objective function defined as a convex combination of these weights and the evolutionary cost under the Single-Cut-or-Join (SCJ) model. The weights of ancestral gene adjacencies can, e.g., be obtained through the recent availability of ancient DNA sequencing data, which provide a direct hint at the genome structure of the considered ancestor, or through probabilistic analysis of gene adjacencies evolution. We show the NP-hardness of our problem variant and propose a Fixed-Parameter Tractable algorithm based on the Sankoff-Rousseau dynamic programming algorithm that also allows to sample co-optimal solutions. We apply our approach to mammalian and bacterial data providing different degrees of complexity. We show that including adjacency weights in the objective has a significant impact in reducing the fragmentation of the reconstructed ancestral gene orders. An implementation is available at http://github.com/nluhmann/PhySca.
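The evolutionary cost term mentioned above can be illustrated with the standard Single-Cut-or-Join distance between two genomes given as sets of adjacencies: every adjacency present in only one genome costs one cut or one join. This sketch covers only that distance, not the weighted objective or the algorithm of the paper; the extremity labels such as "1h" (head of gene 1) and "2t" (tail of gene 2) and the function name are assumptions for illustration.

```python
def scj_distance(adjacencies_a, adjacencies_b):
    """SCJ distance between two genomes given as adjacency collections.

    Each adjacency is an unordered pair of gene extremities. Adjacencies
    only in A must be cut, adjacencies only in B must be joined.
    """
    a = {frozenset(adj) for adj in adjacencies_a}
    b = {frozenset(adj) for adj in adjacencies_b}
    return len(a - b) + len(b - a)


# Hypothetical genomes over genes 1, 2, 3; gene 3 is inverted in B.
genome_a = [("1h", "2t"), ("2h", "3t")]
genome_b = [("1h", "2t"), ("2h", "3h")]
distance = scj_distance(genome_a, genome_b)
```

In this toy case one adjacency is cut and one is joined, so the distance is 2. The shared adjacency contributes nothing, regardless of the order in which its two extremities are written.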

Brazilian Symposium on Bioinformatics | 2014

Scaffolding of Ancient Contigs and Ancestral Reconstruction in a Phylogenetic Framework

Nina Luhmann; Cedric Chauve; Jens Stoye; Roland Wittler

Ancestral genome reconstruction is an important step in analyzing the evolution of genomes. Recent progress in sequencing ancient DNA led to the publication of so-called paleogenomes and allows the integration of this sequencing data in genome evolution analysis. However, the assembly of ancient genomes is fragmented because of DNA degradation over time. Integrated phylogenetic assembly addresses the issue of genome fragmentation in the ancient DNA assembly while improving the reconstruction of all ancient genomes in the phylogeny. The fragmented assembly of the ancient genome can be represented as an assembly graph, indicating contradicting ordering information of contigs.

Research in Computational Molecular Biology | 2017

Dynamic Alignment-Free and Reference-Free Read Compression

Guillaume Holley; Roland Wittler; Jens Stoye; Faraz Hach

The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.
