Kassian Kobert

Heidelberg Institute for Theoretical Studies

Publication

Featured research published by Kassian Kobert.

Bioinformatics | 2014

PEAR: a fast and accurate Illumina Paired-End reAd mergeR

Jiajie Zhang; Kassian Kobert; Tomáš Flouri; Alexandros Stamatakis

Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths. Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR. Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.
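The idea of evaluating every candidate overlap can be illustrated with a minimal sketch. This is not PEAR's actual implementation: the function names, the +1/-1 scoring, and the `min_overlap` default are assumptions made for illustration, and quality scores, the statistical test, and fragments shorter than a single read are all left out.

```python
# Hypothetical helper names; the scoring scheme is an assumption, not PEAR's.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 3):
    """Try every suffix/prefix overlap and merge at the best-scoring one.

    Returns (merged_sequence, overlap_length), or (None, 0) when no
    overlap scores above zero.
    """
    rc = reverse_complement(reverse)  # bring both reads onto the same strand
    best_score, best_len = 0, 0
    for length in range(min_overlap, min(len(forward), len(rc)) + 1):
        # Compare the last `length` bases of the forward read with the
        # first `length` bases of the reverse-complemented reverse read.
        score = sum(
            1 if a == b else -1
            for a, b in zip(forward[-length:], rc[:length])
        )
        if score > best_score:
            best_score, best_len = score, length
    if best_len == 0:
        return None, 0
    return forward + rc[best_len:], best_len
```

Because no fragment length is passed in, the overlap is chosen purely from the reads themselves, which mirrors the property the abstract highlights.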

Molecular Biology and Evolution | 2014

ExaBayes: Massively Parallel Bayesian Tree Inference for the Whole-Genome Era

Andre J. Aberer; Kassian Kobert; Alexandros Stamatakis

Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions and memory saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.

Bioinformatics | 2017

Multi-rate Poisson tree processes for single-locus species delimitation under maximum likelihood and Markov chain Monte Carlo

Paschalia Kapli; Sarah Lutteropp; Jiajie Zhang; Kassian Kobert; Pavlos Pavlidis; Alexandros Stamatakis; Tomas Flouri

Motivation: In recent years, molecular species delimitation has become a routine approach for quantifying and classifying biodiversity. Barcoding methods are of particular importance in large-scale surveys as they promote fast species discovery and biodiversity estimates. Among those, distance-based methods are the most common choice as they scale well with large datasets; however, they are sensitive to similarity threshold parameters and they ignore evolutionary relationships. The recently introduced “Poisson Tree Processes” (PTP) method is a phylogeny-aware approach that does not rely on such thresholds. Yet, two weaknesses of PTP impact its accuracy and practicality when applied to large datasets: it does not account for divergent intraspecific variation and is slow for a large number of sequences. Results: We introduce the multi-rate PTP (mPTP), an improved method that alleviates the theoretical and technical shortcomings of PTP. It incorporates different levels of intraspecific genetic diversity deriving from differences in either the evolutionary history or sampling of each species. Results on empirical data suggest that mPTP is superior to PTP and popular distance-based methods as it consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy). Moreover, mPTP does not require any similarity threshold as input. The novel dynamic programming algorithm attains a speedup of at least five orders of magnitude compared to PTP, allowing it to delimit species in large (meta-)barcoding data. In addition, Markov Chain Monte Carlo sampling provides a comprehensive evaluation of the inferred delimitation in just a few seconds for millions of steps, independently of tree size. Availability and Implementation: mPTP is implemented in C and is available for download at http://github.com/Pas-Kapli/mptp under the GNU Affero 3 license. A web-service is available at http://mptp.h-its.org. Contact: paschalia.kapli@h-its.org or alexandros.stamatakis@h-its.org or tomas.flouri@h-its.org Supplementary information: Supplementary data are available at Bioinformatics online.

Molecular Biology and Evolution | 2016

Computing the Internode Certainty and Related Measures from Partial Gene Trees

Kassian Kobert; Leonidas Salichos; Antonis Rokas; Alexandros Stamatakis

We present, implement, and evaluate an approach to calculate the internode certainty (IC) and tree certainty (TC) on a given reference tree from a collection of partial gene trees. Previously, the calculation of these values was only possible from a collection of gene trees with exactly the same taxon set as the reference tree. An application to sets of partial gene trees requires mathematical corrections in the IC and TC calculations. We implement our methods in RAxML and test them on empirical datasets. These tests imply that the inclusion of partial trees does matter. However, in order to provide meaningful measurements, any dataset should also include trees containing the full species set.
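For orientation, the internode certainty in the complete-tree setting (as defined by Salichos and Rokas) is commonly written as IC = 1 + p1 log2 p1 + p2 log2 p2, where p1 and p2 are the relative frequencies of a bipartition and of its most prevalent conflicting bipartition. The sketch below evaluates only this baseline quantity. It does not include the corrections for partial gene trees that the paper introduces, and the function name and argument names are hypothetical.

```python
import math


def internode_certainty(support: int, conflict: int) -> float:
    """Baseline IC from the counts of two conflicting bipartitions.

    `support` is how often the reference bipartition occurs in the gene
    trees, `conflict` how often its most prevalent conflicting
    bipartition occurs. Partial-tree corrections are NOT applied here.
    """
    total = support + conflict
    if total == 0:
        raise ValueError("at least one bipartition must be observed")
    ic = 1.0  # log2(2): maximum entropy of a two-outcome distribution
    for count in (support, conflict):
        if count > 0:  # treat 0 * log2(0) as 0
            p = count / total
            ic += p * math.log2(p)
    return ic
```

A value of 1 means no conflict was observed, while a value of 0 means both bipartitions are equally frequent.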

Information Processing Letters | 2015

Longest common substrings with k mismatches

Tomáš Flouri; Emanuele Giaquinta; Kassian Kobert; Esko Ukkonen

The longest common substring with k-mismatches problem is to find, given two strings S1 and S2, a longest substring A1 of S1 and A2 of S2 such that the Hamming distance between A1 and A2 is ≤ k. We introduce a practical O(nm) time and O(1) space solution for this problem, where n and m are the lengths of S1 and S2, respectively. This algorithm can also be used to compute the matching statistics with k-mismatches of S1 and S2 in O(nm) time and O(m) space. Moreover, we also present a theoretical solution for the k = 1 case which runs in O(n log m) time, assuming m ≤ n, and uses O(m) space, improving over the existing O(nm) time and O(m) space bound of Babenko and Starikovskaya [1].

Two new algorithms for the longest common substring with k mismatches problem. A practical solution for arbitrary k which uses constant space. A theoretical solution for one mismatch which runs in quasilinear time.
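One well-known way to reach quadratic time with constant extra space is to slide a two-pointer window along each diagonal of the implicit comparison matrix. The sketch below illustrates that idea; it is an assumption that this resembles the practical algorithm in the paper, and the function name and return convention are made up for the example.

```python
def longest_common_substring_k_mismatches(s1: str, s2: str, k: int):
    """Return (length, start_in_s1, start_in_s2) of a longest pair of
    equal-length substrings whose Hamming distance is at most k."""
    n, m = len(s1), len(s2)
    best = (0, 0, 0)
    # Diagonal d pairs position i of s1 with position i + d of s2.
    for d in range(-(n - 1), m):
        start = max(0, -d)      # first valid i on this diagonal
        end = min(n, m - d)     # one past the last valid i
        left = start
        mismatches = 0
        for right in range(start, end):
            if s1[right] != s2[right + d]:
                mismatches += 1
            # Shrink the window from the left until it holds at most k
            # mismatches again.
            while mismatches > k:
                if s1[left] != s2[left + d]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, left, left + d)
    return best
```

Each of the n + m - 1 diagonals is scanned once by both pointers, so the total work is proportional to nm, and only a handful of integer variables are kept besides the input strings.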

Workshop on Algorithms in Bioinformatics | 2014

The Divisible Load Balance Problem and Its Application to Phylogenetic Inference

Kassian Kobert; Tomáš Flouri; Andre J. Aberer; Alexandros Stamatakis

Motivated by load balance issues in parallel calculations of the phylogenetic likelihood function, we address the problem of distributing divisible items to a given number of bins. The task is to balance the overall sum of (fractional) item sizes per bin, while keeping the maximum number of unique elements in any bin to a minimum. We show that this problem is NP-hard and give a polynomial time approximation algorithm that yields a solution where the sums of (possibly fractional) item sizes are balanced across bins. Moreover, the maximum number of unique elements in the bins is guaranteed to exceed the optimal solution by at most one element. We implement the algorithm in two production-level parallel codes for large-scale likelihood-based phylogenetic inference: ExaML and ExaBayes. For ExaML, we observe best-case runtime improvements of up to a factor of 5.9 compared to the previously implemented data distribution algorithms.

Systematic Biology | 2016

Efficient Detection of Repeating Sites to Accelerate Phylogenetic Likelihood Calculations

Kassian Kobert; Alexandros Stamatakis; Tomas Flouri

The phylogenetic likelihood function (PLF) is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection, and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory savings attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 12-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the PLF currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.
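To make the notion of repeating sites concrete, the following sketch labels every alignment site at every node of a rooted binary tree with a class identifier, so that two sites share an identifier exactly when the subtree below that node sees identical characters at both sites. It uses plain dictionaries for clarity and is not the authors' data structure; the tree encoding (leaf names as strings, inner nodes as pairs) and the function name are assumptions.

```python
def site_classes(node, alignment, result):
    """Fill `result[node]` with one class identifier per alignment site.

    `node` is either a leaf name (str) or a (left, right) tuple.
    `alignment` maps leaf names to equal-length sequences.
    """
    ids = {}
    if isinstance(node, str):
        # At a leaf, sites repeat when they show the same character.
        keys = alignment[node]
    else:
        left = site_classes(node[0], alignment, result)
        right = site_classes(node[1], alignment, result)
        # At an inner node, sites repeat when both children repeat.
        keys = zip(left, right)
    # New keys receive the next free identifier; known keys reuse theirs.
    classes = [ids.setdefault(key, len(ids)) for key in keys]
    result[node] = classes
    return classes


alignment = {"A": "ACAC", "B": "GGGG", "C": "TTAT"}
tree = (("A", "B"), "C")
classes_per_node = {}
site_classes(tree, alignment, classes_per_node)
```

In this toy alignment the inner node `("A", "B")` has only two distinct classes for four sites, so a likelihood kernel that reuses results per class would compute two conditional likelihood entries there instead of four.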

Philosophical Transactions of the Royal Society A | 2014

An optimal algorithm for computing all subtree repeats in trees

Tomáš Flouri; Kassian Kobert; Solon P. Pissis; Alexandros Stamatakis

Given a labelled tree T, our goal is to group repeating subtrees of T into equivalence classes with respect to their topologies and the node labels. We present an explicit, simple and time-optimal algorithm for solving this problem for unrooted unordered labelled trees and show that the running time of our method is linear with respect to the size of T. By unordered, we mean that the order of the adjacent nodes (children/neighbours) of any node of T is irrelevant. An unrooted tree T does not have a node that is designated as root and can also be referred to as an undirected tree. We show how the presented algorithm can easily be modified to operate on trees that do not satisfy some or any of the aforementioned assumptions on the tree structure; for instance, how it can be applied to rooted, ordered or unlabelled trees.
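As a small illustration of the grouping task, the sketch below handles the rooted, unordered, labelled variant by giving each node a signature built from its label and the sorted class identifiers of its children. Because it relies on sorting and hashing, it is not the linear-time algorithm of the paper; the input encoding and function name are assumptions for the example.

```python
from collections import defaultdict


def group_subtree_repeats(labels, children, root):
    """Group nodes whose rooted subtrees are identical up to child order.

    `labels` maps node -> label, `children` maps node -> list of child
    nodes (leaves may be absent). Returns class_id -> sorted node list
    for every class that occurs more than once.
    """
    # Collect nodes so that every node precedes its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    signature_ids = {}
    node_class = {}
    # Visit descendants before ancestors.
    for node in reversed(order):
        child_classes = sorted(node_class[c] for c in children.get(node, []))
        signature = (labels[node], tuple(child_classes))
        node_class[node] = signature_ids.setdefault(signature, len(signature_ids))

    groups = defaultdict(list)
    for node, cls in node_class.items():
        groups[cls].append(node)
    return {cls: sorted(nodes) for cls, nodes in groups.items() if len(nodes) > 1}


labels = {0: "r", 1: "a", 2: "a", 3: "x", 4: "y", 5: "y", 6: "x"}
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
repeats = group_subtree_repeats(labels, children, root=0)
```

Here nodes 1 and 2 fall into the same class even though their children appear in a different order, which is what "unordered" means in the abstract.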

International Workshop on Combinatorial Algorithms | 2013

An Optimal Algorithm for Computing All Subtree Repeats in Trees

Tomáš Flouri; Kassian Kobert; Solon P. Pissis; Alexandros Stamatakis

Given a labeled tree T, our goal is to group repeating subtrees of T into equivalence classes with respect to their topologies and the node labels. We present an explicit, simple, and time-optimal algorithm for solving this problem for unrooted unordered labeled trees, and show that the running time of our method is linear with respect to the size of T. By unordered, we mean that the order of the adjacent nodes (children/neighbors) of any node of T is irrelevant. An unrooted tree T does not have a node that is designated as root and can also be referred to as an undirected tree. We show how the presented algorithm can easily be modified to operate on trees that do not satisfy some or any of the aforementioned assumptions on the tree structure; for instance, how it can be applied to rooted, ordered or unlabeled trees.

bioRxiv | 2015

Are all global alignment algorithms and implementations correct?

Tomáš Flouri; Kassian Kobert; Torbjørn Rognes; Alexandros Stamatakis

Pairwise sequence alignment is perhaps the most fundamental bioinformatics operation. An optimal global alignment algorithm was described in 1970 by Needleman and Wunsch. In 1982 Gotoh presented an improved algorithm with lower time complexity. Gotoh’s algorithm is frequently cited (1447 citations, Google Scholar, May 2015), taught and, most importantly, used as well as implemented. While implementing the algorithm, we discovered two mathematical mistakes in Gotoh’s paper that induce sub-optimal sequence alignments. First, there are minor indexing mistakes in the dynamic programming algorithm which become apparent immediately when implementing the procedure. Hence, we report on these for the sake of completeness. Second, there is a more profound problem with the dynamic programming matrix initialization. This initialization issue can easily be missed and find its way into actual implementations. This error is also present in standard textbooks, namely the widely used books by Gusfield and Waterman. To obtain an initial estimate of the extent to which this error has been propagated, we scrutinized freely available undergraduate lecture slides. We found that 8 out of 31 lecture slides contained the mistake, while 16 out of 31 simply omitted parts of the initialization, thus giving an incomplete description of the algorithm. Finally, by inspecting ten source codes and running respective tests, we found that five implementations were incorrect. Note that not all bugs we identified are due to the mistake in Gotoh’s paper. Three implementations rely on additional constraints that limit generality. Thus, only two out of ten yield correct results. We show that the error introduced by Gotoh is straightforward to resolve and provide a correct open-source reference implementation. We do believe, though, that raising awareness about these errors is critical, since incorrect pairwise sequence alignments, which typically represent one of the very first stages in any bioinformatics data analysis pipeline, can have a detrimental impact on downstream analyses such as multiple sequence alignment, orthology assignment, phylogenetic analyses, divergence time estimates, etc.
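Since the abstract centres on how the dynamic programming matrices are initialized, a compact sketch of global alignment scoring with affine gaps may help. It follows the standard three-state textbook formulation with every border cell set explicitly, and it is not the authors' reference implementation. The function name, the default scores, and the convention that a gap of length L costs `gap_open + L * gap_extend` are assumptions.

```python
NEG_INF = float("-inf")


def global_affine_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Optimal global alignment score of `a` and `b` with affine gaps."""
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap
    # Y: last column aligns a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    # Border initialization: only the empty alignment and pure-gap
    # prefixes are reachable; every other border state stays at -inf.
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap pays gap_extend only; starting
            # one from either of the other states also pays gap_open.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          Y[i - 1][j] - gap_open,
                          X[i - 1][j]) - gap_extend
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          X[i][j - 1] - gap_open,
                          Y[i][j - 1]) - gap_extend
    return max(M[n][m], X[n][m], Y[n][m])
```

Marking unreachable border states with negative infinity, rather than zero or a finite gap cost, keeps the recurrence from silently starting an alignment in a state that no valid alignment prefix can produce.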
