Simon A. Berger

Heidelberg Institute for Theoretical Studies

Publication

Featured research published by Simon A. Berger.

Systematic Biology | 2011

Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood

Simon A. Berger; Denis Krompass; Alexandros Stamatakis

We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
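The pairwise baseline that the abstract compares against (assigning a read to its closest reference by edit distance) can be illustrated with a minimal sketch. This is not the EPA and not the authors' code: the function names, the toy sequences, and the use of plain Levenshtein distance against whole reference sequences are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def nearest_reference(read: str, references: dict[str, str]) -> str:
    """Return the name of the reference with the smallest edit distance."""
    return min(references, key=lambda name: edit_distance(read, references[name]))


# Hypothetical toy data, not taken from the paper.
refs = {"taxon_A": "ACGTACGT", "taxon_B": "ACGTTCGA", "taxon_C": "TTGTACCA"}
best = nearest_reference("ACGTTCCA", refs)
```

Such a baseline picks a single closest sequence, whereas the placement approach described above assigns reads to edges of the reference tree.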

Nature Methods | 2013

Metagenomic species profiling using universal phylogenetic marker genes

Shinichi Sunagawa; Daniel R. Mende; Georg Zeller; Fernando Izquierdo-Carrasco; Simon A. Berger; Jens Roat Kultima; Luis Pedro Coelho; Manimozhiyan Arumugam; Julien Tap; Henrik Bjørn Nielsen; Simon Rasmussen; Søren Brunak; Oluf Pedersen; Francisco Guarner; Willem M. de Vos; Jun Wang; Junhua Li; Joël Doré; S. Dusko Ehrlich; Alexandros Stamatakis; Peer Bork

To quantify known and unknown microorganisms at species-level resolution using shotgun sequencing data, we developed a method that establishes metagenomic operational taxonomic units (mOTUs) based on single-copy phylogenetic marker genes. Applied to 252 human fecal samples, the method revealed that on average 43% of the species abundance and 58% of the richness cannot be captured by current reference genome–based methods. An implementation of the method is available at http://www.bork.embl.de/software/mOTU/.

Bioinformatics | 2011

Aligning short reads to reference alignments and trees

Simon A. Berger; Alexandros Stamatakis

Motivation: Likelihood-based methods for placing short read sequences from metagenomic samples into reference phylogenies have been recently introduced. At present, it is unclear how to align those reads with respect to the reference alignment that was deployed to infer the reference phylogeny. Moreover, the adaptability of such alignment methods with respect to the underlying reference alignment strategies/philosophies has not been explored. It has also not been assessed if the reference phylogeny can be deployed in conjunction with the reference alignment to improve alignment accuracy in this context. Results: We assess different strategies for short read alignment and propose a novel phylogeny-aware alignment procedure. Our alignment method can improve the accuracy of subsequent phylogenetic placement of the reads into a reference phylogeny by up to 5.8 times compared with phylogeny-agnostic methods. It can be deployed to align reads to alignments generated by using fundamentally different alignment strategies (e.g. PRANK(+F) versus MUSCLE). Availability: http://www.exelixis-lab.org/software.html

Bioinformatics | 2012

RAxML-Light

Alexandros Stamatakis; Andre J. Aberer; Christian Goll; Stephen A. Smith; Simon A. Berger; Fernando Izquierdo-Carrasco

Motivation: Due to advances in molecular sequencing and the increasingly rapid collection of molecular data, the field of phyloinformatics is transforming into a computational science. Therefore, new tools are required that can be deployed in supercomputing environments and that scale to hundreds or thousands of cores. Results: We describe RAxML-Light, a tool for large-scale phylogenetic inference on supercomputers under maximum likelihood. It implements a light-weight checkpointing mechanism, deploys 128-bit (SSE3) and 256-bit (AVX) vector intrinsics, offers two orthogonal memory saving techniques and provides a fine-grain production-level message passing interface parallelization of the likelihood function. To demonstrate scalability and robustness of the code, we inferred a phylogeny on a simulated DNA alignment (1481 taxa, 20 000 000 bp) using 672 cores. This dataset requires one terabyte of RAM to compute the likelihood score on a single tree. Code Availability: https://github.com/stamatak/RAxML-Light-1.0.5 Data Availability: http://www.exelixis-lab.org/onLineMaterial.tar.bz2 Supplementary Information: Supplementary data are available at Bioinformatics online.

Molecular Biology and Evolution | 2014

Placing environmental next generation sequencing amplicons from microbial eukaryotes into a phylogenetic context

Micah Dunthorn; Johannes Otto; Simon A. Berger; Alexandros Stamatakis; Frédéric Mahé; Sarah Romac; Colomban de Vargas; Stéphane Audic; Alexandra Stock; Frank Kauff; Thorsten Stoeck

Nucleotide positions in the hypervariable V4 and V9 regions of the small subunit (SSU)-rDNA locus are normally difficult to align and are usually removed before standard phylogenetic analyses. Yet, with next-generation sequencing data, amplicons of these regions are all that are available to answer ecological and evolutionary questions that rely on phylogenetic inferences. With ciliates, we asked how inclusion of the V4 or V9 regions, regardless of alignment quality, affects tree topologies using distinct phylogenetic methods (including PairDist that is introduced here). Results show that the best approach is to place V4 amplicons into an alignment of full-length Sanger SSU-rDNA sequences and to infer the phylogenetic tree with RAxML. A sliding window algorithm as implemented in RAxML shows, though, that not all nucleotide positions in the V4 region are better than V9 at inferring the ciliate tree. With this approach and an ancestral-state reconstruction, we use V4 amplicons from European nearshore sampling sites to infer that rather than being primarily terrestrial and freshwater, colpodean ciliates may have repeatedly transitioned from terrestrial/freshwater to marine environments.

Computer and Information Technology | 2010

Efficient PC-FPGA Communication over Gigabit Ethernet

Nikolaos Alachiotis; Simon A. Berger; Alexandros Stamatakis

As FPGAs become larger and more powerful, they are increasingly used as accelerator devices for compute-intensive functions. Input/Output (I/O) speeds can become a bottleneck and directly affect the performance of a reconfigurable accelerator since the chip will idle when there are no data available. While PCI Express currently represents the fastest and most expensive solution for connecting an FPGA to a general-purpose CPU, there exist several applications with I/O requirements for which Gigabit Ethernet is sufficient. To this end, we present the design of an efficient UDP/IP core for PC-FPGA communication that has been designed to occupy a minimum amount of hardware resources on the FPGA. An observation regarding the Internet checksum algorithm allows us to reduce the hardware requirements for computing the checksum. Furthermore, this property also allows for initiating packet transmission immediately, i.e., the UDP/IP core can start a transmission without the requirement of receiving, storing, and processing user data beforehand. The UDP/IP core is available as open-source code. A comparison with related work on UDP/IP core implementations shows that our implementation is significantly more efficient in terms of resource utilization and performance. The experimental results were obtained on a real-world system and we also make available the PC software test application that is used for performance assessment to allow for reproduction of our results.
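The abstract does not state which property of the Internet checksum the design exploits. For orientation only, the following sketch shows the standard Internet checksum as described in RFC 1071, along with one well-known property: the ones' complement sum is associative and commutative, so partial sums over 16-bit-aligned chunks can be accumulated in any order. The function names are hypothetical and this is a software illustration, not the FPGA core.

```python
def ones_complement_sum(data: bytes, acc: int = 0) -> int:
    """Add the 16-bit big-endian words of data to acc with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    for i in range(0, len(data), 2):
        acc += (data[i] << 8) | data[i + 1]
        acc = (acc & 0xFFFF) + (acc >> 16)  # fold the carry back in
    return acc


def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum, as in RFC 1071."""
    return ~ones_complement_sum(data) & 0xFFFF


# Example byte sequence from RFC 1071.
packet = bytes.fromhex("0001f203f4f5f6f7")
whole = internet_checksum(packet)

# Same result when the sum is accumulated chunk by chunk, in either order
# (chunks must start on 16-bit boundaries).
head, tail = packet[:4], packet[4:]
forward = ~ones_complement_sum(tail, ones_complement_sum(head)) & 0xFFFF
reverse = ~ones_complement_sum(head, ones_complement_sum(tail)) & 0xFFFF
```

Because the sum can be built up incrementally, a checksum over fixed header fields can in principle be combined with a separately computed payload sum; whether the core described above does exactly this is not stated in the abstract.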

BMC Bioinformatics | 2012

Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel

Nikolaos Alachiotis; Simon A. Berger; Alexandros Stamatakis

Background: Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. Results: We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data) intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second) using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs) is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads) architecture. An NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. Conclusions: This accelerated version of PaPaRa (available at http://www.exelixis-lab.org/software.html) provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the “competing programmer approach” is deployed. Finally, we show that overall performance can be substantially increased by designing a hybrid CPU-GPU system with appropriate load distribution mechanisms.
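The GCUPS figures above are throughput measures for a dynamic programming kernel. As a rough sketch of how such a number is usually derived, the function below assumes that each alignment fills a matrix of read length times reference profile length cells; the abstract does not spell out how cells are counted, so this counting rule, the function name, and the example numbers are assumptions.

```python
def gcups(read_len: int, ref_len: int, num_alignments: int, seconds: float) -> float:
    """Giga cell updates per second for a batch of pairwise DP alignments."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    cells = read_len * ref_len * num_alignments  # one update per matrix cell
    return cells / seconds / 1e9


# Hypothetical workload: one million alignments of 100 bp reads against
# profiles of length 1000, completed in 10 seconds.
rate = gcups(100, 1000, 1_000_000, 10.0)
```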

Field-Programmable Custom Computing Machines | 2011

Accelerating Phylogeny-Aware Short DNA Read Alignment with FPGAs

Nikolaos Alachiotis; Simon A. Berger; Alexandros Stamatakis

Recent advances in molecular sequencing technology have given rise to novel algorithms for simultaneously aligning short sequence reads to reference sequence alignments and corresponding evolutionary reference trees. We present a complete hardware/software implementation for the acceleration of a program called PaPaRa, a newly introduced dynamic programming algorithm for this purpose. We verify the correctness of the proposed architecture on a real FPGA and introduce a straightforward communication protocol (using Gigabit Ethernet) for seamless integration with the encapsulating steering software that is executed on a PC processor. The hardware description and the software implementation are freely available for download. When mapped to a Virtex 6 FPGA, our reconfigurable architecture can compute 133.4 billion cell updates per second for the novel, tree-based alignment kernel of PaPaRa. Compared to PaPaRa running on a 3.2 GHz Intel Core i5 CPU, we obtain speedups for the alignment kernel that range between 170 and 471. For the entire application, that is, the alignment kernel and the trace-back step, we obtain speedups between 74 and 118.

BMC Bioinformatics | 2013

libgapmis: extending short-read alignments

Nikolaos Alachiotis; Simon A. Berger; Tomáš Flouri; Solon P. Pissis; Alexandros Stamatakis

Background: A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read--the seed--an important problem is to find the best possible alignment between a substring of the reference sequence succeeding and the remaining suffix of low quality of the read--extend. The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggests that aligning (parts of) those reads with a single gap is in fact desirable. Results: In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment. Conclusions: We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better-suited and more efficient than existing algorithms for this task. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.
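To make the idea of aligning with a single gap concrete, here is a deliberately naive sketch. It is not the libgapmis algorithm or API (which, per the abstract, uses a modified dynamic-programming matrix); it simply tries every position for one gap in the read, with the gap length fixed to the length difference, and counts mismatches. The function name and the toy sequences are hypothetical.

```python
def best_single_gap(read: str, ref: str) -> tuple[int, int]:
    """Return (gap_position, mismatches) for the best single gap in read.

    Assumes len(ref) >= len(read); the gap length is len(ref) - len(read).
    Ties are resolved in favour of the leftmost gap position.
    """
    gap = len(ref) - len(read)
    if gap < 0:
        raise ValueError("ref must be at least as long as read")
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(read) + 1):
        # read[:pos] pairs with ref[:pos]; read[pos:] pairs with ref[pos + gap:]
        mismatches = sum(a != b for a, b in zip(read[:pos], ref[:pos]))
        mismatches += sum(a != b for a, b in zip(read[pos:], ref[pos + gap:]))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Hypothetical toy example: the read lacks the "GG" present in the reference.
result = best_single_gap("ACGTACGA", "ACGTGGACGA")
```

This brute-force scan costs time proportional to the square of the read length and only illustrates the alignment model, not an efficient way to compute it.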

ACS/IEEE International Conference on Computer Systems and Applications | 2010

Accuracy of morphology-based phylogenetic fossil placement under Maximum Likelihood

Simon A. Berger; Alexandros Stamatakis

The capability to conduct Maximum Likelihood based phylogenetic (evolutionary) analyses on datasets that contain both morphological and molecular data partitions with programs such as RAxML gives rise to new methodological questions. As we demonstrate on 5 real-world datasets that comprise morphological as well as DNA data, the trees inferred by separately using the morphological or molecular data partitions are highly incongruent. Since in typical current-day phylogenomic alignments there is significantly more molecular than morphological data available, and hence the final tree shape in a concatenated analysis is dominated by molecular data, the question arises how morphological data can be used within this context. One important application lies in the phylogenetic placement of fossil taxa (for which only morphological data is available) into a fixed, given molecular or otherwise well-established reference tree. By using real and simulated datasets we conduct the first assessment of placement accuracy for fossil taxa under the Maximum Likelihood criterion. We demonstrate that, despite conflicting phylogenetic signals from the morphological and molecular partitions, the Maximum Likelihood criterion is powerful enough to yield accurate fossil placements. Moreover, we develop and make available a new morphological site weight calibration algorithm that yields an average improvement of fossil placement accuracy of 20% on more than 2,500 simulated datasets and of 25% on the 5 real-world datasets that all contain highly conflicting phylogenetic signal.
