Michael Beckstette | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Michael Beckstette is active.

Explore More

Publication

Featured researches published by Michael Beckstette.

BMC Bioinformatics | 2006

Fast index based algorithms and software for matching position specific scoring matrices

Michael Beckstette; Robert Homann; Robert Giegerich; Stefan Kurtz

BackgroundIn biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task.ResultsWe present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.ConclusionOur analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |AMathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@|m + m - 1, where m is the length of the PSSM and AMathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.

Nature Methods | 2017

Critical assessment of metagenome interpretation − a benchmark of computational metagenomics software

Alexander Sczyrba; Peter Hofmann; Peter Belmann; David Koslicki; Stefan Janssen; Johannes Droege; Ivan Gregor; Stephan Majda; Jessika Fiedler; Eik Dahms; Andreas Bremges; Adrian Fritz; Ruben Garrido-Oter; Tue Sparholt Jørgensen; Nicole Shapiro; Philip D. Blood; Alexey Gurevich; Yang Bai; Dmitrij Turaev; Matthew Z. DeMaere; Rayan Chikhi; Niranjan Nagarajan; Christopher Quince; Fernando Meyer; Monika Balvociute; Lars Hestbjerg Hansen; Søren J. Sørensen; Burton K H Chia; Bertrand Denis; Jeff Froula

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.

Nature Methods | 2017

Critical Assessment of Metagenome Interpretation — a benchmark of metagenomics software

Alexander Sczyrba; Peter Hofmann; Peter Belmann; David Koslicki; Stefan Janssen; Johannes Dröge; Ivan Gregor; Stephan Majda; Jessika Fiedler; Eik Dahms; Andreas Bremges; Adrian Fritz; Ruben Garrido-Oter; Tue Sparholt Jørgensen; Nicole Shapiro; Philip D. Blood; Alexey Gurevich; Yang Bai; Dmitrij Turaev; Matthew Z. DeMaere; Rayan Chikhi; Niranjan Nagarajan; Christopher Quince; Fernando Meyer; Monika Balvočiūtė; Lars Hestbjerg Hansen; Søren J. Sørensen; Burton K H Chia; Bertrand Denis; Jeff Froula

Bioinformatics | 2009

Lightweight comparison of RNAs based on exact sequence–structure matches

Steffen Heyne; Sebastian Will; Michael Beckstette; Rolf Backofen

Motivation: Specific functions of ribonucleic acid (RNA) molecules are often associated with different motifs in the RNA structure. The key feature that forms such an RNA motif is the combination of sequence and structure properties. In this article, we introduce a new RNA sequence–structure comparison method which maintains exact matching substructures. Existing common substructures are treated as whole unit while variability is allowed between such structural motifs. Based on a fast detectable set of overlapping and crossing substructure matches for two nested RNA secondary structures, our method ExpaRNA (exact pattern of alignment of RNA) computes the longest collinear sequence of substructures common to two RNAs in O(H·nm) time and O(nm) space, where H ≪ n·m for real RNA structures. Applied to different RNAs, our method correctly identifies sequence–structure similarities between two RNAs. Results: We have compared ExpaRNA with two other alignment methods that work with given RNA structures, namely RNAforester and RNA_align. The results are in good agreement, but can be obtained in a fraction of running time, in particular for larger RNAs. We have also used ExpaRNA to speed up state-of-the-art Sankoff-style alignment tools like LocARNA, and observe a tradeoff between quality and speed. However, we get a speedup of 4.25 even in the highest quality setting, where the quality of the produced alignment is comparable to that of LocARNA alone. Availability: The presented algorithm is implemented in the program ExpaRNA, which is available from our website (http://www.bioinf.uni-freiburg.de/Software). Contact: {[email protected],[email protected]} Supplementary information: Supplementary data are available at Bioinformatics online.

BMC Bioinformatics | 2011

Structator: fast index-based search for RNA sequence-structure patterns

Fernando Meyer; Stefan Kurtz; Rolf Backofen; Sebastian Will; Michael Beckstette

BackgroundThe secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs.ResultsWe present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods.ConclusionsThe presented methods sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator.

European Journal of Human Genetics | 2015

An AP4B1 frameshift mutation in siblings with intellectual disability and spastic tetraplegia further delineates the AP-4 deficiency syndrome

Hengameh Abdollahpour; Malik Alawi; Fanny Kortüm; Michael Beckstette; Eva Seemanova; Vladimír Komárek; Georg Rosenberger; Kerstin Kutsche

The recently proposed adaptor protein 4 (AP-4) deficiency syndrome comprises a group of congenital neurological disorders characterized by severe intellectual disability (ID), delayed or absent speech, hereditary spastic paraplegia, and growth retardation. AP-4 is a heterotetrameric protein complex with important functions in vesicle trafficking. Mutations in genes affecting different subunits of AP-4, including AP4B1, AP4E1, AP4S1, and AP4M1, have been reported in patients with the AP-4 deficiency phenotype. We describe two siblings from a non-consanguineous couple who presented with severe ID, absent speech, microcephaly, growth retardation, and progressive spastic tetraplegia. Whole-exome sequencing in the two patients identified the novel homozygous 2-bp deletion c.1160_1161delCA (p.(Thr387Argfs*30)) in AP4B1. Sanger sequencing confirmed the mutation in the siblings and revealed it in the heterozygous state in both parents. The AP4B1-associated phenotype has previously been assigned to spastic paraplegia-47. Identification of a novel AP4B1 alteration in two patients with clinical manifestations highly similar to other individuals with mutations affecting one of the four AP-4 subunits further supports the observation that loss of AP-4 assembly or functionality underlies the common clinical features in these patients and underscores the existence of the clinically recognizable AP-4 deficiency syndrome.

BMC Bioinformatics | 2013

Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Fernando Meyer; Stefan Kurtz; Michael Beckstette

BackgroundIt is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequence-structure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequence-structure alignment problem.ResultsWe present new fast index-based and online algorithms for approximate matching of RNA sequence-structure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semi-global alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a user-defined sequence-structure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of non-matching substrings. Our new index-based methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequence-structure patterns, we use fast algorithms for the local or global chaining of approximate sequence-structure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new index-based algorithm achieves a speedup of factor 560.ConclusionsThe presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented index-based algorithms, allows for the first time approximate matching of RNA sequence-structure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented open-source software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.uni-hamburg.de/ralignator.

Bioinformatics | 2009

Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Michael Beckstette; Robert Homann; Robert Giegerich; Stefan Kurtz

Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in todays genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive. Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet, results could be obtained in a fraction of runtime with a speedup of >64-fold. In experiments addressing the methods ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92. Availability: The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

german conference on bioinformatics | 2004

Genlight: Interactive high-throughput sequence analysis and comparative genomics

Michael Beckstette; Jens T. Mailänder; Richard J. Marhöfer; Alexander Sczyrba; Enno Ohlebusch; Robert Giegerich; Paul M. Selzer

Abstract With rising numbers of fully sequenced genomes the importance of comparative genomics is constantly increasing. Although several software systems for genome comparison analyses do exist, their functionality and flexibility is still limited, compared to the manifold possible applications. Therefore, we developed Genlight(http://piranha.techfak.uni-bielefeld.de.), a Client/Server based program suite for large scale sequence analysis and comparative genomics. Genlight uses the object relational database system PostgreSQL together with a state of the art data representation and a distributed execution approach for large scale analysis tasks. The system includes a wide variety of comparison and sequence manipulation methods and supports the management of nucleotide sequences as well as protein sequences. The comparison methods are complemented by a large variety of visualization methods for the assessment of the generated results. In order to demonstrate the suitability of the system for the treatment of biological questions, Genlight was used to identify potential drug and vaccine targets of the pathogen Helicobacter pylori.

Journal of Integrative Bioinformatics | 2011

CASSys: an integrated software-system for the interactive analysis of ChIP-seq data.

Malik Alawi; Stefan Kurtz; Michael Beckstette

The mapping of DNA-protein interactions is crucial for a full understanding of transcriptional regulation. Chromatin-immunoprecipitation followed by massively parallel sequencing (ChIP-seq) has become the standard technique for analyzing these interactions on a genome-wide scale. We have developed a software system called CASSys (ChIP-seq data Analysis Software System) spanning all steps of ChIP-seq data analysis. It supersedes the laborious application of several single command line tools. CASSys provides functionality ranging from quality assessment and -control of short reads, over the mapping of reads against a reference genome (readmapping) and the detection of enriched regions (peakdetection) to various follow-up analyses. The latter are accessible via a state-of-the-art web interface and can be performed interactively by the user. The follow-up analyses allow for flexible user defined association of putative interaction sites with genes, visualization of their genomic context with an integrated genome browser, the detection of putative binding motifs, the identification of over-represented Gene Ontology-terms, pathway analysis and the visualization of interaction networks. The system is client-server based, accessible via a web browser and does not require any software installation on the client side. To demonstrate CASSyss functionality we used the system for the complete data analysis of a publicly available Chip-seq study that investigated the role of the transcription factor estrogen receptor-α in breast cancer cells.

Explore More