Agnieszka Debudaj-Grabysz
Silesian University of Technology
Publication
Featured research published by Agnieszka Debudaj-Grabysz.
Bioinformatics | 2015
Sebastian Deorowicz; Marek Kokot; Szymon Grabowski; Agnieszka Debudaj-Grabysz
MOTIVATION: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory. RESULTS: We present a novel method for k-mer counting, on large datasets about twice as fast as the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers significantly reduces the I/O, and a highly parallel overall architecture achieves unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min, on a 6-core Intel i7 PC with a solid-state disk.
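As an illustration of the k-mer counting task described in the abstract above, the following minimal Python sketch builds the k-mer histogram and groups k-mers into bins keyed by a minimizer. It assumes plain lexicographic minimizers rather than the signatures used by KMC 2, ignores reverse complements and (k, x)-mers, and keeps everything in memory instead of on disk. The function names are hypothetical and are not taken from KMC.

```python
from collections import Counter, defaultdict


def count_kmers(seq, k):
    """Return the histogram of every k-symbol substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def minimizer(kmer, m):
    """Return the lexicographically smallest m-symbol substring of kmer.

    Assumption: a plain lexicographic minimizer, not a KMC 2 signature.
    """
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bin_by_minimizer(seq, k, m):
    """Group the k-mers of seq into bins keyed by minimizer, counting per bin.

    Every occurrence of a given k-mer lands in the same bin, so each bin
    could be counted independently (for example, one bin per temporary file).
    """
    bins = defaultdict(Counter)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins[minimizer(kmer, m)][kmer] += 1
    return bins
```

For instance, `count_kmers("ACGTACGT", 3)` reports ACG and CGT twice each and GTA and TAC once each. Merging the per-bin counters from `bin_by_minimizer` gives the same histogram.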
BMC Bioinformatics | 2013
Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Szymon Grabowski
Background: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, given as a gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution is also among the memory-frugal ones; most competing algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data. Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer.
Lecture Notes in Computer Science | 2005
Agnieszka Debudaj-Grabysz; Rolf Rabenseifner
Concurrent computing can be applied to heuristic methods for combinatorial optimization to shorten computation time, or equivalently, to improve the solution when time is fixed. This paper presents several communication schemes for parallel simulated annealing, focusing on a combination of OpenMP nested in MPI. Strikingly, even though many publications devoted to either intensive or sparse communication methods in parallel simulated annealing exist, only a few comparisons of methods from these two distinctive families have been published; the present paper aspires to partially fill this gap. Implementation for VRPTW—a generally accepted benchmark problem—is used to illustrate the advantages of the hybrid method over others tested.
ICMMI | 2014
Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Adam Gudyś
Determination of similarities between species is a crucial issue in life sciences. This task is usually done by comparing fragments of genomic or proteomic sequences of the organisms under analysis. The basic procedure which facilitates these comparisons is called multiple sequence alignment. Many algorithms address this problem, but they tend to be either accurate or fast. We present Kalign-LCS, a variant of the fast Kalign2 algorithm, that addresses the accuracy vs. speed trade-off. It employs the longest common subsequence measure and was thoroughly optimized. Experiments show that it is faster than Kalign2 and produces noticeably more accurate alignments.
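The longest common subsequence (LCS) measure mentioned in the abstract above can be illustrated with the textbook dynamic-programming computation of the LCS length of two sequences. This is only a sketch of the measure itself. It is not the optimized implementation used in Kalign-LCS, and the function name is hypothetical.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Textbook O(len(a) * len(b)) dynamic programming that keeps only
    the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence ending before both symbols.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one symbol from either sequence, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

A longer LCS relative to the sequence lengths indicates a more similar pair, which is what makes the value usable as a pairwise similarity estimate.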
Scientific Reports | 2016
Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Adam Gudyś
Rapid development of modern sequencing platforms has enabled an unprecedented growth of protein family databases. The abundance of sets composed of hundreds of thousands of sequences is a great challenge for multiple sequence alignment algorithms. In the article we introduce FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilisation of the longest common subsequence measure for determining pairwise similarities, a novel method of gap costs evaluation, and a new iterative refinement scheme. Importantly, its implementation is highly optimised and parallelised to make the most of modern computer platforms. Thanks to the above, quality indicators, namely sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms like Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. The quality does not come at the expense of time and memory requirements, which are an order of magnitude lower than those of existing solutions. For example, a family of 415 519 sequences was analysed in less than two hours and required only 8 GB of RAM. FAMSA is freely available at this http URL
international conference on parallel processing | 2006
Agnieszka Debudaj-Grabysz; Rolf Rabenseifner
The paper focuses on a parallel implementation of a simulated annealing algorithm. In order to take advantage of the properties of modern clustered SMP architectures a hybrid method using a combination of OpenMP nested in MPI is advocated. The development of the reference implementation is proposed. Furthermore, a few load balancing strategies are introduced: time scheduling at the annealing process level, clustering at the basic annealing step level and suspending—inside of the basic annealing step. The application of the algorithm to VRPTW—a generally accepted benchmark problem—is used to illustrate their positive influence on execution time and the quality of results.
international conference: beyond databases, architectures and structures | 2017
Marek Kokot; Sebastian Deorowicz; Agnieszka Debudaj-Grabysz
The paper introduces RADULS, a new parallel sorter based on the radix sort algorithm, intended to organize ultra-large data sets efficiently. For example, 4 G 16-byte records can be sorted with 16 threads in less than 15 s on an Intel Xeon-based workstation. The implementation of RADULS is not only highly optimized to achieve this performance, but also parallelized in a cache-friendly manner to make the most of modern multicore architectures. In addition, our parallel scheduler launches a few different procedures at runtime, according to the current parameters of the execution, for proper workload management. All experiments show RADULS to be superior to competing algorithms.
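For readers unfamiliar with radix sort, the idea behind the sorter described above can be sketched as a sequential, byte-wise pass over integer keys. The variant below is a least-significant-digit radix sort chosen for brevity. It is an assumption made for illustration and does not reproduce the parallel, cache-aware design of RADULS. The function name and the `key_bytes` parameter are hypothetical.

```python
def radix_sort(values, key_bytes=8):
    """Sort non-negative integers below 256**key_bytes, one byte per pass.

    Each pass distributes the keys into 256 buckets by the current byte,
    starting from the least significant one. Because every pass is stable,
    the keys are fully ordered after the last pass.
    """
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        # Concatenate buckets in order, preserving arrival order inside each.
        values = [v for bucket in buckets for v in bucket]
    return values
```

The number of passes depends only on the key width, not on the number of records, which is why radix sorting suits very large collections of fixed-size keys.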
Bioinformatics | 2018
Sebastian Deorowicz; Joanna Walczyszyn; Agnieszka Debudaj-Grabysz
Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB is multiple sequence alignment of protein families. The largest such database, Pfam, consumes 40-230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge. Results: We propose a novel compression algorithm, CoMSA, designed especially for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, e.g. gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio. Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa. Supplementary material: Supplementary data are available at Bioinformatics online.
bioRxiv | 2017
Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Adam Gudyś; Szymon Grabowski
Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in a variant calling pipeline). Availability: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/ Contact: [email protected] Supplementary information: Supplementary data are available at the publisher Web site.
bioRxiv | 2017
Sebastian Deorowicz; Joanna Walczyszyn; Agnieszka Debudaj-Grabysz
Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB is multiple sequence alignment of protein families. The largest such database, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge. Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, e.g., gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio. Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/msac. Contact: [email protected] Supplementary material: Supplementary data are available at the publisher Web site.