Szymon Grabowski
Lodz University of Technology
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Szymon Grabowski.
Bioinformatics | 2011
Sebastian Deorowicz; Szymon Grabowski
MOTIVATION Modern sequencing instruments are able to generate at least hundreds of millions short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression. RESULTS We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org). AVAILABILITY DSRC is freely available at http:/sun.aei.polsl.pl/dsrc.
Bioinformatics | 2015
Sebastian Deorowicz; Marek Kokot; Szymon Grabowski; Agnieszka Debudaj-Grabysz
MOTIVATION Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory. RESULTS We present a novel method for k-mer counting, on large datasets about twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers allows to significantly reduce the I/O and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min, on a 6-core Intel i7 PC with an solid-state disk.
Bioinformatics | 2011
Sebastian Deorowicz; Szymon Grabowski
MOTIVATION Storing, transferring and maintaining genomic databases becomes a major challenge because of the rapid technology progress in DNA sequencing and correspondingly growing pace at which the sequencing data are being produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data. RESULTS We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300. AVAILABILITY GDC is available at http://sun.aei.polsl.pl/gdc. CONTACT [email protected]. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
string processing and information retrieval | 2005
Kimmo Fredriksson; Szymon Grabowski
We develop a new exact bit-parallel string matching algorithm, based on the Shift-Or algorithm (Baeza-Yates & Gonnet, 1992). Assuming that the pattern representation fits into a single computer word, this algorithm has optimal O(n logσm / m) average running time, as well as optimal O(n) worst case running time, where n, m and σ are the sizes of the text, the pattern, and the alphabet, respectively. We also study several implementation details. The experimental results show that our algorithm is the fastest in most of the cases where it can be applied, displacing even the long-standing BNDM (Navarro & Raffinot, 2000) family of algorithms. Finally, we show how to adapt our techniques for the Shift-Add algorithm (Baeza-Yates & Gonnet, 1992), obtaining optimal time for searching under Hamming distance.
Software - Practice and Experience | 2005
Przemys law Skibiński; Szymon Grabowski; Sebastian Deorowicz
An attractive way to increase text compression is to replace words with references to a text dictionary given in advance. Although there exist a few works in this area, they do not fully exploit the compression possibilities or consider alternative preprocessing variants for various compressors in the latter phase. In this paper, we discuss several aspects of dictionary‐based compression, including compact dictionary representation, and present a PPM/BWCA‐oriented scheme, word replacing transformation, achieving compression ratios higher by 2–6% than the state‐of‐the‐art StarNT (2003) text preprocessor, working at a greater speed. We also present an alternative scheme designed for LZ77 compressors, with the advantage over StarNT of reaching up to 14% in combination with gzip. Copyright
Algorithms for Molecular Biology | 2013
Sebastian Deorowicz; Szymon Grabowski
Post-Sanger sequencing methods produce tons of data, and there is a generalagreement that the challenge to store and process them must be addressedwith data compression. In this review we first answer the question“why compression” in a quantitative manner. Then we also answerthe questions “what” and “how”, by sketching thefundamental compression ideas, describing the main sequencing data types andformats, and comparing the specialized compression algorithms and tools.Finally, we go back to the question “why compression” and giveother, perhaps surprising answers, demonstrating the pervasiveness of datacompression techniques in computational biology.
BMC Bioinformatics | 2013
Sebastian Deorowicz; Agnieszka Debudaj-Grabysz; Szymon Grabowski
BackgroundThe k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.ResultsWe propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.ConclusionsBy making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.
Information Processing Letters | 2009
Holger Petersen; Szymon Grabowski
Given a list of n items and a function defined over sub-lists, we study the space required for computing the function for arbitrary sub-lists in constant time. For the function mode we improve the previously known space bound O(n^2/logn) to O(n^2loglogn/log^2n) words. For median the space bound is improved to O(n^2loglog^2n/log^2n) words from O(n^2@?log^(^k^)n/logn), where k is an arbitrary constant and log^(^k^) is the iterated logarithm.
Bioinformatics | 2013
Sebastian Deorowicz; Agnieszka Danek; Szymon Grabowski
MOTIVATION Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms. RESULTS We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ∼400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions. AVAILABILITY http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license. Supplementary data: Supplementary data are available at Bioinformatics online.
Bioinformatics | 2015
Szymon Grabowski; Sebastian Deorowicz; Łukasz Roguski
MOTIVATION High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. RESULTS We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gbp dataset into only 5.31 GB of space. AVAILABILITY AND IMPLEMENTATION http://sun.aei.polsl.pl/orcom under a free license. CONTACT [email protected] SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.