Gergely Korodi
Tampere University of Technology
Publications
Featured research published by Gergely Korodi.
ACM Transactions on Information Systems | 2005
Gergely Korodi; Ioan Tabus
This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported on a test set commonly used for evaluating DNA compression programs. The algorithm refines a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, with reference to preceding approximately matching blocks; (2) encoding by a first-order context coder; and (3) representing strings in the clear. Together these make efficient use of the redundancy sources in DNA data while keeping execution times fast. One of the main algorithmic features is the constraint that matching blocks must include reasonably long contiguous matches, which not only significantly reduces the search time but can also be exploited in the NML model to obtain smaller code lengths. The algorithm handles the changing statistics of DNA data adaptively, and by predictively encoding the matching pointers it succeeds in compressing long approximate matches. Apart from comparisons with previous DNA encoding methods, we present compression results for the recently published human genome data.
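The contiguous-match constraint described above can be illustrated with a toy search: a candidate earlier block is considered an approximate repeat only if it shares a long exact "seed" with the current block. The function name and the parameters `seed_len` and `max_mismatches` are invented for illustration; the paper's actual search and match model are more refined.

```python
def find_approximate_repeats(seq, block_start, block_len, seed_len=8, max_mismatches=2):
    """Toy search for earlier approximate copies of seq[block_start:block_start+block_len].

    A candidate position is examined in full only if it shares a contiguous
    exact seed of seed_len symbols with the block (the contiguous-match
    constraint); the full comparison then allows up to max_mismatches
    substitutions.
    """
    block = seq[block_start:block_start + block_len]
    # Index every seed of the block for quick filtering.
    seeds = {block[i:i + seed_len]: i for i in range(block_len - seed_len + 1)}
    hits = []
    for pos in range(block_start - block_len + 1):
        window = seq[pos:pos + block_len]
        # Contiguous-match constraint: require one exact seed before counting mismatches.
        if not any(window[i:i + seed_len] in seeds
                   for i in range(block_len - seed_len + 1)):
            continue
        mismatches = sum(a != b for a, b in zip(window, block))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

Because most candidate positions fail the cheap seed test, the expensive symbol-by-symbol comparison runs only rarely, which is the source of the speed-up the abstract mentions.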
Data Compression Conference | 2003
Ioan Tabus; Gergely Korodi; Jorma Rissanen
This work discusses the use of the normalized maximum likelihood (NML) model for encoding sequences known to contain regularities in the form of approximate repetitions. A particular version of the NML model is presented for discrete regression, which is shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model yields a fast lossless compression method that compares favorably with existing DNA compression programs. It is remarkable that such a simple model, which recursively updates a small number of parameters, is able to match the state-of-the-art compression ratios that much more complex models achieve on DNA sequences. Being a minimum description length (MDL) model, the NML model may later prove useful for studying global and local features of DNA, or possibly of other biological sequences.
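For context, the generic NML code length for a sequence $x^n$ under a model class with maximum-likelihood estimator $\hat{\theta}(\cdot)$ is the standard definition below, not the paper's specific discrete-regression instance:

```latex
L_{\mathrm{NML}}(x^n) \;=\; -\log P\!\left(x^n \mid \hat{\theta}(x^n)\right)
\;+\; \log \sum_{y^n} P\!\left(y^n \mid \hat{\theta}(y^n)\right)
```

The first term is the best fit the class can offer to the observed data; the second (the parametric complexity) normalizes over all sequences, which is what makes NML a minimum description length code.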
Data Compression Conference | 2007
Gergely Korodi; Ioan Tabus
We present the NML model for classes of models with memory described by first-order dependencies. The model is used to efficiently locate and encode the best regressor present in a dictionary. By combining the order-1 NML model with the order-0 NML model, the resulting algorithm achieves a consistent improvement over the earlier order-0 NML algorithm, and it is demonstrated to have superior performance on practical compression of the human genome.
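The gain available from first-order dependencies can be illustrated with plug-in (maximum-likelihood) code lengths. This is only a schematic stand-in for the NML computations above, which additionally charge for parametric complexity; the function name and setup are invented.

```python
from collections import Counter, defaultdict
from math import log2

def ideal_code_lengths(seq):
    """Ideal plug-in code lengths in bits for an order-0 and an order-1 model.

    Probabilities are maximum-likelihood frequencies estimated from the whole
    sequence; positions 1..n-1 are scored under both models so the comparison
    is fair.
    """
    n = len(seq)
    # Order-0: marginal symbol frequencies over positions 1..n-1.
    marg = Counter(seq[1:])
    bits0 = -sum(log2(marg[s] / (n - 1)) for s in seq[1:])
    # Order-1: frequencies conditioned on the previous symbol.
    cond = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        cond[prev][cur] += 1
    bits1 = -sum(log2(cond[prev][cur] / sum(cond[prev].values()))
                 for prev, cur in zip(seq, seq[1:]))
    return bits0, bits1
```

On a sequence with strong first-order structure (e.g. a periodic string) the order-1 code length collapses toward zero while the order-0 length stays near two bits per symbol, which is the effect the order-1 NML model exploits.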
IEEE Signal Processing Magazine | 2007
Gergely Korodi; Ioan Tabus; J. Rissanen; J. Astola
Genomic data provide challenging problems that have been studied in a number of fields such as statistics, signal processing, information theory, and computer science. This article shows that the methodologies and tools recently developed in these fields for modeling signals and processes appear to be highly promising for genomic research.
Data Compression Conference | 2005
Gergely Korodi; Jorma Rissanen; Ioan Tabus
We discuss a lossless data compression system that uses fixed tree machines to encode data. The idea is to create a sequence of tree machines together with a robust escape method that prevents expansion of the encoded string when the data's statistics deviate from those represented by the machines. The resulting algorithm is shown to compress short files better than competing methods.
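The anti-expansion idea behind a robust escape can be sketched as a framing rule: if the model's output is not smaller than the input, store the input verbatim behind a flag byte. Here `zlib` stands in for the tree-machine coder, and the one-byte flag format is an illustrative assumption, not the paper's format.

```python
import zlib

def encode_with_escape(data: bytes) -> bytes:
    """Robust-escape framing: output never exceeds the input by more than one byte.

    If compression does not shrink the data, fall back to storing it
    verbatim behind an escape flag.
    """
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return b"\x01" + packed      # flag 1: compressed payload
    return b"\x00" + data            # flag 0: stored verbatim (the escape)

def decode_with_escape(blob: bytes) -> bytes:
    flag, payload = blob[0], blob[1:]
    return zlib.decompress(payload) if flag == 1 else payload
```

The design guarantee is worst-case: data whose statistics do not match the coder costs at most one extra byte instead of blowing up.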
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2007
Gergely Korodi; Ioan Tabus
This paper introduces an algorithm for the lossless compression of DNA files which contain annotation text alongside the nucleotide sequence. First, a grammar is specifically designed to capture the regularities of the annotation text. A revertible transformation uses the grammar rules to represent the original file equivalently as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the decision-making process of the grammar is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.
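The "parsed segments plus decision sequence" decomposition can be sketched with a toy grammar: each annotation line is matched against a few known keywords, the rule chosen is recorded as a decision, and the rest of the line becomes a segment. The keyword set and function names here are invented; the paper's grammar is far richer.

```python
KEYWORDS = ("LOCUS", "DEFINITION", "ORIGIN")   # illustrative toy grammar

def parse_annotation(lines, keywords=KEYWORDS):
    """Toy revertible decomposition of annotation lines.

    Returns (decisions, segments): the parser's rule choice per line and the
    residual text. Together they reconstruct the input exactly.
    """
    decisions, segments = [], []
    for line in lines:
        for i, kw in enumerate(keywords):
            if line.startswith(kw + " "):
                decisions.append(i)                 # rule i: known keyword
                segments.append(line[len(kw) + 1:])
                break
        else:
            decisions.append(len(keywords))         # escape rule: no keyword matched
            segments.append(line)
    return decisions, segments

def unparse_annotation(decisions, segments, keywords=KEYWORDS):
    """Invert parse_annotation exactly."""
    return [segments[k] if d == len(keywords) else keywords[d] + " " + segments[k]
            for k, d in enumerate(decisions)]
```

The point of the split is that the decision stream is highly predictable (and thus cheap under a high-order model), while the segments can go to encoders specialized for their content.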
International Symposium on Communications, Control and Signal Processing | 2008
Gergely Korodi; Ioan Tabus
In this paper we analyze the Prediction by Partial Matching (PPM) algorithm as an aggregate of several functions operating on a context-tree data structure. We describe some serious weaknesses of the original method and propose new directions for improving compression efficiency.
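As background, the core PPM mechanism is context fallback with escapes: prediction starts at the longest context and escapes to shorter ones when a symbol is unseen there. The sketch below is a simplified textbook variant (method-C escapes, no exclusion handling), not the refined scheme the paper analyzes.

```python
from collections import defaultdict, Counter

def ppm_probability(history, symbol, max_order=2, alphabet="ACGT"):
    """Simplified PPM probability of `symbol` after `history` (method C escapes).

    Starts at the longest context of length max_order and multiplies in an
    escape probability each time the symbol is unseen, falling back to
    shorter contexts and finally to a uniform distribution.
    """
    # Gather counts for every context length from the history itself.
    counts = {k: defaultdict(Counter) for k in range(max_order + 1)}
    for i in range(len(history)):
        for k in range(max_order + 1):
            if i >= k:
                counts[k][history[i - k:i]][history[i]] += 1
    p = 1.0
    for k in range(max_order, -1, -1):
        ctx = history[len(history) - k:]
        c = counts[k][ctx]
        total, distinct = sum(c.values()), len(c)
        if total == 0:
            continue                        # empty context: free escape
        if c[symbol] > 0:
            return p * c[symbol] / (total + distinct)
        p *= distinct / (total + distinct)  # escape probability (method C)
    return p / len(alphabet)                # uniform fallback
```

Viewing each stage (count lookup, escape estimation, fallback) as a separate function over the context tree is the decomposition the abstract refers to.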
Data Compression Conference | 2000
Ioan Tabus; Gergely Korodi; Jorma Rissanen
An n-state Markov model for symbol occurrences is extended to an equivalent source that emits, at every state i, variable-length strings of symbols drawn from a dictionary, each encoded by its index in that dictionary. The algorithm for building the n dictionaries optimizes the rate subject to a given total number of entries across the dictionaries, and it is practical even for Markov sources with thousands of states. The speed of the algorithm stems from encoding strings by table lookups instead of single symbols; for this, the n dictionaries need to be known to both the encoder and the decoder. A static version of the algorithm is very well suited to creating compressed files with random access. An adaptive version is shown to be faster than the methods in the PPM class, while providing only slightly lower compression ratios.
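The per-state dictionary idea can be sketched as follows: the state here is simply the previous symbol, each state owns its own phrase list, and encoding greedily emits (state, phrase-index) pairs. The dictionaries below are hand-picked toys, not the rate-optimized ones the paper constructs.

```python
def encode_per_state(text, dicts):
    """Encode `text` with one phrase dictionary per Markov state.

    The state is the previous symbol ('' initially). At each step the longest
    matching phrase in the current state's dictionary is emitted as its
    index. Every dictionary must contain all single symbols so a match
    always exists.
    """
    codes, pos, state = [], 0, ""
    while pos < len(text):
        phrases = dicts[state]
        # Greedy longest match among this state's phrases.
        best = max((p for p in phrases if text.startswith(p, pos)), key=len)
        codes.append((state, phrases.index(best)))
        pos += len(best)
        state = best[-1]        # next state: last symbol of the phrase
    return codes

def decode_per_state(codes, dicts):
    """Decoding is a pure table lookup, which is where the speed comes from."""
    return "".join(dicts[state][idx] for state, idx in codes)
```

Because decoding is nothing but dictionary indexing, throughput depends on phrase length rather than per-symbol modeling work.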
International Conference on Bioinformatics | 2006
Gergely Korodi; Ioan Tabus
This article investigates the efficiency of randomly accessible coding for annotated genome files and compares it to universal coding. The result is an encoder that has excellent compression efficiency on annotated genome sequences and provides instantaneous access to functional elements in the file, and thus serves as a basis for further applications such as indexing and searching for specified feature entries.
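Randomly accessible coding trades a little compression for independence between units: each record is compressed on its own and an offset index lets any record be decoded without touching the rest. The layout below (zlib per record, a plain offset list) is an illustrative assumption, not the paper's format.

```python
import zlib

def build_archive(records):
    """Block-wise compressed archive with an offset index for random access.

    Each record (e.g. one annotated feature) is compressed independently and
    its starting byte offset recorded; a final sentinel offset marks the end
    of the last record.
    """
    blob, index = bytearray(), []
    for rec in records:
        index.append(len(blob))
        blob += zlib.compress(rec.encode(), 9)
    index.append(len(blob))          # sentinel: end of the archive
    return bytes(blob), index

def fetch_record(blob, index, i):
    """Decode record i alone, using only its slice of the archive."""
    return zlib.decompress(blob[index[i]:index[i + 1]]).decode()
```

The cost relative to a universal coder is that no statistics are shared across record boundaries, which is exactly the efficiency gap the article measures.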
International Conference on Bioinformatics | 2007
Gergely Korodi; Ioan Tabus
This article introduces a universal algorithm for creating compressed archives with instantaneous access to, and decodability of, designated functional elements. A special-purpose variant is also given to enhance performance on DNA sequences. The resulting algorithm, integrated into an earlier scheme, achieves a marked improvement in randomly accessible coding for annotated genome files, while fully retaining the functionality of instantaneous retrieval of all feature entries.