Simon Gog
University of Melbourne
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Simon Gog.
symposium on experimental and efficient algorithms | 2014
Simon Gog; Timo Beller; Alistair Moffat; Matthias Petri
Engineering efficient implementations of compact and succinct structures is time-consuming and challenging, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is difficult, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing two succinct solutions for top-k document retrieval which can operate on both character and integer alphabets.
Software - Practice and Experience | 2014
Simon Gog; Matthias Petri
Succinct data structures provide the same functionality as their corresponding traditional data structure in compact space. We improve on functions rank and select, which are the basic building blocks of FM‐indexes and other succinct data structures. First, we present a cache‐optimal, uncompressed bitvector representation that outperforms all existing approaches. Next, we improve, in both space and time, on a recent result by Navarro and Providel on compressed bitvectors. Last, we show techniques to perform rank and select on 64‐bit words that are up to three times faster than existing methods. In our experimental evaluation, we first show how our improvements affect cache and runtime performance of both operations on data sets larger than commonly used in the evaluation of succinct data structures. Our experiments show that our improvements to these basic operations significantly improve the runtime performance and compression effectiveness of FM‐indexes on small and large data sets. To our knowledge, our improvements result in FM‐indexes that are either smaller or faster than all current state of the art implementations. Copyright
string processing and information retrieval | 2010
Enno Ohlebusch; Simon Gog; Adrian Kügell
Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed indexes are able to obtain almost optimal space and search time simultaneously. It is known that a myriad of sequence analysis and comparison problems can be solved efficiently with established data structures like the suffix tree or the suffix array, but algorithms on compressed indexes that solve these problem are still lacking at present. Here, we show that matching statistics and maximal exact matches between two strings S1 and S2 can be computed efficiently by matching S2 backwards against a compressed index of S1.
Journal of Discrete Algorithms | 2013
Timo Beller; Simon Gog; Enno Ohlebusch; Thomas Schnattinger
Many sequence analysis tasks can be accomplished with a suffix array, and several of them additionally need the longest common prefix array. In large scale applications, suffix arrays are being replaced with full-text indexes that are based on the Burrows-Wheeler transform. In this paper, we present the first algorithm that computes the longest common prefix array directly on the wavelet tree of the Burrows-Wheeler transformed string. It runs in linear time and a practical implementation requires approximately 2.2 bytes per character.
combinatorial pattern matching | 2010
Thomas Schnattinger; Enno Ohlebusch; Simon Gog
Searching for genes encoding microRNAs (miRNAs) is an important task in genome analysis. Because the secondary structure of miRNA (but not the sequence) is highly conserved, the genes encoding it can be determined by finding regions in a genomic DNA sequence that match the structure. It is known that algorithms using a bidirectional search on the DNA sequence for this task outperform algorithms based on unidirectional search. The data structures supporting a bidirectional search (affix trees and affix arrays), however, are rather complex and suffer from their large space consumption. Here, we present a new data structure called bidirectional wavelet index that supports bidirectional search with much less space. With this data structure, it is possible to search for RNA secondary structural patterns in large genomes, for example the human genome.
combinatorial pattern matching | 2011
Enno Ohlebusch; Simon Gog
For 30 years the Lempel-Ziv factorization of a string has played an important role in data compression, and more recently it was used as the basis of linear time algorithms for the detection of all maximal repetitions (runs) in a string. In this paper, we present two new linear time algorithms: the first one is the fastest and the second is the most space-efficient among all LZ-factorization algorithms known so far.
Information & Computation | 2012
Thomas Schnattinger; Enno Ohlebusch; Simon Gog
Searching for genes encoding microRNAs (miRNAs) is an important task in genome analysis. Because the secondary structure of miRNA (but not the sequence) is highly conserved, the genes encoding it can be determined by finding regions in a genomic DNA sequence that match the structure. It is known that algorithms using a bidirectional search on the DNA sequence for this task outperform algorithms based on unidirectional search. The data structures supporting a bidirectional search (affix trees and affix arrays), however, are rather complex and suffer from their large space consumption. Here, we present a new data structure called bidirectional wavelet index that supports bidirectional search with much less space. With this data structure, it is possible to search for candidates of RNA secondary structural patterns in large genomes, for example the complete human genome. Another important application of this data structure is short read alignment. As a second contribution, we show how bidirectional matching statistics can be computed in linear time.
string processing and information retrieval | 2009
Enno Ohlebusch; Simon Gog
Index structures like the suffix tree or the suffix array are of utmost importance in stringology, most notably in exact string matching. In the last decade, research on compressed index structures has flourished because the main problem in many applications is the space consumption of the index. It is possible to simulate the matching of a pattern against a suffix tree on an enhanced suffix array by using range minimum queries or the so-called child table . In this paper, we show that the Super-Cartesian tree of the LCP-array (with which the suffix array is enhanced) very naturally explains the child table. More important, however, is the fact that the balanced parentheses representation of this tree constitutes a very natural compressed form of the child table which admits to locate all occ occurrences of pattern P of length m in O (m log|Σ| + occ ) time, where Σ is the underlying alphabet. Our compressed child table uses less space than previous solutions to the problem. An implementation is available.
Information Processing Letters | 2010
Enno Ohlebusch; Simon Gog
Gusfield et al. [2,3] devised an optimal O (n + m2) time algorithm for solving the following problem: Given m strings S1, S2, . . . , Sm of total length n, the all-pairs suffixprefix matching problem is the problem of finding, for all j = k with 1 j m and 1 k m, the longest suffix of S j which is a prefix of Sk . Their algorithm uses a generalized suffix tree of the input strings. Suffix trees play a prominent role in algorithmics, but they suffer from two drawbacks:
IEEE Transactions on Knowledge and Data Engineering | 2014
Simon Gog; Alistair Moffat; J. Shane Culpepper; Andrew Turpin; Anthony Wirth
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper, we describe a new two-level suffix array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure-the condensed BWT- and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text on a computer with 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data, the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX.