Simon J. Puglisi
University of Helsinki
Publications
Featured research published by Simon J. Puglisi.
Bioinformatics | 2009
Jan Schröder; Heiko Schröder; Simon J. Puglisi; Ranjan Sinha; Bertil Schmidt
MOTIVATION: Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for fast and accurate computational methods and tools for error correction of large amounts of short-read data. RESULTS: We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96%, respectively, for simulated data with error rates of up to 3%, as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.
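SHREC itself builds a generalized suffix trie over the read set; as a rough illustration of the same frequency-based intuition (a substring introduced by a sequencing error occurs far less often across overlapping reads than its correct counterparts), the sketch below flags reads that contain rare k-mers using a flat k-mer table instead of a trie. The read set, k, and the count threshold are hypothetical choices for the example, not values from the paper.

```python
from collections import Counter

def flag_suspect_reads(reads, k=4, min_count=2):
    """Flag reads containing a k-mer rarer than min_count.

    Illustration of frequency-based error detection only: a
    sequencing error creates k-mers that are rare across the whole
    read set, while correct k-mers recur in overlapping reads.
    (SHREC itself uses a generalized suffix trie, not a flat
    k-mer table.)
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    suspects = []
    for read in reads:
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if any(counts[km] < min_count for km in kmers):
            suspects.append(read)
    return suspects

# Toy example: three overlapping error-free reads and one read
# with a single substitution (G -> T at position 6).
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "ACGTACTTAC"]
print(flag_suspect_reads(reads))  # flags only the erroneous read
```

A real corrector would then go further and replace the rare k-mers with their closest frequent neighbours; this sketch stops at detection.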
string processing and information retrieval | 2010
Shanika Kuruppu; Simon J. Puglisi; Justin Zobel
Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data, but in their usual form they are not able to exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totalling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just nH_k(T) + s log n + s log(N/s) + O(s) bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires O(ℓ + log n) time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
combinatorial pattern matching | 2009
Juha Kärkkäinen; Giovanni Manzini; Simon J. Puglisi
The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array, in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space-efficient algorithms. We also prove a new combinatorial property of the LCP values.
string processing and information retrieval | 2009
Travis Gagie; Simon J. Puglisi; Andrew Turpin
We show how to use a balanced wavelet tree as a data structure that stores a list of numbers and supports efficient range quantile queries. A range quantile query takes a rank and the endpoints of a sublist and returns the number with that rank in that sublist. For example, if the rank is half the sublist's length, then the query returns the sublist's median. We also show how these queries can be used to support space-efficient coloured range reporting and document listing.
Theoretical Computer Science | 2012
Travis Gagie; Gonzalo Navarro; Simon J. Puglisi
Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that current usage still falls short of exploiting all of the virtues of this versatile data structure. In particular, we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in information retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
Theoretical Computer Science | 2008
Simon J. Puglisi; Jamie Simpson; William F. Smyth
Bioinformatics | 2009
Bertil Schmidt; Ranjan Sinha; Bryan Beresford-Smith; Simon J. Puglisi
symposium on experimental and efficient algorithms | 2011
Gonzalo Navarro; Simon J. Puglisi; Daniel Valenzuela
Theoretical Computer Science | 2014
Jinil Kim; Peter Eades; Rudolf Fleischer; Seok-Hee Hong; Costas S. Iliopoulos; Kunsoo Park; Simon J. Puglisi; Takeshi Tokuyama
foundations of software engineering | 2007
Hamid Abdul Basit; Simon J. Puglisi; William F. Smyth; Andrew Turpin; Stan Jarzabek
Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple, scalable, and performs better than previous well-known tools.
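As a minimal sketch of the string-algorithmic idea behind token-based clone detection, the code below builds a suffix array over a token sequence and compares lexicographically adjacent suffixes: a long common prefix of two adjacent suffixes is a token run occurring at least twice, i.e. a clone candidate. This uses a naive O(n² log n) suffix-array construction for clarity; RTF uses linear-time construction and match detection, plus a real tokenizer. The token stream and minimum run length are made-up example inputs.

```python
def repeated_runs(tokens, min_len=2):
    """Report repeated token runs via a suffix array.

    Naive sketch: sort all suffixes, then compare lexicographically
    adjacent suffixes; a common prefix of length >= min_len is a
    token sequence occurring at least twice (a clone candidate).
    RTF instead uses linear-time suffix-array algorithms.
    """
    n = len(tokens)
    sa = sorted(range(n), key=lambda i: tokens[i:])  # suffix array
    repeats = set()
    for a, b in zip(sa, sa[1:]):
        # length of the common prefix of suffixes a and b
        lcp = 0
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            repeats.add(tuple(tokens[a:a + lcp]))
    return repeats

# Toy token stream containing the duplicated fragment "( x )".
toks = ["if", "(", "x", ")", "f", "(", "x", ")"]
print(repeated_runs(toks, min_len=3))
```

Note that with a smaller min_len the sub-run ("x", ")") is also reported; real clone detectors post-process such nested repeats into maximal clone classes.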