Simon J. Puglisi
University of Helsinki
Publications
Featured research published by Simon J. Puglisi.
Bioinformatics | 2009
Jan Schröder; Heiko Schröder; Simon J. Puglisi; Ranjan Sinha; Bertil Schmidt
MOTIVATION: Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for fast and accurate computational methods and tools for error correction of large amounts of short-read data. RESULTS: We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96%, respectively, for simulated data with error rates of up to 3%, as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.
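SHREC itself builds a generalized suffix trie over the read set; as a rough illustration of the same frequency-based intuition (a substring introduced by a sequencing error occurs far less often across overlapping reads than its correct counterparts), the sketch below flags reads that contain rare k-mers using a flat k-mer table instead of a trie. The read set, k, and the count threshold are hypothetical choices for the example, not values from the paper.

```python
from collections import Counter

def flag_suspect_reads(reads, k=4, min_count=2):
    """Flag reads containing a k-mer rarer than min_count.

    Illustration of frequency-based error detection only: a
    sequencing error creates k-mers that are rare across the whole
    read set, while correct k-mers recur in overlapping reads.
    (SHREC itself uses a generalized suffix trie, not a flat
    k-mer table.)
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    suspects = []
    for read in reads:
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if any(counts[km] < min_count for km in kmers):
            suspects.append(read)
    return suspects

# Toy example: three overlapping error-free reads and one read
# with a single substitution (G -> T at position 6).
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "ACGTACTTAC"]
print(flag_suspect_reads(reads))  # flags only the erroneous read
```

A real corrector would then go further and replace the rare k-mers with their closest frequent neighbours; this sketch stops at detection.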
string processing and information retrieval | 2010
Shanika Kuruppu; Simon J. Puglisi; Justin Zobel
Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data, but in their usual form they are not able to exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totalling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just nH_k(T) + s log n + s log(N/s) + O(s) bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires O(ℓ + log n) time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
combinatorial pattern matching | 2009
Juha Kärkkäinen; Giovanni Manzini; Simon J. Puglisi
The longest-common-prefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array, in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space-efficient algorithms. We also prove a new combinatorial property of the LCP values.
string processing and information retrieval | 2009
Travis Gagie; Simon J. Puglisi; Andrew Turpin
We show how to use a balanced wavelet tree as a data structure that stores a list of numbers and supports efficient range quantile queries. A range quantile query takes a rank and the endpoints of a sublist and returns the number with that rank in that sublist. For example, if the rank is half the sublist's length, then the query returns the sublist's median. We also show how these queries can be used to support space-efficient coloured range reporting and document listing.
Theoretical Computer Science | 2012
Travis Gagie; Gonzalo Navarro; Simon J. Puglisi
Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that current usage still falls short of exploiting all of the virtues of this versatile data structure. In particular, we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in information retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
Theoretical Computer Science | 2008
Simon J. Puglisi; Jamie Simpson; William F. Smyth
Bioinformatics | 2009
Bertil Schmidt; Ranjan Sinha; Bryan Beresford-Smith; Simon J. Puglisi
symposium on experimental and efficient algorithms | 2011
Gonzalo Navarro; Simon J. Puglisi; Daniel Valenzuela
Theoretical Computer Science | 2014
Jinil Kim; Peter Eades; Rudolf Fleischer; Seok-Hee Hong; Costas S. Iliopoulos; Kunsoo Park; Simon J. Puglisi; Takeshi Tokuyama
foundations of software engineering | 2007
Hamid Abdul Basit; Simon J. Puglisi; William F. Smyth; Andrew Turpin; Stan Jarzabek
Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple, scalable, and performs better than previous well-known tools.
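As a minimal sketch of the string-algorithmic idea behind token-based clone detection, the code below builds a suffix array over a token sequence and compares lexicographically adjacent suffixes: a long common prefix of two adjacent suffixes is a token run occurring at least twice, i.e. a clone candidate. This uses a naive O(n² log n) suffix-array construction for clarity; RTF uses linear-time construction and match detection, plus a real tokenizer. The token stream and minimum run length are made-up example inputs.

```python
def repeated_runs(tokens, min_len=2):
    """Report repeated token runs via a suffix array.

    Naive sketch: sort all suffixes, then compare lexicographically
    adjacent suffixes; a common prefix of length >= min_len is a
    token sequence occurring at least twice (a clone candidate).
    RTF instead uses linear-time suffix-array algorithms.
    """
    n = len(tokens)
    sa = sorted(range(n), key=lambda i: tokens[i:])  # suffix array
    repeats = set()
    for a, b in zip(sa, sa[1:]):
        # length of the common prefix of suffixes a and b
        lcp = 0
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            repeats.add(tuple(tokens[a:a + lcp]))
    return repeats

# Toy token stream containing the duplicated fragment "( x )".
toks = ["if", "(", "x", ")", "f", "(", "x", ")"]
print(repeated_runs(toks, min_len=3))
```

Note that with a smaller min_len the sub-run ("x", ")") is also reported; real clone detectors post-process such nested repeats into maximal clone classes.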