Giovanna Rosone | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Giovanna Rosone is active.

Explore More

Publication

Featured researches published by Giovanna Rosone.

Theoretical Computer Science | 2007

An extension of the Burrows–Wheeler Transform

Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

We describe and highlight a generalization of the Burrows-Wheeler Transform (bwt) to a multiset of words. The extended transformation, denoted by ebwt, is reversible. Moreover, it allows to define a bijection between the words over a finite alphabet A and the finite multisets of conjugacy classes of primitive words in A^*. Besides its mathematical interest, the extended transform can be useful for applications in the context of string processing. In the last part of this paper we illustrate one such application, providing a similarity measure between sequences based on ebwt.

Theoretical Computer Science | 2013

Lightweight algorithms for constructing and inverting the BWT of string collections

Markus J. Bauer; Anthony J. Cox; Giovanna Rosone

Recent progress in the field of DNA sequencing motivates us to consider the problem of computing the Burrows-Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the BWT for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the BWT of a string collection in a lightweight fashion. A first implementation of our algorithm needs O(mlogm) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve RAM usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the BWT a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the BWT of the original collection can be efficiently updated to obtain the BWT of the revised collection.

combinatorial pattern matching | 2011

Lightweight BWT construction for very large string collections

Markus J. Bauer; Anthony J. Cox; Giovanna Rosone

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.

combinatorial pattern matching | 2005

An extension of the burrows wheeler transform and applications to sequence comparison and data compression

Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

We introduce a generalization of the Burrows-Wheeler Transform (BWT) that can be applied to a multiset of words. The extended transformation, denoted by E, is reversible, but, differently from BWT, it is also surjective. The E transformation allows to give a definition of distance between two sequences, that we apply here to the problem of the whole mitochondrial genome phylogeny. Moreover we give some consideration about compressing a set of words by using the E transformation as preprocessing.

Theory of Computing Systems \/ Mathematical Systems Theory | 2008

A New Combinatorial Approach to Sequence Comparison

Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

Abstract In this paper we introduce a new alignment-free method for comparing sequences which is combinatorial by nature and does not use any compressor nor any information-theoretic notion. Such a method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of Data Compression. The new extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. By using such a transformation we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output string. Such a method is tested on a real data set for the whole mitochondrial genome phylogeny problem. However, the goal of this paper is to introduce a new and general methodology for automatic categorization of sequences.

Bioinformatics | 2014

Adaptive reference-free compression of sequence quality scores

Lilian Janin; Giovanna Rosone; Anthony J. Cox

MOTIVATION Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and, hence, are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. As our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is, therefore, applicable to data from metagenomics and de novo experiments as well as to re-sequencing data. RESULTS We show that a conservative smoothing strategy affecting 75% of the quality scores above Q2 leads to an overall quality score compression of 1 bit per value with a negligible effect on variant calling. A compression of 0.68 bit per quality value is achieved using a more aggressive smoothing strategy, again with a very small effect on variant calling. AVAILABILITY Code to construct the BWT and LCP-array on large genomic data sets is part of the BEETL library, available as a github repository at [email protected]:BEETL/BEETL.git.

workshop on algorithms in bioinformatics | 2012

Comparing DNA sequence collections by direct comparison of compressed text indexes

Anthony J. Cox; Tobias Jakobi; Giovanna Rosone; Ole Schulz-Trieglaff

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However the utility of also indexing the query sequences themselves remains relatively unexplored. Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods. Code to construct and compare the BWT of large genomic data sets is available at http://beetl.github.com/BEETL/ as part of the BEETL library.

developments in language theory | 2013

Suffixes, Conjugates and Lyndon Words

Silvia Bonomo; Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

In this paper we are interested in the study of the combinatorial aspects connecting three important constructions in the field of string algorithms: the suffix array, the Burrows-Wheeler transform (BWT) and the extended Burrows-Wheeler transform (EBWT). Such constructions involve the notions of suffixes and conjugates of words and are based on two different order relations, denoted by < lex and ≺ ω , that, even if strictly connected, are quite different from the computational point of view. In this study an important role is played by Lyndon words. In particular, we improve the upper bound on the number of symbol comparisons needed to establish the ≺ ω order between two primitive words by using a preliminary knowledge of the < lex order of the corresponding Lyndon conjugates. Moreover, we propose an algorithm that efficiently sorts, according to the ≺ ω order, the list of conjugates of a multiset of Lyndon words. Finally, we show that the Lyndon factorization of a word helps the construction of its suffix array, allowing a reduction of the number of symbol comparisons needed to lexicographically sort the suffixes of the word.

italian conference on theoretical computer science | 2005

A new combinatorial approach to sequence comparison

Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

In this paper we introduce a new alignment-free method for comparing sequences which is combinatorial by nature and does not use any compressor nor any information-theoretic notion. Such a method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of Data Compression. The new extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. By using such a transformation we define a measure to compare sequences that takes into account how the characters coming from different input sequences are mixed in the output string. Such a method is tested on a real data set for the whole mitochondrial genome phylogeny problem. However, the goal of this paper is to introduce a new and general methodology for automatic categorization of sequences.

Journal of Discrete Algorithms | 2014

Suffix array and Lyndon factorization of a text

Sabrina Mantaci; Antonio Restivo; Giovanna Rosone; Marinella Sciortino

Abstract The main goal of this paper is to highlight the relationship between the suffix array of a text and its Lyndon factorization. It is proved in [15] that one can obtain the Lyndon factorization of a text from its suffix array. Conversely, here we show a new method for constructing the suffix array of a text that takes advantage of its Lyndon factorization. The surprising consequence of our results is that, in order to construct the suffix array, the local suffixes inside each Lyndon factor can be separately processed, allowing different implementative scenarios, such as online, external and internal memory, or parallel implementations. Based on our results, the algorithm that we propose sorts the suffixes by starting from the leftmost Lyndon factors, even if the whole text or the complete Lyndon factorization are not yet available.

Explore More