Cinzia Pizzi
University of Padua
Publication
Featured research published by Cinzia Pizzi.
Bioinformatics | 2009
Janne H. Korhonen; Petri Martinmäki; Cinzia Pizzi; Pasi Rastas; Esko Ukkonen
Summary: MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art online matching algorithms, achieving considerably faster scanning speed than with a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits. It can easily be adapted for different purposes and integrated into existing workflows. It can also be used as a C++ library. Availability: The package with documentation and examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. The source code is also available under the terms of a GNU General Public License (GPL).
BMC Bioinformatics | 2005
Stefania Bortoluzzi; Alessandro Coppe; Andrea Bisognin; Cinzia Pizzi; Gian Antonio Danieli
Background: Searching for approximate patterns in large promoter sequences frequently produces an exceedingly high number of results. Our aim was to exploit biological knowledge for definition of a sheltered search space and of appropriate search parameters, in order to develop a method for identification of a tractable number of sequence motifs. Results: Novel software (COOP) was developed for extraction of sequence motifs, based on clustering of exact or approximate patterns according to the frequency of their overlapping occurrences. Genomic sequences of 1 Kb upstream of 91 genes differentially expressed and/or encoding proteins with relevant function in adult human retina were analyzed. Methodology and results were tested by analysing 1,000 groups of putatively unrelated sequences, randomly selected among 17,156 human gene promoters. When applied to a sample of human promoters, the method identified 279 putative motifs frequently occurring in retina promoter sequences. Most of them are localized in the proximal portion of promoters, less variable in the central region than in the lateral regions, and similar to known regulatory sequences. COOP software and reference manual are freely available upon request from the authors. Conclusion: The approach described in this paper seems effective for identifying a tractable number of sequence motifs with putative regulatory role.
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2011
Cinzia Pizzi; Pasi Rastas; Esko Ukkonen
Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
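As a point of reference for the brute-force O(mn) baseline mentioned in the abstract above, the following is a minimal sketch of a sliding-window scan. The function name `brute_force_scan`, the list-of-dicts matrix layout, and the toy scores are assumptions made for illustration; they are not taken from the paper or from any published tool.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one dict per matrix column, mapping base -> score.
PWM = List[Dict[str, float]]


def brute_force_scan(seq: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Score every length-m window of seq and keep those reaching threshold.

    Costs O(m * n) for a matrix of length m and a sequence of length n.
    Assumes every character of seq is a key of each column dict.
    """
    m = len(pwm)
    hits = []
    for i in range(len(seq) - m + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(m))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Toy matrix (invented values) that favours the word "ACG".
toy_pwm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]
hits = brute_force_scan("TTACGACGA", toy_pwm, 6.0)
print(hits)
```

The faster algorithms described in the abstract aim to avoid scoring every window in full; this sketch only fixes what "a match" means so that such variants can be checked against it.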
Bioinformatics Research and Development | 2007
Cinzia Pizzi; Pasi Rastas; Esko Ukkonen
Fast search algorithms for finding good instances of patterns given as position specific scoring matrices are developed, and some empirical results on their performance on DNA sequences are reported. The algorithms basically generalize the Aho-Corasick, filtration, and superalphabet techniques of string matching to the scoring matrix search. As compared to the naive search, our algorithms can be faster by a factor which is proportional to the length of the pattern. In our experimental comparison of different algorithms the new algorithms were clearly faster than the naive method and also faster than the well-known lookahead scoring algorithm. The Aho-Corasick technique is the fastest for short patterns and high significance thresholds of the search. For longer patterns the filtration method is better while the superalphabet technique is the best for very long patterns and low significance levels. We also observed that the actual speed of all these algorithms is very sensitive to implementation details.
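The lookahead scoring idea used as a comparison point in the abstract above can be sketched as follows: a window is abandoned as soon as its partial score, plus the best score still attainable from the remaining columns, cannot reach the threshold. This is a simplified illustration under assumed names (`lookahead_scan`, a list-of-dicts matrix with invented scores), not the implementation evaluated in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one dict per matrix column, mapping base -> score.
PWM = List[Dict[str, float]]


def lookahead_scan(seq: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Sliding-window scan with early abandonment of hopeless windows."""
    m = len(pwm)
    # max_rest[j] = largest score obtainable from columns j .. m-1.
    max_rest = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        max_rest[j] = max_rest[j + 1] + max(pwm[j].values())

    hits = []
    for i in range(len(seq) - m + 1):
        score = 0.0
        for j in range(m):
            score += pwm[j][seq[i + j]]
            # Even a perfect suffix cannot rescue this window: stop early.
            if score + max_rest[j + 1] < threshold:
                break
        else:
            # Loop finished without a break, so score >= threshold.
            hits.append((i, score))
    return hits


# Toy matrix (invented values) that favours the word "ACG".
toy_pwm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]
lookahead_hits = lookahead_scan("TTACGACGA", toy_pwm, 6.0)
print(lookahead_hits)
```

With a high threshold most windows are rejected after one or two columns, which is consistent with the abstract's remark that performance depends on the significance level of the search.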
Theoretical Computer Science | 2008
Cinzia Pizzi; Esko Ukkonen
Position-specific scoring matrices are a popular choice for modelling signals or motifs in biological sequences, both in DNA and protein contexts. A lot of effort has been dedicated to the definition of suitable scores and thresholds for increasing the specificity of the model and the sensitivity of the search. It is quite surprising that, until very recently, little attention has been paid to the actual process of finding the matches of the matrices in a set of sequences, once the score and the threshold have been fixed. In fact, most profile matching tools still rely on a simple sliding window approach to scan the input sequences. This can be a very time expensive routine when searching for hits of a large set of scoring matrices in a sequence database. In this paper we will give a survey of proposed approaches to speed up profile matching based on statistical significance, multipattern matching, filtering, indexing data structures, matrix partitioning, Fast Fourier Transform and data compression. These approaches improve the expected searching time of profile matching, thus leading to implementation of faster tools in practice.
Algorithms for Molecular Biology | 2016
Cinzia Pizzi
Background: Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap with the introduction of approximate matches in the definition of composition-based similarity measures. Results: In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution, but it is faster in practice and still very precise.
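To make the statistic in the abstract above concrete, here is a naive baseline that computes, for each suffix of x, the length of the longest prefix of that suffix matching some substring of y with at most k mismatches. It is deliberately simple and slow, and it is not the MissMax algorithm; the function name `lcs_k_per_suffix` is an assumption made for illustration.

```python
from typing import List


def lcs_k_per_suffix(x: str, y: str, k: int) -> List[int]:
    """For each start i in x, return the longest length L such that
    x[i:i+L] and some y[j:j+L] differ in at most k positions.

    Naive baseline: tries every (i, j) pair and extends greedily.
    """
    result = []
    for i in range(len(x)):
        best = 0
        for j in range(len(y)):
            mismatches = 0
            length = 0
            while i + length < len(x) and j + length < len(y):
                if x[i + length] != y[j + length]:
                    if mismatches == k:
                        break  # a (k+1)-th mismatch would be needed
                    mismatches += 1
                length += 1
            best = max(best, length)
        result.append(best)
    return result


exact = lcs_k_per_suffix("ABCD", "ABXD", 0)
one_mismatch = lcs_k_per_suffix("ABCD", "ABXD", 1)
print(exact, one_mismatch)
```

The maximum of the returned list is the longest common substring with k mismatches, and the mean of the list is the raw quantity behind an average-common-substring measure; any further normalization is left out of this sketch.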
Discrete Applied Mathematics | 2007
Alberto Apostolico; Cinzia Pizzi
The detection of frequent patterns such as motifs and higher aggregates is of paramount interest in biology and extends to many other applications of automated discovery. The problem and its variants are usually plagued by a heavy computational burden. A related difficulty is posed by the fact that, due to the sheer mass of candidates, the tables and indices at the outset tend to be bulky, unmanageable, and ultimately uninformative. For solid patterns, it is possible to compact the size of statistical indices by resorting to certain monotonicities exhibited by popular scores. The savings come from the fact that these monotonicities enable one to partition the candidate over-represented words into families in such a way that it suffices to consider and weigh only one candidate per family. In this paper, we study the problem of extracting, from a given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent query: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k^2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for the probability and expected frequency of a substring under extension, increased number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns.
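The probabilities discussed in the abstract above can be illustrated with a direct dynamic program for a single word under an i.i.d. source. This sketch takes O(mk) time per word of length m and is not the O(nk) preprocessing or O(k^2) query scheme of the paper; the function name `mismatch_probabilities` and the uniform symbol probabilities are assumptions made for illustration.

```python
from typing import Dict, List


def mismatch_probabilities(word: str, probs: Dict[str, float], k: int) -> List[float]:
    """Return f where f[j] is the probability that a random window drawn
    from an i.i.d. source differs from word in exactly j positions (j <= k).
    """
    f = [1.0] + [0.0] * k
    for ch in word:
        p = probs[ch]  # probability that the source emits this symbol
        # Update high j first so f[j - 1] still refers to the previous prefix.
        for j in range(k, 0, -1):
            f[j] = f[j] * p + f[j - 1] * (1.0 - p)
        f[0] *= p
    return f


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
exactly = mismatch_probabilities("AC", uniform, 1)
up_to_k = sum(exactly)

# Expected number of windows within k mismatches in a text of length n,
# by linearity of expectation over the n - m + 1 window positions.
n, m = 10, 2
expected_count = (n - m + 1) * up_to_k
print(exactly, up_to_k, expected_count)
```

Summing the first k + 1 entries gives the "up to k mismatches" probability, and multiplying by the number of window positions gives the expected occurrence count under the same i.i.d. assumption.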
Theoretical Computer Science | 2014
Laxmi Parida; Cinzia Pizzi; Simona E. Rombo
Eliminating the possible redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. The existing techniques proposed to extract irredundant motifs are not suitable when the motifs to search for are structured, i.e., they are made of two (or several) subwords that co-occur in a text string s of length n. The main effort of this work is studying and characterizing a compact class of tandem motifs, that is, pairs of substrings occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given in input. To this aim, we first introduce the concept of maximality, related to four specific conditions that hold only for this class of motifs. Then, we eliminate the remaining redundancy by defining the notion of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandem motifs is O(d^2n) which, considering d as a constant, leads to a linear number of tandems in the length of the input string. This is an order of magnitude less than previously developed compact indexes for tandem extraction. The notions and bounds provided for tandem motifs are generalized for the case r>=2, if r is the number of subwords composing the motifs. Finally, we also provide an algorithm to extract irredundant tandem motifs.
Bioinformatics | 2016
Samuele Girotto; Cinzia Pizzi; Matteo Comin
Motivation: Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Taxonomic analysis of microbial communities, a process referred to as binning, is one of the most challenging tasks when analyzing metagenomic reads data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species and the limitations due to short read lengths and sequencing errors. Results: MetaProb is a novel assembly-assisted tool for unsupervised metagenomic binning. The novelty of MetaProb derives from solving a few important problems: how to divide reads into groups of independent reads, so that k-mer frequencies are not overestimated; how to convert k-mer counts into probabilistic sequence signatures, that will correct for variable distribution of k-mers, and for unbalanced groups of reads, in order to produce better estimates of the underlying genome statistic; how to estimate the number of species in a dataset. We show that MetaProb is more accurate and efficient than other state-of-the-art tools in binning both short reads datasets (F-measure 0.87) and long reads datasets (F-measure 0.97) for various abundance ratios. Also, the estimation of the number of species is more accurate than MetaCluster. On a real human stool dataset MetaProb identifies the most predominant species, in line with previous human gut studies. Availability and implementation: https://bitbucket.org/samu661/metaprob. Supplementary information: Supplementary data are available at Bioinformatics online.
Data Compression Conference | 2014
Alberto Apostolico; Concettina Guerra; Cinzia Pizzi
A growing number of measures of sequence similarity is being based on some underlying notion of relative compressibility. Within this paradigm, similar sequences are expected to share a large number of common substrings, or subsequences, or more complex patterns or motifs, and so on. The computational complexity of such measures varies, and it increases with the complexity of the patterns taken into account. At the low end of the spectrum, most measures based on the bags of shared substrings are typically afforded in linear time. This performance is no longer achievable as soon as some degree of distortion is accepted. In this paper, measures of sequence similarity are introduced and studied in which patterns in a pair are considered similar if they coincide up to a preset number of mismatches, that is, within a bounded Hamming distance. It is shown here that for some such measures bounds are achievable that are slightly better than O(n^2). Preliminary experiments demonstrate the potential applicability to phylogeny and classification of similarity measures that are rougher than previously adopted ones.