Publication


Featured research published by Filippo Utro.


Nature Medicine | 2012

Genetic inactivation of the polycomb repressive complex 2 in T cell acute lymphoblastic leukemia

Panagiotis Ntziachristos; Aristotelis Tsirigos; Pieter Van Vlierberghe; Jelena Nedjic; Thomas Trimarchi; Maria Sol Flaherty; Dolors Ferres-Marco; Vanina Gabriela Da Ros; Zuojian Tang; Jasmin Siegle; Patrik Asp; Michael Hadler; Isaura Rigo; Kim De Keersmaecker; Jay Patel; Tien Huynh; Filippo Utro; Sandrine Poglio; Jeremy B. Samon; Elisabeth Paietta; Janis Racevskis; Jacob M. Rowe; Raul Rabadan; Ross L. Levine; Stuart M. Brown; Françoise Pflumio; M.I. Domínguez; Adolfo A. Ferrando; Iannis Aifantis

T-cell acute lymphoblastic leukemia (T-ALL) is an immature hematopoietic malignancy driven mainly by oncogenic activation of NOTCH1 signaling. In this study we report the presence of loss-of-function mutations and deletions of the EZH2 and SUZ12 genes, which encode critical components of the Polycomb Repressive Complex 2 (PRC2), in 25% of T-ALLs. To further study the role of PRC2 in T-ALL, we used NOTCH1-induced animal models of the disease, as well as human T-ALL samples, and combined locus-specific and global analysis of NOTCH1-driven epigenetic changes. These studies demonstrated that activation of NOTCH1 specifically induces loss of the repressive mark histone 3 lysine-27 trimethylation (H3K27me3) by antagonizing the activity of PRC2. Together, these results demonstrate a tumor suppressor role for PRC2 in human leukemia and suggest a hitherto unrecognized dynamic interplay between oncogenic NOTCH1 and PRC2 function in the regulation of gene expression and cell transformation.


Genome Biology | 2013

The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color

Juan Carlos Motamayor; Keithanne Mockaitis; Jeremy Schmutz; Niina Haiminen; Donald Livingstone; Omar E. Cornejo; Seth D. Findley; Ping Zheng; Filippo Utro; Stefan Royaert; Christopher A. Saski; Jerry Jenkins; Ram Podicheti; Meixia Zhao; Brian E. Scheffler; Joseph C Stack; Frank Alex Feltus; Guiliana Mustiga; Freddy Amores; Wilbert Phillips; Jean Philippe Marelli; Gregory D. May; Howard Shapiro; Jianxin Ma; Carlos Bustamante; Raymond J. Schnell; Dorrie Main; Don Gilbert; Laxmi Parida; David N. Kuhn

Background: Theobroma cacao L. cultivar Matina 1-6 belongs to the most widely cultivated cacao type. The availability of its genome sequence, and of methods for identifying genes responsible for important cacao traits, will aid cacao researchers and breeders.

Results: We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than the sequenced Criollo cultivar and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10 times the scaffold N50 and 4 times the contig N50 of the Criollo assembly, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping, and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within TcMYB113, located in the target site of a trans-acting siRNA that is highly conserved in dicots, appears to affect transcript levels of this gene and therefore pod color variation.

Conclusions: We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.
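
The N50 metrics quoted above have a simple operational definition: the length such that contigs or scaffolds at least that long cover half of the assembled bases. A minimal sketch, with toy numbers rather than the actual cacao assembly data:

```python
def n50(lengths):
    """N50: the length L such that pieces of length >= L
    cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example (not the real assembly data):
scaffold_lengths = [40_000_000, 35_000_000, 20_000_000, 5_000_000, 1_000_000]
print(n50(scaffold_lengths))  # 35000000: the two largest scaffolds cover half of 101 Mb
```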


Bioinformatics | 2009

Textual data compression in computational biology: a synopsis.

Raffaele Giancarlo; Davide Scaturro; Filippo Utro

MOTIVATION: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest only for data communication and storage. However, they are also deeply related to classification, data mining, and analysis. In recent years, a substantial effort has been made to apply textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks.

RESULTS: The main focus of this review is a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. Where possible, a unifying organization of the main ideas and techniques is also provided.

AVAILABILITY: Most of the research results reviewed here are accompanied by software prototypes for the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest, and also gives a brief presentation of some fundamental results and techniques related to this paper. It is available at http://www.math.unipa.it/~raffaele/suppMaterial/compReview/
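
To make the connection between compression and classification concrete, here is a minimal sketch of the normalized compression distance (NCD), a classic compression-based similarity measure of the kind this review surveys. zlib stands in for an arbitrary compressor, and the example sequences are made up:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes under zlib at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)

# Similar sequences compress well together, so their NCD is smaller:
a = b"ACGTACGTACGT" * 50
b = b"ACGTACGTACGA" * 50
c = b"TTGGCCAATTGG" * 50
print(ncd(a, b) < ncd(a, c))  # expected: True
```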


Algorithms for Molecular Biology | 2011

Speeding up the Consensus Clustering methodology for microarray data analysis

Raffaele Giancarlo; Filippo Utro

Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature in this area, the identification of an internal validation measure that is both fast and precise has proved elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is to provide a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up.

Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus: a closely related algorithm, FC (Fast Consensus), that has the same precision as Consensus with substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision as Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations.

Conclusions: In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.
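
The core of the Consensus methodology referenced above can be sketched compactly: repeatedly subsample the data, cluster each subsample, and record how often pairs of items co-cluster. The following is a minimal illustrative sketch using scikit-learn's KMeans as the base clusterer; it mirrors plain Consensus, not the FC speed-up itself, and the parameter names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_resamples=50, subsample=0.8, seed=0):
    """Consensus matrix M: M[i, j] = fraction of resamples in which
    items i and j were drawn together AND assigned to the same cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # co-clustered counts
    drawn = np.zeros((n, n))      # co-sampled counts
    m = int(subsample * n)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=m, replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        for a in range(m):
            for b in range(a + 1, m):
                i, j = idx[a], idx[b]
                drawn[i, j] += 1
                drawn[j, i] += 1
                if labels[a] == labels[b]:
                    together[i, j] += 1
                    together[j, i] += 1
    with np.errstate(invalid="ignore"):
        M = np.where(drawn > 0, together / drawn, 0.0)
    np.fill_diagonal(M, 1.0)
    return M
```

The matrix 1 - M can then serve as the dissimilarity matrix mentioned in the abstract; how close the entries of M are to 0 or 1 at a candidate k is what the methodology exploits to predict the number of clusters.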


BMC Bioinformatics | 2008

Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer

Raffaele Giancarlo; Davide Scaturro; Filippo Utro

Background: Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have recently been proposed, some of them specifically for microarray data.

Results: We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-Squares) and KL (Krzanowski and Lai index). We perform extensive experiments on six benchmark microarray datasets, using both Hierarchical and K-means clustering algorithms, and we provide an analysis assessing both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. We pay particular attention to both precision and speed. Moreover, we also provide various fast approximation algorithms for the computation of Gap, FOM and WCSS. The main result is a hierarchy of those measures in terms of precision and speed, highlighting some of their merits and limitations not previously reported in the literature.

Conclusion: Based on our analysis, we draw several conclusions for the use of those internal measures on microarray data; we report the main ones here. Consensus is by far the best performer in terms of predictive power and is remarkably algorithm-independent. Unfortunately, on large datasets, it may be of no use because of its non-trivial computational demand (weeks on a state-of-the-art PC). FOM is the second best performer although, quite surprisingly, it may not be competitive in this scenario: it has essentially the same predictive power as WCSS but is 6 to 100 times slower, depending on the dataset. The approximation algorithms for the computation of FOM, Gap and WCSS perform very well, i.e., they are faster while still granting a very close approximation of FOM and WCSS. The approximation algorithm for the computation of Gap deserves to be singled out, since it has a predictive power far better than Gap, it is competitive with the other measures, and it is at least two orders of magnitude faster than Gap. Another important conclusion that can be drawn from our analysis is that all the measures we have considered show severe limitations on large datasets, either due to computational demand (Consensus, as already mentioned, Clest and Gap) or to lack of precision (all of the other measures, including their approximations). The software and datasets are available under the GNU GPL on the supplementary material web page.
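
For readers unfamiliar with the measures compared above, WCSS and Gap are the easiest to state. A minimal, simplified sketch follows (uniform reference distribution over the bounding box, scikit-learn's KMeans; this illustrates the textbook definitions, not the paper's approximation algorithms):

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(X, k):
    """Within-Cluster Sum-of-Squares for a k-means clustering of X."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km.inertia_  # sklearn's inertia_ is exactly the WCSS

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = mean(log WCSS of uniform reference data) - log WCSS(X, k)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = [np.log(wcss(rng.uniform(lo, hi, size=X.shape), k))
                for _ in range(n_refs)]
    return float(np.mean(ref_logs) - np.log(wcss(X, k)))

# Model selection: evaluate Gap over a range of k and pick, e.g.,
# the first k where the statistic stops improving:
# gaps = [gap_statistic(X, k) for k in range(1, 11)]
```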


Briefings in Bioinformatics | 2014

Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies

Raffaele Giancarlo; Simona E. Rombo; Filippo Utro

High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature, and, for each direction, we give an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. The performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.
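
The simplest storage-oriented idea in this space is fixed-width bit packing of the DNA alphabet, two bits per base; a minimal sketch follows. The real compressors reviewed in the paper go far beyond this, using reference-based and entropy-coding schemes:

```python
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DEC = {v: k for k, v in ENC.items()}

def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | ENC[base]
        byte <<= 2 * (4 - len(group))  # left-align a final partial group
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    """Inverse of pack; n_bases tells us where the padding starts."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DEC[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

seq = "ACGTGGTCA"
assert unpack(pack(seq), len(seq)) == seq  # round-trips: 9 bases in 3 bytes
```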


Bioinformatics | 2012

GenomicTools: a computational platform for developing high-throughput analytics in genomics

Aristotelis Tsirigos; Niina Haiminen; Erhan Bilal; Filippo Utro

MOTIVATION: Recent advances in sequencing technology have resulted in a dramatic increase of sequencing data, which, in turn, requires efficient management of computational resources, such as computing time and memory, as well as the prototyping of computational pipelines.

RESULTS: We present GenomicTools, a flexible computational platform, comprising both a command-line set of tools and a C++ API, for the analysis and manipulation of high-throughput sequencing data such as DNA-seq, RNA-seq, ChIP-seq and MethylC-seq. GenomicTools implements a variety of mathematical operations between sets of genomic regions, thereby enabling the prototyping of computational pipelines that can address a wide spectrum of tasks, ranging from pre-processing and quality control to meta-analyses. Additionally, the GenomicTools platform is designed to analyze datasets of any size by minimizing memory requirements. In practical applications, where comparable, GenomicTools outperforms existing tools in terms of both time and memory usage.

AVAILABILITY: The GenomicTools platform (version 2.0.0) is implemented in C++. The source code, documentation, user manual, example datasets and scripts are available online at http://code.google.com/p/ibm-cbc-genomic-tools.
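
As an illustration of the kind of region algebra such a platform provides, here is a generic sketch of one such operation, intersection of two genomic region sets via a linear sweep. This is not GenomicTools' actual API, just the standard memory-frugal approach to the problem:

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open

def intersect(a: List[Region], b: List[Region]) -> List[Region]:
    """Overlaps between two region sets via a linear sweep over sorted
    input. Assumes regions within each set do not overlap one another
    (as in merged BED tracks)."""
    a, b = sorted(a), sorted(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        (ca, sa, ea), (cb, sb, eb) = a[i], b[j]
        if ca == cb:
            lo, hi = max(sa, sb), min(ea, eb)
            if lo < hi:
                out.append((ca, lo, hi))
            if ea <= eb:   # advance whichever region ends first
                i += 1
            else:
                j += 1
        elif (ca, sa) < (cb, sb):
            i += 1
        else:
            j += 1
    return out

peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
genes = [("chr1", 150, 550), ("chr2", 0, 1000)]
print(intersect(peaks, genes))  # [('chr1', 150, 200), ('chr1', 500, 550)]
```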


Theoretical Computer Science | 2012

Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis

Raffaele Giancarlo; Filippo Utro

The advent of high-throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central to statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) reductions showing that all of the known methods in this class are instances of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (D) a general algorithmic paradigm that describes the heuristic and very effective speed-ups known in the literature for stability-based model selection methods. Since the performance evaluation of model selection algorithms is mainly experimental, we offer, for completeness and without attempting to be exhaustive, a representative synopsis of known experimental benchmarking results that highlight the ability of stability-based methods for model selection and the computational resources that they require for the task. As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures.
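
The shared skeleton of the stability-based methods discussed above is: perturb the data, re-cluster, and score the agreement of the resulting clusterings for each candidate number of clusters. A minimal illustrative sketch of one possible instantiation, using subsampling and the adjusted Rand index (this is our simplification, not the paper's paradigm verbatim):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=20, subsample=0.8, seed=0):
    """Average clustering agreement over pairs of random subsamples.

    Two overlapping subsamples are clustered independently; the labels
    on their shared items are compared with the adjusted Rand index.
    Higher average agreement suggests the structure at this k is stable."""
    rng = np.random.default_rng(seed)
    n, m = X.shape[0], int(subsample * X.shape[0])
    scores = []
    for _ in range(n_pairs):
        i1 = rng.choice(n, size=m, replace=False)
        i2 = rng.choice(n, size=m, replace=False)
        l1 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i1])
        l2 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i2])
        shared = np.intersect1d(i1, i2)      # items drawn in both samples
        p1 = {v: p for p, v in enumerate(i1)}
        p2 = {v: p for p, v in enumerate(i2)}
        scores.append(adjusted_rand_score(
            [l1[p1[s]] for s in shared],
            [l2[p2[s]] for s in shared]))
    return float(np.mean(scores))

# Model selection: evaluate stability(X, k) for k = 2..K and pick the
# most stable k, breaking ties in favour of the simpler model.
```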


Bioinformatics | 2015

Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning

Raffaele Giancarlo; Simona E. Rombo; Filippo Utro

MOTIVATION: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well-established role in genomic and proteomic studies, but much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome-favoring and nucleosome-disfavoring sequences, with respect to their basic building blocks, still remain open.

RESULTS: We present the first analysis of the role of k-mers in the composition of nucleosome-enriched and nucleosome-depleted genomic regions (NER and NDR, for short) that is: (i) exhaustive, within the bounds dictated by the information-theoretic content of the sample sets we use, and (ii) informative for comparative epigenomics. We analyze four different organisms and propose a paradigmatic formalization of k-mer dictionaries, providing two different and complementary views of the k-mers involved in NER and NDR. The first extends well-known studies in this area, its comparative nature being its major merit. The second, quite novel, brings to light the rich variety of k-mers involved in influencing nucleosome positioning, for which an initial classification in terms of clusters is also provided. Although such a classification offers many insights, the following deserves to be singled out: short poly(dA:dT) tracts are reported in the literature as fundamental for nucleosome depletion; however, a global quantitative look reveals that their role is much less prominent than one would expect based on previous studies.

AVAILABILITY AND IMPLEMENTATION: Dictionaries, clusters and Supplementary Material are available online at http://math.unipa.it/rombo/epigenomics/.

CONTACT: [email protected]

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
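
The basic object of study, a k-mer dictionary with counts, and a simple NER-versus-NDR enrichment score over it can be sketched as follows. This is an illustrative formulation with a pseudocount of our choosing, not the paper's formalization:

```python
import math
from collections import Counter

def kmer_dict(sequences, k):
    """Count all overlapping k-mers (forward strand) in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(ner_seqs, ndr_seqs, k, pseudo=1.0):
    """Log2-ratio of k-mer frequencies between nucleosome-enriched (NER)
    and nucleosome-depleted (NDR) regions, smoothed with a pseudocount."""
    a, b = kmer_dict(ner_seqs, k), kmer_dict(ndr_seqs, k)
    kmers = set(a) | set(b)
    ta = sum(a.values()) + pseudo * len(kmers)
    tb = sum(b.values()) + pseudo * len(kmers)
    return {w: math.log2(((a[w] + pseudo) / ta) / ((b[w] + pseudo) / tb))
            for w in kmers}

# Under this convention, a k-mer depleted in NER relative to NDR
# (e.g. a poly(dA:dT) tract such as "AAAAA") would score negative.
```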


Computational Intelligence Methods for Bioinformatics and Biostatistics | 2010

The three steps of clustering in the post-genomic era: a synopsis

Raffaele Giancarlo; G. Lo Bosco; Luca Pinello; Filippo Utro

Clustering is one of the most well-known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three-step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention if inferences made from cluster analysis are to be of relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particularly difficult. This paper highlights new findings that seem to address a few relevant problems in each of the three steps mentioned above, both with regard to the intrinsic predictive power of methods and algorithms and to their time performance. The inclusion of this latter aspect in the evaluation process is quite novel, since it is rarely considered in genomic data analysis.
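
The three steps can be made concrete as a short pipeline; the particular choices below (correlation distance, average-linkage hierarchical clustering, silhouette validation) are illustrative, not prescribed by the paper:

```python
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def three_step_clustering(X, k):
    # (a) choice of a distance function: correlation distance is a
    #     common choice for expression profiles
    d = pdist(X, metric="correlation")
    # (b) choice of a clustering algorithm: average-linkage hierarchical
    Z = linkage(d, method="average")
    labels = fcluster(Z, t=k, criterion="maxclust")
    # (c) choice of a validation method: silhouette on the same distances
    score = silhouette_score(squareform(d), labels, metric="precomputed")
    return labels, score

# Compare scores over a range of k to choose the number of clusters:
# best_k = max(range(2, 11), key=lambda k: three_step_clustering(X, k)[1])
```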
