Annarita D'Addabbo
University of Bari
Publications
Featured research published by Annarita D'Addabbo.
BMC Bioinformatics | 2009
Luca Abatangelo; Rosalia Maglietta; Angela Distaso; Annarita D'Addabbo; Teresa Maria Creanza; Sayan Mukherjee; Nicola Ancona
Background: The analysis of high-throughput gene expression data with respect to sets of genes rather than individual genes has many advantages. A variety of methods have been developed for assessing the enrichment of sets of genes with respect to differential expression. In this paper we provide a comparative study of four of these methods: Fisher's exact test, Gene Set Enrichment Analysis (GSEA), Random-Sets (RS), and Gene List Analysis with Prediction Accuracy (GLAPA). The first three methods use associative statistics, while the fourth uses predictive statistics. We first compare all four methods on simulated data sets to verify that Fisher's exact test is markedly worse than the other three approaches. We then validate the other three methods on seven real data sets with known genetic perturbations, and compare the methods on two cancer data sets where our a priori knowledge is limited. Results: The simulation study highlights that none of the three methods outperforms the others consistently. GSEA and RS are able to detect weak signals of deregulation, and they perform differently when genes in a gene set are both differentially up- and down-regulated. GLAPA is more conservative, and large differences between the two phenotypes are required for the method to detect differential deregulation in gene sets. This is because the enrichment statistic in GLAPA is prediction error, which is a stronger criterion than the classical two-sample statistics used in RS and GSEA. This was reflected in the analysis of the real data sets, where GSEA and RS were significant for particular gene sets while GLAPA was not, suggesting a small effect size. We find that the ranking of gene set enrichment induced by GLAPA is more similar to that of RS than of GSEA.
More importantly, the rankings of the three methods share significant overlap. Conclusion: The three methods considered in our study recover relevant gene sets known to be deregulated in the experimental conditions and pathologies analyzed. There are differences between the three methods, and GSEA seems to be the most consistent in finding enriched gene sets, although no method uniformly dominates over all data sets. Our analysis highlights the deep difference between associative and predictive methods for detecting enrichment, and the value of using both to better interpret the results of pathway analysis. We close with suggestions for users of gene set methods.
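The associative statistic behind Fisher's exact test reduces, for enrichment, to a one-sided hypergeometric tail probability. A minimal stdlib sketch follows; the gene counts are illustrative, not taken from the paper:

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """One-sided enrichment p-value: probability of observing at least k
    differentially expressed genes in a gene set of size K, when n of the
    N genes on the array are differentially expressed."""
    return sum(comb(n, i) * comb(N - n, K - i)
               for i in range(k, min(K, n) + 1)) / comb(N, K)

# Illustrative counts: 40 of a 50-gene set fall among the 300
# differentially expressed genes on a 6000-gene array.
p = hypergeom_pvalue(40, 50, 300, 6000)
```

With an expected overlap of only 50 * 300 / 6000 = 2.5 genes, an overlap of 40 yields a vanishingly small p-value, i.e. strong enrichment.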
BMC Bioinformatics | 2006
Nicola Ancona; Rosalia Maglietta; Ada Piepoli; Annarita D'Addabbo; Rosa Cotugno; Maria Savino; Sabino Liuni; Massimo Carella; Francesco Perri
Background: In this paper we present a method for the statistical assessment of cancer predictors that make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We propose to answer some questions relevant to the automatic diagnosis of cancer, such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways does accuracy depend on the adopted classification scheme? How many genes are correlated with the pathology, and how many are sufficient for accurate colon cancer classification? The method we propose answers these questions while avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. Results: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured by means of a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, with an error rate of e = 21% (p = 0.045). This remains constant even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with error rates of e = 19% (p = 0.035) and e = 18% (p = 0.037), respectively.
Moreover, the error rate decreases as the training set size increases, reaching its best performance with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019). Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology according to the signal-to-noise statistic. Moreover, the performance of the RLS and SVM classifiers does not change when 74% of the genes are used, and degrades only to e = 16% (p < 0.05) when as few as 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis, and the major roles they play in colorectal tumorigenesis, are discussed. Conclusions: The proposed method provides statistically significant answers to precise questions relevant to the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
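A permutation test of the kind described can be sketched as follows. The nearest-centroid rule below is a simple stand-in for the paper's WVA/RLS/SVM classifiers, and all data and parameters are invented for illustration:

```python
import random

def loo_error(X, y):
    """Leave-one-out error of a 1-D nearest-centroid classifier."""
    n = len(y)
    wrong = 0
    for i in range(n):
        pos = [X[j] for j in range(n) if j != i and y[j] == 1]
        neg = [X[j] for j in range(n) if j != i and y[j] == 0]
        pred = 1 if abs(X[i] - sum(pos) / len(pos)) < abs(X[i] - sum(neg) / len(neg)) else 0
        wrong += pred != y[i]
    return wrong / n

def permutation_pvalue(X, y, n_perm=200, seed=0):
    """Fraction of random label shufflings whose error is at most the
    observed one (with the usual +1 correction for the observed labeling)."""
    rng = random.Random(seed)
    observed = loo_error(X, y)
    hits = sum(loo_error(X, rng.sample(y, len(y))) <= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Illustrative, well-separated expression values for 10 + 10 specimens.
X = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
y = [0] * 10 + [1] * 10
```

A small p-value indicates that the observed error rate is unlikely under the null hypothesis of no association between expression values and class labels.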
BMC Bioinformatics | 2005
Nicola Ancona; Rosalia Maglietta; Annarita D'Addabbo; Sabino Liuni
Background: The advent of DNA microarray technology constitutes an epochal change in the classification and discovery of different types of cancer, because the information provided by DNA microarrays allows the problem of cancer analysis to be approached from a quantitative rather than qualitative point of view. Cancer classification requires well-founded mathematical methods able to predict the status of new specimens with high significance levels starting from a limited amount of data. In this paper we assess the performance of Regularized Least Squares (RLS) classifiers, originally proposed in regularization theory, by comparing them with Support Vector Machines (SVM), the state-of-the-art supervised learning technique for cancer classification from DNA microarray data. The performance of both approaches has also been investigated with respect to the number of selected genes and different gene selection strategies. Results: We show that RLS classifiers have performance comparable to that of SVM classifiers, as the Leave-One-Out (LOO) error evaluated on three different data sets shows. The main advantage of RLS machines is that, to solve a classification problem, they use a linear system of order equal to either the number of features or the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training. Conclusion: RLS classifiers are a valuable alternative to SVM classifiers for the problem of cancer classification from gene expression data, due to their simplicity and low computational complexity. Moreover, RLS classifiers show generalization ability comparable to that of SVM classifiers even when the classification of new specimens involves very few gene expression levels.
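The exact-LOO property of RLS has a well-known closed form for ridge regression: the leave-one-out residual equals the training residual divided by 1 - H_ii, where H is the hat matrix. A minimal single-feature sketch (linear kernel; variable names are ours, and the brute-force version is included only as a check):

```python
def rls_loo_residuals(x, y, lam):
    """Exact leave-one-out residuals of single-feature ridge regression,
    obtained from ONE fit via e_loo_i = e_i / (1 - H_ii)."""
    sxx = sum(xi * xi for xi in x)
    w = sum(xi * yi for xi, yi in zip(x, y)) / (sxx + lam)
    return [(yi - w * xi) / (1.0 - xi * xi / (sxx + lam))
            for xi, yi in zip(x, y)]

def rls_loo_residuals_naive(x, y, lam):
    """Brute-force check: refit n times with one point held out."""
    out = []
    for i in range(len(x)):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        w = sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + lam)
        out.append(y[i] - w * x[i])
    return out
```

The two functions agree to numerical precision, but the first requires a single fit rather than n refits, which is the computational advantage the abstract refers to.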
IEEE Transactions on Geoscience and Remote Sensing | 2016
Annarita D'Addabbo; Alberto Refice; Guido Pasquariello; Francesco P. Lovergine; Domenico Capolongo; Salvatore Manfreda
Accurate flood mapping is important both for planning activities during emergencies and as a support for the subsequent assessment of damaged areas. A valuable information source for such a procedure can be remote sensing synthetic aperture radar (SAR) imagery. However, flood scenarios are typical examples of complex situations in which different factors have to be considered to provide an accurate and robust interpretation of the situation on the ground. For this reason, a data fusion approach combining remote sensing data with ancillary information can be particularly useful. In this paper, a Bayesian network is proposed to integrate remotely sensed data, such as multitemporal SAR intensity images and interferometric-SAR coherence data, with geomorphic and other ground information. The methodology is tested on a case study regarding a flood that occurred in the Basilicata region (Italy) in December 2013, monitored using a time series of COSMO-SkyMed data. It is shown that the synergistic use of different information layers can help to more precisely detect the areas affected by the flood, reducing the false alarms and missed identifications which may affect algorithms based on data from a single source. The produced flood maps are compared to data obtained independently from the analysis of optical images; the comparison indicates that the proposed methodology is able to reliably follow the temporal evolution of the phenomenon, assigning high probability to the areas most likely to be flooded, in spite of their heterogeneous temporal SAR/InSAR signatures, and reaching accuracies of up to 89%.
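The probabilistic fusion idea can be illustrated, in drastically simplified form, as a naive-Bayes combination of conditionally independent evidence layers. This is a sketch of the general principle only, not the paper's actual network structure, and all probabilities are invented:

```python
def flood_posterior(prior, lik_flood, lik_dry):
    """P(flood | evidence) under naive-Bayes fusion of conditionally
    independent evidence layers (e.g. SAR intensity, InSAR coherence)."""
    pf, pd = prior, 1.0 - prior
    for lf, ld in zip(lik_flood, lik_dry):
        pf *= lf   # likelihood of this layer's observation given flood
        pd *= ld   # likelihood of the same observation given no flood
    return pf / (pf + pd)

# Invented numbers: low backscatter strongly favors open water,
# low interferometric coherence favors it mildly; flood prior is 10%.
p = flood_posterior(0.10, [0.8, 0.6], [0.2, 0.5])
```

Even with a low prior, two agreeing evidence layers can push the posterior well above the prior, which is the mechanism by which fusing layers reduces single-source false alarms.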
Artificial Intelligence in Medicine | 2007
Rosalia Maglietta; Annarita D'Addabbo; Ada Piepoli; Francesco Perri; Sabino Liuni; Nicola Ancona
MOTIVATIONS: One of the main problems in cancer diagnosis using DNA microarray data is selecting genes relevant to the pathology by analyzing their expression profiles in tissues in two different phenotypical conditions. The question we pose is the following: how do we measure the relevance of a single gene in a given pathology? METHODS: A gene is relevant for a particular disease if we are able to correctly predict the occurrence of the pathology in new patients on the basis of its expression level alone. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier able to generalize, that is, able to correctly predict the status of new patients. In this paper we present a selection-bias-free, statistically well-founded method for finding relevant genes on the basis of their classification ability. RESULTS: We applied the method to a colon cancer data set and produced a list of relevant genes, ranked on the basis of their prediction accuracy. Out of more than 6500 available genes, we found 54 overexpressed in normal tissues and 77 overexpressed in tumor tissues with prediction accuracy greater than 70% and p-value ≤ 0.05. CONCLUSIONS: The relevance of the selected genes was assessed (a) statistically, evaluating the p-value of the estimated prediction accuracy of each gene; (b) biologically, confirming the involvement of many genes in generic carcinogenic processes and in the colon in particular; (c) comparatively, verifying the presence of these genes in other studies on the same data set.
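Ranking genes by single-gene prediction accuracy can be sketched with a one-gene threshold rule. This is a crude stand-in for the paper's selection-bias-free procedure (no cross-validation or permutation p-values here), and the expression matrix is invented:

```python
def gene_accuracy(expr, labels):
    """Best training accuracy of a one-gene threshold rule, trying both
    orientations (over- or under-expressed in the positive class)."""
    n = len(labels)
    best = 0.0
    for t in expr:  # candidate thresholds: the observed expression values
        agree = sum((e > t) == bool(l) for e, l in zip(expr, labels)) / n
        best = max(best, agree, 1.0 - agree)
    return best

def rank_genes(matrix, labels):
    """Rank genes (rows of `matrix`) by single-gene accuracy, best first."""
    return sorted(((gene_accuracy(row, labels), i)
                   for i, row in enumerate(matrix)), reverse=True)
```

In the paper's setting, the accuracy of each gene would additionally be estimated on held-out data and tested against a permutation null to avoid the selection bias this toy version ignores.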
international geoscience and remote sensing symposium | 2004
Annarita D'Addabbo; Giuseppe Satalino; Guido Pasquariello; Palma Blonda
In this work, unsupervised change detection techniques based on three different ways of comparing images are presented. Two registered and corrected Landsat TM multi-spectral images, acquired over the same geographical area on 18 May 1996 and 21 May 1997, have been used. In the first comparison technique, for each pair of corresponding pixels, the spectral change vector has been computed as the squared difference between the feature vectors at the two times. In the second method, the difference image has been computed using, pixel by pixel, a chi-square transformation. The third technique is based on the application of a Self-Organizing Map (SOM) neural network to cluster the two images before comparison. The three resulting difference images have then been analyzed using a fully automatic thresholding method exploiting the expectation-maximization (EM) algorithm. The experimental results obtained for the three difference images are comparable, showing the robustness of the unsupervised approach, and only a few changes are detected in the analyzed scene. Moreover, the experimental results have been compared with a change detection map computed using a supervised technique, obtaining good agreement between unsupervised and supervised results, which confirms the reliability of the considered approach. The encouraging results allow the computed percentage of changes to be used as the probability of class transitions in input to a Bayesian supervised change detection method, as presented in a companion paper by the same authors. In this framework, the unsupervised approach may be used to support supervised techniques, providing land cover transitions that can be used as initial guess values.
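The first comparison technique (per-pixel squared difference of the feature vectors) and the subsequent thresholding step can be sketched as follows; the images are tiny invented examples, and a fixed threshold stands in for the EM-estimated one:

```python
def change_magnitude(img1, img2):
    """Per-pixel spectral change: squared Euclidean distance between
    the multi-spectral feature vectors at the two acquisition dates."""
    return [[sum((a - b) ** 2 for a, b in zip(p1, p2))
             for p1, p2 in zip(r1, r2)]
            for r1, r2 in zip(img1, img2)]

def threshold_map(diff, t):
    """Binary change map from a scalar threshold (the paper estimates
    this threshold automatically with the EM algorithm)."""
    return [[int(v > t) for v in row] for row in diff]

# A 1x2 image with 2 spectral bands per pixel at each date.
d = change_magnitude([[(1, 2), (3, 4)]], [[(1, 2), (5, 1)]])
```

Here the first pixel is unchanged (magnitude 0) while the second has magnitude (3-5)^2 + (4-1)^2 = 13 and is flagged as changed for any threshold below 13.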
Journal of Biomedical Informatics | 2010
Rosalia Maglietta; Angela Distaso; Ada Piepoli; Orazio Palumbo; Massimo Carella; Annarita D'Addabbo; Sayan Mukherjee; Nicola Ancona
One of the major problems in genomics and medicine is the identification of gene networks and pathways deregulated in complex and polygenic diseases, like cancer. In this paper, we address the problem of assessing the variability of the results of pathway analysis across different and independent genome-wide expression studies in which the same phenotypic conditions are assayed. To this end, we assessed the deregulation of 1891 curated gene sets in four independent gene expression data sets of subjects affected by colorectal cancer (CRC). In this comparison we used two well-founded statistical models for evaluating the deregulation of gene networks. We found that the results of pathway analysis in expression studies are highly reproducible. Our study revealed 53 pathways identified by the two methods in all four data sets analyzed, with high statistical significance and strong biological relevance to the pathology examined. This set of pathways, associated with single markers as well as with whole altered biological processes, constitutes a signature of the disease which sheds light on the genetic basis of CRC.
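At its core, the cross-study reproducibility check reduces to intersecting the sets of significant pathways across data sets. A minimal sketch with hypothetical pathway names (not the 53 pathways the paper reports):

```python
def reproducible_pathways(per_study_hits):
    """Pathways called significant in every study (set intersection)."""
    common = set(per_study_hits[0])
    for hits in per_study_hits[1:]:
        common &= set(hits)
    return sorted(common)

# Hypothetical significant-pathway lists from four independent studies.
hits = [["WNT", "TGFB", "P53", "MAPK"],
        ["WNT", "P53", "NOTCH"],
        ["WNT", "P53", "MAPK"],
        ["WNT", "P53", "TGFB"]]
```

Only the pathways flagged in every study survive, which is what makes the intersection a conservative reproducibility criterion.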
international geoscience and remote sensing symposium | 2002
Guido Pasquariello; N. Ancona; Palma Blonda; Cristina Tarantino; Giuseppe Satalino; Annarita D'Addabbo
This paper presents a comparative evaluation of a classification strategy based on combining the outputs of a neural network (NN) ensemble versus the application of Support Vector Machine (SVM) classifiers in the analysis of remotely sensed data. Two sets of experiments have been carried out on a benchmark data set. The first set concerns the application of linear and nonlinear techniques to the combination of the outputs of a Multilayer Perceptron (MLP) neural network ensemble. In particular, the Bayesian and the error correlation matrix approaches are used for coefficient selection in the linear combination of the networks' outputs. An MLP module is used for the nonlinear combination of the outputs. The results of the linear and nonlinear combination schemes are compared and discussed against the performance of SVM classifiers. The comparative analysis shows that the nonlinear, MLP-based combination provides the best results among the different combination schemes. On the other hand, better performance can be obtained with SVM classifiers. However, the complexity of the SVM training procedure can be considered a limitation for the application of SVMs to real-world problems.
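A linear combination of ensemble outputs can be sketched as follows. Weighting members by normalized validation accuracy is one simple coefficient-selection rule used here for illustration, not the Bayesian or error-correlation-matrix schemes the paper evaluates, and all scores are invented:

```python
def combine_ensemble(member_scores, val_accuracies):
    """Linearly combine per-class scores of ensemble members, weighting
    each member by its normalized validation accuracy; return the argmax."""
    total = sum(val_accuracies)
    weights = [a / total for a in val_accuracies]
    n_classes = len(member_scores[0])
    fused = [sum(w * s[c] for w, s in zip(weights, member_scores))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)

# Three MLPs voting over two land-cover classes (invented scores).
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
```

More accurate members pull the fused decision toward their prediction, which is the basic rationale behind accuracy-informed coefficient selection.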
BMC Bioinformatics | 2009
Teresa Maria Creanza; David S. Horner; Annarita D'Addabbo; Rosalia Maglietta; Flavio Mignone; Nicola Ancona
Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, making a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study, in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed, conditional on the compared species and on the sequence lengths. Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among the human, mouse and rat species. The association study showed that the analyzed features are statistically different in the two classes. In order to study the influence of sequence length on feature performance, a predictive study was performed on different data sets composed of coding and non-coding alignments in equal numbers and of equal length, with ascending average length. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained using comparative features generally outperformed classifiers based on intrinsic ones.
Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05). Conclusion: We observed distinct and consistent patterns for the individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely related to the fact that such features capture deviations from the strictly neutral evolution expected as a consequence of the characteristics of the genetic code.
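A linear discriminant classifier of the kind used in the prediction study can be sketched for two-dimensional features; the explicit 2x2 inverse keeps the example dependency-free, and the feature values are invented:

```python
def fisher_lda_2d(class0, class1):
    """Fisher's linear discriminant direction for 2-D features:
    w = Sw^{-1} (m1 - m0), with the 2x2 inverse written out."""
    def mean(pts):
        n = float(len(pts))
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
    def scatter(pts, m):
        return (sum((p[0] - m[0]) ** 2 for p in pts),
                sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts),
                sum((p[1] - m[1]) ** 2 for p in pts))
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sxx, sxy, syy = s0[0] + s1[0], s0[1] + s1[1], s0[2] + s1[2]
    det = sxx * syy - sxy * sxy  # within-class scatter determinant
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
```

New elements are classified by projecting their feature vector onto w and thresholding, e.g. at the midpoint of the projected class means.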
international geoscience and remote sensing symposium | 2000
Andrea Baraldi; Palma Blonda; Giuseppe Satalino; Annarita D'Addabbo; Cristina Tarantino
Radial basis function (RBF) classifiers, which consist of a hidden layer and an output layer, are traditionally trained with a two-stage hybrid learning approach. This approach combines an unsupervised (data-driven) first stage, which adapts the RBF hidden layer parameters, with a supervised (error-driven) second stage, which learns the RBF output weights. Several simple strategies that exploit labeled data in the adaptation of the centers and spread parameters of the RBF hidden units may be pursued. Some of these strategies have been shown to reduce the traditional weaknesses of RBF classification while maintaining its typical advantages. In the field of remotely sensed image classification, the authors compare a traditional RBF two-stage hybrid learning procedure with an RBF two-stage learning technique that exploits labeled data to adapt the hidden unit parameters.
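The traditional two-stage hybrid scheme can be sketched in one dimension: an unsupervised k-means pass places the centers, then an error-driven LMS pass fits the output weights. The spread parameter is fixed here for brevity, and all values are illustrative:

```python
import math
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Stage 1 (unsupervised, data-driven): place the RBF centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: (p - centers[j]) ** 2)].append(p)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def train_rbf(x, y, k=2, sigma=1.0, lr=0.1, epochs=200):
    """Stage 2 (supervised, error-driven): LMS fit of the output weights."""
    centers = kmeans_1d(x, k)
    phi = [[math.exp(-(xi - c) ** 2 / (2.0 * sigma ** 2)) for c in centers]
           for xi in x]
    w = [0.0] * k
    for _ in range(epochs):
        for feats, target in zip(phi, y):
            err = target - sum(wi * f for wi, f in zip(w, feats))
            for j in range(k):
                w[j] += lr * err * feats[j]
    return centers, w

def rbf_output(centers, w, sigma, xi):
    return sum(wi * math.exp(-(xi - c) ** 2 / (2.0 * sigma ** 2))
               for wi, c in zip(w, centers))
```

The labeled-data variants compared in the paper would additionally use the class labels inside stage 1, e.g. to refine the centers and spreads, rather than only in the output-weight fit.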