Giulia Fiscon
National Research Council
Publication
Featured research published by Giulia Fiscon.
BioData Mining | 2014
Emanuel Weitschek; Giulia Fiscon; Giovanni Felici
Background: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results: Software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human-interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
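The abstract above mentions a converter from FASTA to the Weka input format. As a rough illustration only, and not the released tool, the following sketch turns aligned FASTA records into ARFF text with one nominal attribute per alignment position. The header convention `>specimen_id|species`, the function names, and the requirement that labels and symbols contain no spaces or ARFF-special characters are all assumptions made for this example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_arff(text, relation="barcode"):
    """Build ARFF text with one nominal attribute per sequence position.

    Assumption: headers look like ">specimen_id|species" and all
    sequences are already aligned to the same length.
    """
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to equal length")

    labels = [header.split("|")[-1].strip() for header, _ in records]
    species = sorted(set(labels))
    symbols = sorted({char for _, seq in records for char in seq})

    lines = ["@relation " + relation]
    for position in range(1, length + 1):
        lines.append("@attribute pos%d {%s}" % (position, ",".join(symbols)))
    lines.append("@attribute species {%s}" % ",".join(species))
    lines.append("@data")
    for (_, seq), label in zip(records, labels):
        # One comma-separated nucleotide per position, class label last.
        lines.append(",".join(seq) + "," + label)
    return "\n".join(lines)
```

Calling `fasta_to_arff(open("barcodes.fasta").read())` on a hypothetical aligned file would give text that can be saved with an `.arff` extension and loaded by a classifier suite that reads that format.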
PLOS ONE | 2017
Federica Conte; Giulia Fiscon; Matteo Chiara; Teresa Colombo; Lorenzo Farina; Paola Paci; Turgay Unver
Recent findings have identified competing endogenous RNAs (ceRNAs) as the drivers in many disease conditions, including cancers. The ceRNAs indirectly regulate each other by reducing the amount of microRNAs (miRNAs) available to target messenger RNAs (mRNAs). The ceRNA interactions mediated by miRNAs are modulated by a titration mechanism, i.e. large changes in the ceRNA expression levels either overcome, or relieve, the miRNA repression on competing RNAs; similarly, a very large miRNA overexpression may abolish competition. The ceRNAs are also called miRNA “decoys” or miRNA “sponges” and encompass different RNAs competing with each other to attract miRNAs for interactions: mRNA, long non-coding RNAs (lncRNAs), pseudogenes, or circular RNAs. Recently, we developed a computational method for identifying ceRNA-ceRNA interactions in breast invasive carcinoma. We were interested in unveiling which lncRNAs could exert the ceRNA activity. We found a drastic rewiring in the cross-talks between ceRNAs from the physiological to the pathological condition. The main actor of this dysregulated lncRNA-associated ceRNA network was the lncRNA PVT1, which revealed a net binding preference towards the miR-200 family members in normal breast tissues. Despite its up-regulation in breast cancer tissues, mimicked by the miR-200 family members, PVT1 stops working as ceRNA in the cancerous state. The specific conditions required for a ceRNA landscape to occur are still far from being determined. Here, we emphasized the importance of the relative concentration of the ceRNAs, and their related miRNAs. In particular, we focused on the withdrawal in breast cancer tissues of the PVT1 ceRNA activity and performed a gene expression and sequence analysis of its multiple isoforms. We found that the PVT1 isoform harbouring the binding site for a representative miRNA of the miR-200 family shows a drastic decrease in its relative concentration with respect to the miRNA abundance in breast cancer tissues, providing a plausibility argument to the breakdown of the sponge program orchestrated by the oncogene PVT1.
European Journal of Operational Research | 2016
Paola Bertolazzi; Giovanni Felici; Paola Festa; Giulia Fiscon; Emanuel Weitschek
Feature selection methods are used in machine learning and data analysis to select a subset of features that may be successfully used in the construction of a model for the data. These methods are applied under the assumption that often many of the available features are redundant for the purpose of the analysis. In this paper, we focus on a particular method for feature selection in supervised learning problems, based on a linear programming model with integer variables. For the solution of the optimization problem associated with this approach, we propose a novel robust metaheuristics algorithm that relies on a Greedy Randomized Adaptive Search Procedure, extended with the adoption of short memory and a local search strategy. The performances of our heuristic algorithm are successfully compared with those of well-established feature selection methods, both on simulated and real data from biological applications. The obtained results suggest that our method is particularly suited for problems with a very large number of binary or categorical features.
Bioinformatics | 2016
Valerio Cestarelli; Giulia Fiscon; Giovanni Felici; Paola Bertolazzi; Emanuel Weitschek
Motivation: Nowadays, methods for extracting knowledge from Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case–control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. Results: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and performs the classification procedure again until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. Availability and implementation: dmb.iasi.cnr.it/camur.php Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
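The iterative scheme described in this abstract (learn a model, enumerate subsets of the genes it uses, remove them, learn again) can be sketched in a simplified form. The code below is an illustration under stated assumptions, not CAMUR itself: the `learn` callback stands in for any rule-based classifier and is hypothetical, as are the names `extract_models`, `min_score`, and `max_models`, and the stopping criterion used here (a score threshold plus a cap on the number of models).

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items (the power set minus the empty set)."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def extract_models(features, learn, min_score=0.8, max_models=50):
    """Collect alternative models by repeatedly hiding feature combinations.

    learn(available) is a hypothetical callback returning a pair
    (features_used_by_the_model, score) for the given available features.
    """
    all_features = frozenset(features)
    models, seen = [], set()
    queue = [frozenset()]          # each entry is a set of hidden features
    tried = {frozenset()}
    while queue and len(models) < max_models:
        hidden = queue.pop(0)
        available = all_features - hidden
        if not available:
            continue
        used, score = learn(available)
        used = frozenset(used)
        if not used or score < min_score:
            continue               # model too weak: do not expand this branch
        if used not in seen:
            seen.add(used)
            models.append((used, score))
        # Hide each combination of the used features and try again later.
        for subset in nonempty_subsets(used):
            candidate = hidden | subset
            if candidate not in tried:
                tried.add(candidate)
                queue.append(candidate)
    return models
```

Because the number of subsets grows exponentially with the number of genes in a model, a practical implementation would need tighter limits than this sketch applies.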
BMC Bioinformatics | 2017
Fabio Cumbo; Giulia Fiscon; Stefano Ceri; Marco Masseroli; Emanuel Weitschek
Background: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types. Results: We propose TCGA2BED, a software tool to search and retrieve TCGA data, and convert them into the structured BED format for their seamless use and integration. Additionally, it supports the conversion into CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1,V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen metadata in attribute-value text format. Conclusions: The availability of the valuable TCGA data in BED format reduces the time spent in taking advantage of them: it is possible to efficiently and effectively deal with huge amounts of cancer genomic data integratively, and to search, retrieve and extend them with additional information. The BED format helps investigators by enabling several knowledge discovery analyses on all tumor types in TCGA with the final aim of understanding pathological mechanisms and aiding cancer treatments.
BMC Research Notes | 2014
Emanuel Weitschek; Daniele Santoni; Giulia Fiscon; Maria Cristina De Cola; Paola Bertolazzi; Giovanni Felici
Background: Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis. Methods: We propose a method to evaluate the similarity between reads. This method does not rely on the alignment of the reads and it is based on the distance between the frequencies of their substrings of fixed dimensions (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and Blast. The comparison is based on a simple assumption: the most correct distance is obtained by knowing in advance the reference sequence. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how the alignment-free and the alignment-based distances reproduce this ideal distance. The ability of correctly reproducing the ideal distance is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples. Results: We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time consuming distances based on alignment. Conclusions: Alignment-free distances may be used effectively for reads comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, reads classification).
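To make the idea of a k-mer frequency distance concrete, here is a minimal sketch. The abstract does not state which distance is applied to the frequency vectors, so the Euclidean distance on relative frequencies used below is an assumption, as are the function names and the default `k=3`.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read, k):
    """Return the relative frequency of every k-mer occurring in read."""
    total = len(read) - k + 1
    if total <= 0:
        raise ValueError("read is shorter than k")
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def kmer_distance(read_a, read_b, k=3):
    """Euclidean distance between the k-mer frequency profiles of two reads.

    No alignment is performed: only the k-mer composition is compared.
    """
    freq_a = kmer_frequencies(read_a, k)
    freq_b = kmer_frequencies(read_b, k)
    kmers = freq_a.keys() | freq_b.keys()
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2 for m in kmers))
```

Each profile is built in a single pass over the read, which is where the speed advantage over pairwise alignment comes from.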
Scientific Reports | 2017
Paola Paci; Teresa Colombo; Giulia Fiscon; Aymone Gurtner; Giulio Pavesi; Lorenzo Farina
SWItchMiner (SWIM) is a wizard-like software implementation of a procedure, previously described, able to extract information contained in complex networks. Specifically, SWIM allows unearthing the existence of a new class of hubs, called “fight-club hubs”, characterized by a marked negative correlation with their first nearest neighbors. Among them, a special subset of genes, called “switch genes”, appears to be characterized by an unusual pattern of intra- and inter-module connections that confers them a crucial topological role, interestingly mirrored by the evidence of their clinic-biological relevance. Here, we applied SWIM to a large panel of cancer datasets from The Cancer Genome Atlas, in order to highlight switch genes that could be critically associated with the drastic changes in the physiological state of cells or tissues induced by the cancer development. We discovered that switch genes are found in all cancers we studied and they encompass protein coding genes and non-coding RNAs, recovering many known key cancer players but also many new potential biomarkers not yet characterized in cancer context. Furthermore, SWIM is amenable to detect switch genes in different organisms and cell conditions, with the potential to uncover important players in biologically relevant scenarios, including but not limited to human cancer.
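The notion of a hub that is negatively correlated with its first nearest neighbors can be illustrated with a small computation. This is only a sketch of that single ingredient, not the SWIM procedure: the function names, the dictionary-based network representation, and the use of the plain mean of Pearson coefficients are assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_neighbor_correlation(node, neighbors, expression):
    """Average correlation between a node and its first nearest neighbors.

    neighbors maps a node to its adjacent nodes; expression maps a node to
    its expression profile across samples. A strongly negative value is the
    signature the abstract associates with "fight-club hubs".
    """
    values = [pearson(expression[node], expression[other]) for other in neighbors[node]]
    return sum(values) / len(values)
```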
Computational Intelligence and Data Mining | 2014
Giulia Fiscon; Emanuel Weitschek; Giovanni Felici; Paola Bertolazzi; Simona De Salvo; Placido Bramanti; Maria Cristina De Cola
Alzheimer's Disease (AD) and its preliminary stage - Mild Cognitive Impairment (MCI) - are the most widespread neurodegenerative disorders, and their investigation remains an open challenge. ElectroEncephalography (EEG) appears as a non-invasive and repeatable technique to diagnose brain abnormalities. Despite technical advances, the analysis of EEG spectra is usually carried out by experts who must manually perform laborious interpretations. Computational methods may lead to a quantitative analysis of these signals and hence to a characterization of EEG time series. The aim of this work is to achieve an automatic patient classification from the EEG biomedical signals involved in AD and MCI in order to support medical doctors in formulating the right diagnosis. The analysis of the biological EEG signals requires effective and efficient computer science methods to extract relevant information. Data mining, which guides the automated knowledge discovery process, is a natural way to approach EEG data analysis. Specifically, in our work we apply the following analysis steps: (i) pre-processing of EEG data; (ii) processing of the EEG signals by the application of time-frequency transforms; and (iii) classification by means of machine learning methods. We obtain promising results from the classification of AD, MCI, and control samples that can assist the medical doctors in identifying the pathology.
BioData Mining | 2016
Giulia Fiscon; Emanuel Weitschek; Eleonora Cella; Alessandra Lo Presti; Marta Giovanetti; Muhammed Babakir-Mina; Marco Ciotti; Massimo Ciccozzi; Alessandra Pierangeli; Paola Bertolazzi; Giovanni Felici
Background: Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods. Results: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. Conclusions: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions.
Database and Expert Systems Applications | 2015
Emanuel Weitschek; Giulia Fiscon; Giovanni Felici; Paola Bertolazzi
Leveraging advances in transcriptome profiling technologies (RNA-seq), biomedical scientists are collecting ever-increasing gene expression profiles data with low cost and high throughput. Therefore, automatic knowledge extraction methods are becoming essential to manage them. In this work, we present GELA (Gene Expression Logic Analyzer), a novel pipeline able to perform a knowledge discovery process in gene expression profiles data of RNA-seq. Firstly, we introduce the RNA-seq technologies, then, we illustrate our gene expression profiles data analysis method (including normalization, clustering, and classification), and finally, we test our knowledge extraction algorithm on the public RNA-seq data sets of Breast Cancer and Stomach Cancer, and on the public microarray data sets of Psoriasis and Multiple Sclerosis, obtaining in both cases promising results.