Emanuel Weitschek
National Research Council
Publication
Featured research published by Emanuel Weitschek.
PLOS ONE | 2012
Robin van Velzen; Emanuel Weitschek; Giovanni Felici; Freek T. Bakker
Recently diverged species are challenging for identification, yet they are frequently of special interest scientifically as well as from a regulatory perspective. DNA barcoding has proven instrumental in species identification, especially in insects and vertebrates, but for the identification of recently diverged species it has been reported to be problematic in some cases. Problems are mostly due to incomplete lineage sorting or simply lack of a ‘barcode gap’ and are probably related to large effective population size and/or low mutation rate. Our objective was to compare six methods in their ability to correctly identify recently diverged species with DNA barcodes: neighbor joining and parsimony (both tree-based), nearest neighbor and BLAST (similarity-based), and the diagnostic methods DNA-BAR and BLOG. We analyzed simulated data assuming three different effective population sizes as well as three selected empirical data sets from published studies. Results show, as expected, that success rates are significantly lower for recently diverged species (∼75%) than for older species (∼97%) (P<0.00001). Similarity-based and diagnostic methods significantly outperform tree-based methods when applied to simulated DNA barcode data (P<0.00001). The diagnostic method BLOG had the highest correct query identification rate based on simulated (86.2%) as well as empirical data (93.1%), indicating that it is a consistently better method overall. Another advantage of BLOG is that it offers species-level information that can be used outside the realm of DNA barcoding, for instance in species description or molecular detection assays. Even though we can confirm that identification success based on DNA barcoding is generally high in our data, recently diverged species remain difficult to identify. Nevertheless, our results contribute to improved solutions for their accurate identification.
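As an illustration of the similarity-based family mentioned in the abstract above, here is a minimal Python sketch of nearest-neighbor identification. It is not the code used in the study: the uncorrected p-distance, the function names, and the toy sequences are all assumptions made for the example, and ties are resolved by taking the first closest reference.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_neighbor_species(query, reference):
    """Return the species label of the reference barcode closest to the query.

    `reference` is a list of (species, sequence) pairs; on ties the first
    closest entry wins (an arbitrary choice made for this sketch).
    """
    species, _ = min(reference, key=lambda item: p_distance(query, item[1]))
    return species


# Toy, made-up aligned barcodes (real COI barcodes are far longer).
reference = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACGTTC"),
]
query = "ACGAACGTTT"
print(nearest_neighbor_species(query, reference))
```

With recently diverged species the closest reference may belong to a sister species, which is one way the lower success rates reported above can arise.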
BMC Bioinformatics | 2009
Paola Bertolazzi; Giovanni Felici; Emanuel Weitschek
Background: According to many field experts, specimen classification based on morphological keys needs to be supported with automated techniques based on the analysis of DNA fragments. The most successful results in this area are those obtained from a particular fragment of mitochondrial DNA, the gene cytochrome c oxidase I (COI) (the barcode). Since 2004 the Consortium for the Barcode of Life (CBOL) has promoted the collection of barcode specimens and the development of methods to analyze the barcode for several tasks, among which is the identification of rules to correctly classify an individual into its species by reading its barcode. Results: We adopt a Logic Mining method based on two optimization models and present the results obtained on two datasets where a number of COI fragments are used to describe the individuals that belong to different species. The proposed method exhibits high correct recognition rates on a training-testing split of the available data using a small proportion of the information available (e.g., correct recognition of approx. 97% when only 20 sites of the 648 available are used). The method is able to provide compact formulas on the values (A, C, G, T) at the selected sites that synthesize the characteristics of each species, which is relevant information for taxonomists. Conclusion: We have presented a Logic Mining technique designed to analyze barcode data and to provide detailed output of interest to the taxonomists and the barcode community represented in the CBOL Consortium. The method has proven to be effective, efficient and precise.
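To make the idea of compact formulas over nucleotide values at selected sites concrete, the following Python sketch evaluates such formulas on a barcode. The representation (a disjunction of conjunctions of site/nucleotide literals), the 0-based site indices, and the example formulas are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical formulas: each species maps to a list of clauses (joined by OR);
# each clause is a dict {site_index: nucleotide} whose literals are joined by AND.
FORMULAS = {
    "species_A": [{2: "G", 5: "T"}],
    "species_B": [{2: "A"}, {5: "C", 7: "G"}],
}


def clause_holds(clause, barcode):
    """True if the barcode carries the required nucleotide at every listed site."""
    return all(barcode[site] == nucleotide for site, nucleotide in clause.items())


def classify(barcode, formulas):
    """Return every species whose formula is satisfied by the barcode."""
    return [
        species
        for species, clauses in formulas.items()
        if any(clause_holds(clause, barcode) for clause in clauses)
    ]


print(classify("ACGTATGA", FORMULAS))
```

Only the sites named in the formulas are read, which mirrors the abstract's point that a small number of sites can be enough for recognition.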
Journal of Alzheimer's Disease | 2011
Ivan Arisi; Mara D'Onofrio; Rossella Brandi; Armando Felsani; Simona Capsoni; Guido Drovandi; Giovanni Felici; Emanuel Weitschek; Paola Bertolazzi; Antonino Cattaneo
The identification of early and stage-specific biomarkers for Alzheimer's disease (AD) is critical, as the development of disease-modification therapies may depend on the discovery and validation of such markers. The identification of early reliable biomarkers depends on the development of new diagnostic algorithms to computationally exploit the information in large biological datasets. To identify potential biomarkers from mRNA expression profile data, we used the Logic Mining method for the unbiased analysis of a large microarray expression dataset from the anti-NGF AD11 transgenic mouse model. The gene expression profile of AD11 brain regions was investigated at different neurodegeneration stages by whole genome microarrays. A new implementation of the Logic Mining method was applied both to early (1-3 months) and late stage (6-15 months) expression data, coupled to standard statistical methods. A small number of fingerprinting formulas was isolated, encompassing mRNAs whose expression levels were able to discriminate between diseased and control mice. We selected three differential signature genes specific for the early stage (Nudt19, Arl16, Aph1b), five common to both groups (Slc15a2, Agpat5, Sox2ot, 2210015D19Rik, Wdfy1), and seven specific for late stage (D14Ertd449, Tia1, Txnl4, 1810014B01Rik, Snhg3, Actl6a, Rnf25). We suggest these genes as potential biomarkers for the early and late stage of AD-like neurodegeneration in this model and conclude that Logic Mining is a powerful and reliable approach for large scale expression data analysis. Its application to large expression datasets from brain or peripheral human samples may facilitate the discovery of early and stage-specific AD biomarkers.
Molecular Ecology Resources | 2013
Emanuel Weitschek; Robin van Velzen; Giovanni Felici; Paola Bertolazzi
BLOG (Barcoding with LOGic) is a diagnostic and character‐based DNA Barcode analysis method. Its aim is to classify specimens to species based on DNA Barcode sequences and on a supervised machine learning approach, using classification rules that compactly characterize species in terms of DNA Barcode locations of key diagnostic nucleotides. The BLOG 2.0 software, its fundamental modules, online/offline user interfaces and recent improvements are described. These improvements affect both methodology and software design, and lead to the availability of different releases on the website http://dmb.iasi.cnr.it/blog-downloads.php. Previous and new experimental tests show that BLOG 2.0 outperforms previous versions as well as other DNA Barcode analysis methods.
Biodata Mining | 2014
Emanuel Weitschek; Giulia Fiscon; Giovanni Felici
Background: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with those of ad-hoc and well-established DNA Barcode classification methods. Results: Software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performance with respect to the traditional DNA Barcode classification methods. On empirical data their classification performance is at a comparable level to the other methods. Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performance. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
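The abstract above mentions a converter from FASTA barcodes to the Weka input format. The sketch below shows what such a conversion to ARFF could look like in Python; it is not the released tool. It assumes aligned sequences of equal length, a hypothetical header layout `>specimen_id|species_name` with no spaces in species names, and it maps any character outside A, C, G, T to the ARFF missing-value marker `?`.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_arff(text, relation="barcodes"):
    """Build ARFF text with one nominal attribute per site plus a class attribute."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to the same length")
    # Assumed header layout: specimen_id|species_name (hypothetical convention).
    labels = [header.split("|")[1] for header, _ in records]
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute pos{i + 1} {{A,C,G,T}}" for i in range(length)]
    lines.append("@attribute class {" + ",".join(sorted(set(labels))) + "}")
    lines += ["", "@data"]
    for (_, seq), label in zip(records, labels):
        values = [c if c in "ACGT" else "?" for c in seq]  # "?" = missing in ARFF
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


example = ">s1|Homo_sapiens\nACGT\n>s2|Pan_troglodytes\nACNT\n"
print(fasta_to_arff(example))
```

Treating each site as a nominal attribute is what lets rule-based classifiers report species-specific positions and nucleotide assignments.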
European Journal of Operational Research | 2016
Paola Bertolazzi; Giovanni Felici; Paola Festa; Giulia Fiscon; Emanuel Weitschek
Feature selection methods are used in machine learning and data analysis to select a subset of features that may be successfully used in the construction of a model for the data. These methods are applied under the assumption that many of the available features are often redundant for the purpose of the analysis. In this paper, we focus on a particular method for feature selection in supervised learning problems, based on a linear programming model with integer variables. For the solution of the optimization problem associated with this approach, we propose a novel robust metaheuristic algorithm that relies on a Greedy Randomized Adaptive Search Procedure, extended with the adoption of short memory and a local search strategy. The performance of our heuristic algorithm compares favorably with that of well-established feature selection methods, both on simulated and real data from biological applications. The obtained results suggest that our method is particularly suited for problems with a very large number of binary or categorical features.
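The general shape of a Greedy Randomized Adaptive Search Procedure for feature selection can be sketched as follows. This is not the paper's algorithm: the abstract does not state the integer model, so the sketch assumes a set-covering formulation (every pair of samples from different classes must differ on at least one selected feature), and it omits the short-memory extension. All names and parameters are illustrative.

```python
import random
from itertools import combinations


def pairs_to_cover(X, y):
    """Index pairs of samples that carry different class labels."""
    return [(i, j) for i, j in combinations(range(len(X)), 2) if y[i] != y[j]]


def covers(X, feature, pair):
    """True if the two samples of the pair differ on the given feature."""
    i, j = pair
    return X[i][feature] != X[j][feature]


def construct(X, pairs, n_features, alpha, rng):
    """Greedy randomized construction using a restricted candidate list (RCL)."""
    selected, uncovered = set(), set(pairs)
    while uncovered:
        gains = {f: sum(covers(X, f, p) for p in uncovered)
                 for f in range(n_features) if f not in selected}
        best = max(gains.values(), default=0)
        if best == 0:
            raise ValueError("some pairs cannot be separated by any feature")
        rcl = [f for f, gain in gains.items() if gain >= alpha * best]
        chosen = rng.choice(rcl)
        selected.add(chosen)
        uncovered = {p for p in uncovered if not covers(X, chosen, p)}
    return selected


def local_search(X, pairs, selected):
    """Drop features whose removal still leaves every pair separated."""
    selected = set(selected)
    for f in sorted(selected):
        trial = selected - {f}
        if all(any(covers(X, g, p) for g in trial) for p in pairs):
            selected = trial
    return selected


def grasp(X, y, iterations=20, alpha=0.7, seed=0):
    """Repeat construction plus local search and keep the smallest feature set."""
    rng = random.Random(seed)
    pairs = pairs_to_cover(X, y)
    best = None
    for _ in range(iterations):
        candidate = local_search(X, pairs, construct(X, pairs, len(X[0]), alpha, rng))
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


# Toy binary data: features 0 and 2 each separate the two classes on their own.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
y = [0, 0, 1, 1]
print(grasp(X, y))
```

The parameter `alpha` controls how greedy the construction is: values near 1 admit only the best candidates, while smaller values add randomness and diversify the solutions explored across iterations.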
Genomics | 2014
Dimitris Polychronopoulos; Emanuel Weitschek; Slavica Dimitrieva; Philipp Bucher; Giovanni Felici; Yannis Almirantis
Little work has been done on the analysis of the composition of conserved non-coding elements (CNEs) that are identified by comparisons of two or more genomes and are found to exist in all metazoan genomes. Here we present the analysis of CNEs with a methodology that takes into account word occurrence at various length scales in the form of feature vector representation and rule-based classifiers. We implement our approach on both protein-coding exons and CNEs, originating from human, insect (Drosophila melanogaster) and worm (Caenorhabditis elegans) genomes, that are either identified in the present study or obtained from the literature. Alignment-free feature vector representation of sequences combined with rule-based classification methods leads to successful classification of the different CNE classes. Biologically meaningful results are derived by comparison with the genomic signatures approach, and classification rates for a variety of functional elements of the genomes along with surrogates are presented.
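An alignment-free feature vector built from word occurrences at several length scales can be sketched in Python as below. The abstract does not specify the word lengths or the normalization, so the choice of k = 1, 2, 3 and of relative frequencies is an assumption made for this example.

```python
from itertools import product


def kmer_vector(seq, k):
    """Relative frequency of every DNA word of length k, in lexicographic order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def feature_vector(seq, ks=(1, 2, 3)):
    """Concatenate the word-frequency vectors for several word lengths."""
    vector = []
    for k in ks:
        vector.extend(kmer_vector(seq.upper(), k))
    return vector


vec = feature_vector("ACGTACGTTTGA")
print(len(vec))  # 4 + 16 + 64 entries
```

Because every sequence maps to a vector of the same length regardless of its own length, sequences from different genomes can be passed to a rule-based classifier without first being aligned.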
Bioinformatics | 2016
Valerio Cestarelli; Giulia Fiscon; Giovanni Felici; Paola Bertolazzi; Emanuel Weitschek
Motivation: Knowledge extraction methods for Next Generation Sequencing data are currently in high demand. In this work, we focus on RNA-seq gene expression analysis and specifically on case–control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit a greater amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. Results: We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and performs again the classification procedure until a stopping criterion is verified. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. Availability and implementation: dmb.iasi.cnr.it/camur.php Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
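The iterative procedure described in the Results can be loosely sketched in Python as follows. This is not the CAMUR implementation: the `train` callable, the breadth-first order in which reduced feature sets are tried, and the `max_models` stopping criterion are assumptions introduced for the example.

```python
from itertools import chain, combinations


def nonempty_subsets(genes):
    """All non-empty subsets of the genes (the power set minus the empty set)."""
    genes = sorted(genes)
    return chain.from_iterable(
        combinations(genes, r) for r in range(1, len(genes) + 1)
    )


def extract_models(features, train, max_models=10):
    """Collect distinct models by retraining on progressively reduced feature sets.

    `train(available)` is a user-supplied, hypothetical callable returning the
    set of genes used by the learned rules, or an empty set if no acceptable
    model can be built from the available features.
    """
    start = frozenset(features)
    models, queue, seen = [], [start], {start}
    while queue and len(models) < max_models:
        available = queue.pop(0)
        used = frozenset(train(available))
        if not used:
            continue
        if used not in models:
            models.append(used)
        # Remove each combination of the genes just used and try again later.
        for subset in nonempty_subsets(used):
            reduced = available - set(subset)
            if reduced and reduced not in seen:
                seen.add(reduced)
                queue.append(reduced)
    return models


def toy_train(available):
    """Stand-in learner: 'uses' the two alphabetically first genes, if present."""
    return set(sorted(available)[:2]) if len(available) >= 2 else set()


print(extract_models({"g1", "g2", "g3"}, toy_train))
```

In a real setting `train` would wrap a rule-based classifier together with an accuracy threshold, so that only models of comparable quality count as equivalent.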
Database and Expert Systems Applications | 2013
Emanuel Weitschek; Giovanni Felici; Paola Bertolazzi
The widespread adoption of electronic data collection in medical environments leads to an exponential growth of clinical data extracted from heterogeneous patient samples. Collecting, managing, integrating and analyzing these data are essential activities for shedding light on diseases and related therapies. The major issues in clinical data analysis are incompleteness (missing values), the different measurement scales adopted, and the integration of disparate collection procedures. The main challenges therefore lie in managing clinical data, discovering patient interactions, and integrating the different data sources. The final goal is to extract relevant information from huge amounts of clinical data. The analysis of clinical data thus requires new effective and efficient methods to extract compact and relevant information: the interdisciplinary field of data mining, which guides the automated knowledge discovery process, is a natural way to approach the complex task of clinical data analysis. Data mining deals with both structured and unstructured data, that is, data for which a model can or cannot be given, respectively. For example, in clinical contexts it is important to highlight those trials (variables) that are frequent in a particular disease diagnosis. The objective of this work is to study and apply methods to manage and retrieve relevant information in clinical data sets. A practical analysis of real patient data collected from several dementia clinical departments in Italy is reported as an example of clinical data mining. The particular field of logic classification, where a data model is computed in the form of propositional logic formulas, is investigated for clinical data mining and compared to other techniques, showing that it is a successful approach to computing a compact data model for clinical knowledge discovery.
Database and Expert Systems Applications | 2012
Emanuel Weitschek; Giovanni Felici; Paola Bertolazzi
Microarray Logic Analyzer (MALA) is a clustering and classification software package, engineered specifically for microarray gene expression analysis. The aims of MALA are to cluster the microarray gene expression profiles in order to reduce the amount of data to be analyzed and to classify the microarray experiments. To fulfil this objective, MALA uses a machine learning methodology that relies on 1) Discretization, 2) Gene clustering, 3) Feature selection, 4) Formulas computation, 5) Classification. In this paper we describe the methodology, the software design, the different releases and user interfaces of MALA. We also emphasize its strengths: the identification of classification formulas that are able to describe the different classes of the microarray experiments precisely and compactly. Finally, we show the experimental results obtained on a real microarray data set from microarray probes of Alzheimer-diseased versus control mice, and conclude that MALA is a powerful and reliable software tool for microarray gene expression analysis.