Publication


Featured research published by Stefano Merler.


BMC Bioinformatics | 2003

Entropy-based gene ranking without selection bias for the predictive classification of microarray data.

Cesare Furlanello; Maria Serafini; Stefano Merler; Giuseppe Jurman

Background: We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process).
Results: With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles.
Conclusions: Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.


Bioinformatics | 2008

Algebraic stability indicators for ranked lists in molecular profiling

Giuseppe Jurman; Stefano Merler; Annalisa Barla; Silvano Paoli; Antonio Galea; Cesare Furlanello

Motivation: We propose a method for studying the stability of biomarker lists obtained from functional genomics studies. It is common to adopt resampling methods to tune and evaluate marker-based diagnostic and prognostic systems in order to prevent selection bias. Such caution promotes honest estimation of class prediction, but leads to alternative sets of solutions. In microarray studies, the difference in lists may be bewildering, also due to the presence of modules of functionally related genes. Methods for assessing stability help to understand the dependency of the markers on the data or on the predictor's type, and help in selecting solutions.
Results: A computational framework for comparing sets of ranked biomarker lists is presented. Notions and algorithms are based on concepts from permutation group theory. We introduce several algebraic indicators and metric methods for symmetric groups, including the Canberra distance, a weighted version of Spearman's footrule. We also consider distances between partial lists and an aggregation of sets of lists into an optimal list based on voting theory (Borda count). The stability indicators are applied in practical situations to several synthetic, cancer microarray and proteomics datasets. The addressed issues are predictive classification, presence of modules, comparison of alternative biomarker lists, outlier removal, control of selection bias by randomization techniques and enrichment analysis.
Availability: Supplementary Material and software are available at http://biodcv.fbk.eu/listspy.html
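As a rough illustration of two ingredients named in this abstract, the sketch below computes the Canberra distance between two complete ranked lists and aggregates several lists with a basic Borda count. The function names, the toy gene labels, and the tie-breaking rule are assumptions made for the example; the paper's own indicators (including its handling of partial lists) may differ.

```python
from collections import defaultdict


def to_ranks(ordered_items, universe):
    """Map an ordered list (best first) to 1-based ranks, in `universe` order."""
    position = {item: i + 1 for i, item in enumerate(ordered_items)}
    return [position[item] for item in universe]


def canberra_distance(ranks_a, ranks_b):
    """Canberra distance between two rank vectors over the same items."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rank vectors must have the same length")
    # Differences near the top of the lists (small ranks) weigh more.
    return sum(abs(a - b) / (a + b) for a, b in zip(ranks_a, ranks_b))


def borda_aggregate(lists):
    """Basic Borda count: an item at 0-based position p in a list of n scores n - p."""
    scores = defaultdict(int)
    for ordered in lists:
        n = len(ordered)
        for pos, item in enumerate(ordered):
            scores[item] += n - pos
    # Highest total first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Toy ranked lists of hypothetical gene identifiers.
lists = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g3", "g4"],
    ["g1", "g3", "g2", "g4"],
]
universe = sorted(lists[0])
d = canberra_distance(to_ranks(lists[0], universe), to_ranks(lists[1], universe))
consensus = borda_aggregate(lists)
```

Swapping the two top genes contributes 1/3 + 1/3 to the distance, whereas the same swap at the bottom of a long list would contribute far less, which is why the distance suits lists where only the top positions matter.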


Briefings in Bioinformatics | 2007

Machine learning methods for predictive proteomics

Annalisa Barla; Giuseppe Jurman; Samantha Riccadonna; Stefano Merler; Marco Chierici; Cesare Furlanello

The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data needs the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by a factor of 10^3, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine, SVM) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing stability and predictive value of the resulting biomarker list is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
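The core idea behind avoiding selection bias can be sketched in a few lines: feature ranking must be repeated inside every resampling split, using training samples only, so that test samples never influence which features are chosen. The code below is a minimal sketch under stated assumptions, not the DAP itself: the ranking score (difference of class means), the nearest-centroid classifier, the binary 0/1 labels, and all function names are stand-ins chosen to keep the example self-contained.

```python
import random


def rank_features(X, y):
    """Rank feature indices by absolute difference of class means (best first)."""
    def class_mean(label, j):
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    n_feat = len(X[0])
    score = [abs(class_mean(1, j) - class_mean(0, j)) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -score[j])


def nearest_centroid_predict(X_train, y_train, feats, row):
    """Predict the label whose class centroid (on `feats`) is closest to `row`."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = 0.0
        for j in feats:
            centroid = sum(r[j] for r in members) / len(members)
            dist += (row[j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def resampled_accuracy(X, y, n_top=1, n_splits=20, test_fraction=0.25, seed=0):
    """Estimate accuracy over stratified random splits, ranking on training data only."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_splits):
        test_idx = set()
        for label in (0, 1):  # stratified: hold out a share of each class
            idx = [i for i, lab in enumerate(y) if lab == label]
            rng.shuffle(idx)
            test_idx.update(idx[:max(1, int(len(idx) * test_fraction))])
        train = [i for i in range(len(y)) if i not in test_idx]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = rank_features(X_tr, y_tr)[:n_top]  # test samples are never seen here
        for i in test_idx:
            correct += nearest_centroid_predict(X_tr, y_tr, feats, X[i]) == y[i]
            total += 1
    return correct / total


# Toy data: feature 0 is informative, the other five are noise.
noise = random.Random(1)
y = [0] * 8 + [1] * 8
X = [[10.0 * lab] + [noise.uniform(-1, 1) for _ in range(5)] for lab in y]
accuracy = resampled_accuracy(X, y)
```

The biased alternative would call `rank_features(X, y)` once on the full dataset before splitting; on pure-noise data that shortcut typically yields accuracy estimates well above chance.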


International Symposium on Neural Networks | 2003

An accelerated procedure for recursive feature ranking on microarray data

Cesare Furlanello; Maria Serafini; Stefano Merler; Giuseppe Jurman

We describe a new wrapper algorithm for fast feature ranking in classification problems. The Entropy-based Recursive Feature Elimination (E-RFE) method eliminates chunks of uninteresting features according to the entropy of the weights distribution of an SVM classifier. With specific regard to DNA microarray datasets, the method is designed to support computationally intensive model selection in classification problems in which the number of features is much larger than the number of samples. We test E-RFE on synthetic and real data sets, comparing it with other SVM-based methods. The speed-up obtained with E-RFE supports predictive modeling on high-dimensional microarray data.
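To make the entropy idea concrete, the sketch below bins the absolute weights of a linear classifier into a histogram, computes its Shannon entropy, and uses that value to decide whether to drop a whole chunk of low-weight features or just one. This is an illustration only: the number of bins, the threshold (half the maximum entropy), the decision rule, and the function names are assumptions, not the rule published for E-RFE. In a full procedure the classifier would be retrained after each elimination step.

```python
import math


def weight_entropy(weights, n_bins=10):
    """Shannon entropy (in bits) of the histogram of |w| rescaled to [0, 1]."""
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    span = (hi - lo) or 1.0  # avoid division by zero when all weights are equal
    counts = [0] * n_bins
    for m in mags:
        idx = min(int((m - lo) / span * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(mags)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def features_to_drop(weights, n_bins=10, entropy_threshold=None):
    """Choose the feature indices to eliminate in one step (illustrative rule)."""
    if entropy_threshold is None:
        entropy_threshold = 0.5 * math.log2(n_bins)  # assumed threshold
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    if hi > lo and weight_entropy(weights, n_bins) < entropy_threshold:
        # Low entropy: weights pile up in few bins, so drop the lowest bin at once.
        cutoff = lo + (hi - lo) / n_bins
        return [i for i, m in enumerate(mags) if m < cutoff]
    # Otherwise fall back to standard RFE and drop the single weakest feature.
    return [min(range(len(mags)), key=mags.__getitem__)]


concentrated = [0.01] * 8 + [0.02, 0.9]       # many near-zero weights
spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # evenly spread weights
chunk = features_to_drop(concentrated)
single = features_to_drop(spread)
```

With the concentrated weights, nine of ten features fall in the lowest bin and are removed in a single step; with the evenly spread weights only the weakest feature is removed.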


Journal of Medical Entomology | 2002

Geographical Information Systems and Bootstrap Aggregation (Bagging) of Tree-Based Classifiers for Lyme Disease Risk Prediction in Trentino, Italian Alps

Annapaola Rizzoli; Stefano Merler; Cesare Furlanello; Claudio Genchi

The risk of exposure to Lyme disease in the province of Trento, Italian Alps, was predicted through the analysis of the distribution of Ixodes ricinus (L.) nymphs infected with Borrelia burgdorferi s.l. with a model based on bootstrap aggregation (bagging) of tree-based classifiers within a geographical information system (GIS). Data on I. ricinus density assessed by dragging the vegetation in 438 sites during 1996 were cross-correlated with the digital cartography of a GIS, which included the variables altitude, exposure and slope, substratum, vegetation type and roe deer density. Ticks were more abundant at altitudes below 1,300 m a.s.l., in the presence of limestone and vegetation cover with thermophile deciduous forests and high densities of roe deer. A bootstrap aggregation procedure (bagging) was used to produce a model for the prediction of tick occurrence. Its accuracy, tested on actual tick counts from a further dragging campaign carried out during 1997 to determine infection prevalence, averaged 77%. Other tests of the model were made on additional and independent data sets. The prevalence of infection with Borrelia burgdorferi s.l., determined by polymerase chain reaction on 2,208 nymphs collected by random dragging in 245 transects selected within eight areas where the model predicted the occurrence of I. ricinus during 1997, was 17.5% and was positively correlated with tick abundance and roe deer density. These findings were used to relate the output of the bagged model (probability of tick occurrence) to the density of infected nymphs through a stepwise model selection procedure and thus to produce a GIS digital map of the probability distribution of infected nymphs in the Province of Trento at high resolution (50 by 50-m cells). The application of the bagging procedure increased the accuracy of the prediction made by a single classification tree, a well-known classification method for the analysis of epidemiological data.
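Bagging itself is simple to sketch: fit one base classifier per bootstrap resample of the training data, then combine the predictions by voting. The example below is a minimal sketch, assuming a single numeric predictor, binary 0/1 labels, and one-split trees (stumps) as a stand-in for the full classification trees used in the study; the toy altitude-like values and all names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree on a single numeric feature; return a predict function."""
    values = sorted(set(xs))
    if len(set(ys)) == 1 or len(values) == 1:
        majority = Counter(ys).most_common(1)[0][0]  # degenerate resample
        return lambda x: majority
    best = None  # (errors, threshold, label_left, label_right)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_probability(models, x):
    """Fraction of models voting for class 1, usable as an occurrence probability."""
    return sum(m(x) for m in models) / len(models)


def bagging_predict(models, x):
    """Majority vote over the ensemble."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


# Toy data: class 0 at low values of the predictor, class 1 at high values.
xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
models = bagging_fit(xs, ys)
```

The vote fraction returned by `bagging_probability` plays the role of the "probability of tick occurrence" mentioned above, which can then be mapped cell by cell.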


Computational Statistics & Data Analysis | 2007

Parallelizing AdaBoost by weights dynamics

Stefano Merler; Bruno Caprile; Cesare Furlanello

AdaBoost is one of the most popular classification methods. In contrast to other ensemble methods (e.g., bagging), AdaBoost is inherently sequential. In many data-intensive real-world situations this may limit the practical applicability of the method. P-AdaBoost is a novel scheme for the parallelization of AdaBoost, which builds upon earlier results concerning the dynamics of AdaBoost weights. P-AdaBoost yields approximations to the standard AdaBoost models that can be easily and efficiently distributed over a network of computing nodes. Properties of P-AdaBoost as a stochastic minimizer of the AdaBoost cost functional are discussed. Experiments are reported on both synthetic and benchmark data sets.
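The sequential nature of AdaBoost is visible in its weight update: the sample weights used in round t are computed from the classifier fitted in round t-1, so the rounds cannot simply run side by side. The sketch below shows standard discrete AdaBoost with threshold stumps on one feature and labels in {-1, +1}; it does not implement P-AdaBoost, and the function names and toy data are assumptions.

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best stump under weights w."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(xs, ys, n_rounds=10):
    """Discrete AdaBoost; returns a list of (alpha, threshold, sign) stumps."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # The new weights depend on the stump just fitted: the sequential step.
        w = [wi * math.exp(-alpha * y * (sign if x > t else -sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (sign if x > t else -sign) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1]
ensemble = adaboost(xs, ys)
```

Bagging has no such dependency between its base models, which is why it parallelizes trivially while AdaBoost needs an approximation scheme to do so.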


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2005

Semisupervised Learning for Molecular Profiling

Cesare Furlanello; Maria Serafini; Stefano Merler; Giuseppe Jurman

Class prediction and feature selection are two learning tasks that are strictly paired in the search for molecular profiles from microarray data. Researchers have become aware of how easy it is to incur a selection bias effect, and complex validation setups are required to avoid overly optimistic estimates of the predictive accuracy of the models and incorrect gene selections. This paper describes a semisupervised pattern discovery approach that uses the by-products of complete validation studies on experimental setups for gene profiling. In particular, we introduce the study of the patterns of single sample responses (sample-tracking profiles) to the gene selection process induced by typical supervised learning tasks in microarray studies. We derive sample-tracking profiles as the aggregated off-training evaluation of SVM models of increasing gene panel sizes. Genes are ranked by E-RFE, an entropy-based variant of the recursive feature elimination for support vector machines (RFE-SVM). A dynamic time warping (DTW) algorithm is then applied to define a metric between sample-tracking profiles. An unsupervised clustering based on the DTW metric allows automating the discovery of outliers and of subtypes of different molecular profiles. Applications are described on synthetic data and in two gene expression studies.
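The DTW step can be illustrated with the classic dynamic-programming recurrence. The sketch below is a minimal version, assuming numeric sequences, absolute difference as the local cost, and no warping-window constraint; the paper may use a different local cost or normalization, and the example profiles are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical sample-tracking profiles: one value per gene panel size.
profile_a = [0, 1, 2]
profile_b = [0, 1, 1, 2]   # same shape, stretched in the middle
profile_c = [1, 1, 1]
```

Because DTW allows one sequence to be locally stretched against the other, `profile_a` and `profile_b` are at distance zero even though their lengths differ; a pairwise distance matrix built this way can then feed any standard clustering routine.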


Information Fusion | 2003

Automatic model selection in cost-sensitive boosting

Stefano Merler; Cesare Furlanello; Barbara Larcher; Andrea Sboner

This paper introduces SSTBoost, a predictive classification methodology designed to target the accuracy of a modified boosting algorithm towards required sensitivity and specificity constraints. The SSTBoost method is demonstrated in practice for the automated medical diagnosis of cancer on a set of skin lesions (42 melanomas and 110 naevi) described by geometric and colorimetric features. A cost-sensitive variant of the AdaBoost algorithm is combined with a procedure for the automatic selection of optimal cost parameters. Within each boosting step, different weights are considered for errors on false negatives and false positives, and differently updated for negatives and positives. Given only a target region in the ROC space, the method also completely automates the selection of the cost parameters ratio, typically of uncertain definition. On the cancer diagnosis problem, SSTBoost outperformed in accuracy and stability a battery of specialized automatic systems based on different types of multiple classifier combinations and a panel of expert dermatologists. The method thus can be applied for the early diagnosis of melanoma cancer or in other problems in which an automated cost-sensitive classification is required.
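The quantities involved in this kind of cost-sensitive evaluation are easy to compute. The sketch below derives sensitivity and specificity from binary predictions, checks them against a target region in ROC space, and computes an error in which false negatives and false positives carry different costs. It does not reproduce the SSTBoost weight update or its cost-selection procedure; the function names, the cost values, and the toy labels are assumptions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def in_target_region(sens, spec, min_sens, min_spec):
    """True if the operating point satisfies both constraints."""
    return sens >= min_sens and spec >= min_spec


def cost_weighted_error(y_true, y_pred, cost_fn, cost_fp):
    """Average misclassification cost with separate costs per error type."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_fn   # missed positive
        elif t == 0 and p == 1:
            total += cost_fp   # false alarm
    return total / len(y_true)


# Toy labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
error = cost_weighted_error(y_true, y_pred, cost_fn=5, cost_fp=1)
```

In a diagnostic setting a missed melanoma is usually judged costlier than a false alarm, so `cost_fn` would be set above `cost_fp`; automating the choice of that ratio from a target ROC region is the problem the abstract describes.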


IEEE Transactions on Signal Processing | 2006

Combining feature selection and DTW for time-varying functional genomics

Cesare Furlanello; Stefano Merler; Giuseppe Jurman

Given temporal high-throughput data defining a two-class functional genomic process, feature selection algorithms may be applied to extract a panel of discriminating gene time series. We aim to identify the main trends of activity through time. A reconstruction method based on stagewise boosting is endowed with a similarity measure based on the dynamic time warping (DTW) algorithm, defining a ranked set of time-series components contributing most to the reconstruction. The approach is applied to synthetic and public microarray data. On the Cardiogenomics PGA Mouse Model of Myocardial Infarction, the approach allows the identification of a time-varying molecular profile of the ventricular remodeling process.


International Journal of Pattern Recognition and Artificial Intelligence | 2004

Bias-Variance Control via Hard Points Shaving

Stefano Merler; Bruno Caprile; Cesare Furlanello

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy in order to avoid overfitting in classification tasks on noisy data. The method is based on a notion of easy and hard training patterns that emerges from the analysis of the dynamical evolution of AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.

Collaboration


Dive into Stefano Merler's collaborations.

Top Co-Authors

Bruno Caprile

Fondazione Bruno Kessler

Antonino Bella

Istituto Superiore di Sanità

Caterina Rizzo

Istituto Superiore di Sanità
