Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Amri Napolitano is active.

Publication


Featured research published by Amri Napolitano.


IEEE Transactions on Systems, Man, and Cybernetics | 2010

RUSBoost: A Hybrid Approach to Alleviating Class Imbalance

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Class imbalance is a problem that is common to many application domains. When examples of one class in a training data set vastly outnumber examples of the other class(es), traditional data mining algorithms tend to create suboptimal classification models. Several techniques have been used to alleviate the problem of class imbalance, including data sampling and boosting. In this paper, we present a new hybrid sampling/boosting algorithm, called RUSBoost, for learning from skewed training data. This algorithm provides a simpler and faster alternative to SMOTEBoost, which is another algorithm that combines boosting and data sampling. This paper evaluates the performance of RUSBoost and SMOTEBoost, as well as their individual components (random undersampling, synthetic minority oversampling technique, and AdaBoost). We conduct experiments using 15 data sets from various application domains, four base learners, and four evaluation metrics. RUSBoost and SMOTEBoost both outperform the other procedures, and RUSBoost performs comparably to (and often better than) SMOTEBoost while being a simpler and faster technique. Given these experimental results, we highly recommend RUSBoost as an attractive alternative for improving the classification performance of learners built using imbalanced data.
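As an illustration only (not code from the paper), the core RUSBoost loop can be sketched as AdaBoost.M1 with the majority class randomly undersampled before each round. This is a minimal sketch assuming binary labels in {0, 1} and decision stumps as the weak learner; the function name `rus_boost` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rus_boost(X, y, n_rounds=10, seed=0):
    """Sketch of RUSBoost: AdaBoost.M1 where the majority class is
    randomly undersampled before each boosting round (labels 0/1)."""
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1.0 / len(y))            # AdaBoost weight distribution
    minority = int(y.sum() < len(y) / 2)         # label of the rarer class
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Random undersampling: keep all minority examples, draw an
        # equal-size subset of the majority class without replacement.
        bag = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        stump = DecisionTreeClassifier(max_depth=1).fit(
            X[bag], y[bag], sample_weight=w[bag])
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # standard AdaBoost update
        w *= np.exp(alpha * np.where(pred != y, 1.0, -1.0))
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```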


International Conference on Machine Learning | 2007

Experimental perspectives on learning from imbalanced data

Jason Van Hulse; Taghi M. Khoshgoftaar; Amri Napolitano

We present a comprehensive suite of experimentation on the subject of learning from imbalanced data. When classes are imbalanced, many learning algorithms suffer reduced performance. Can data sampling be used to improve the performance of learners built from imbalanced data? Is the effectiveness of sampling related to the type of learner? Do the results change if the objective is to optimize different performance metrics? We address these and other questions in this work, showing that sampling will, in many cases, improve classifier performance.
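For concreteness, the two simplest sampling strategies examined in this line of work can be written in a few lines of numpy. This is an illustrative sketch; the helper names are not from the paper.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Drop examples from the larger class(es) until all classes
    are the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes])
    return X[keep], y[keep]

def random_oversample(X, y, seed=0):
    """Duplicate examples from the smaller class(es) until all classes
    are the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.max()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c in classes])
    return X[keep], y[keep]
```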


IEEE Transactions on Systems, Man, and Cybernetics | 2011

Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data

Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

This paper compares the performance of several boosting and bagging techniques in the context of learning from imbalanced and noisy binary-class data. Noise and class imbalance are two well-established data characteristics encountered in a wide range of data mining and machine learning initiatives. The learning algorithms studied in this paper, which include SMOTEBoost, RUSBoost, Exactly Balanced Bagging, and Roughly Balanced Bagging, combine boosting or bagging with data sampling to make them more effective when data are imbalanced. These techniques are evaluated in a comprehensive suite of experiments, for which nearly four million classification models were trained. All classifiers are assessed using seven different performance metrics, providing a complete perspective on the performance of these techniques, and results are tested for statistical significance via analysis-of-variance modeling. The experiments show that the bagging techniques generally outperform boosting, and hence in noisy data environments, bagging is the preferred method for handling class imbalance.
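Of the techniques named above, Exactly Balanced Bagging is perhaps the simplest to sketch: every bag contains all minority examples plus an equal-size random draw from the majority class, and the bagged models vote. A minimal, illustrative version (binary labels 0/1 assumed; the function names are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def exactly_balanced_bagging(X, y, n_bags=25, seed=0):
    """Each bag = all minority examples + an equal-size random draw
    (without replacement) from the majority class."""
    rng = np.random.default_rng(seed)
    minority = int(y.sum() < len(y) / 2)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    models = []
    for _ in range(n_bags):
        bag = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        models.append(DecisionTreeClassifier().fit(X[bag], y[bag]))
    return models

def predict_vote(models, X):
    """Unweighted majority vote over the bagged trees."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```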


International Conference on Data Mining | 2009

Feature Selection with High-Dimensional Imbalanced Data

Jason Van Hulse; Taghi M. Khoshgoftaar; Amri Napolitano; Randall Wald

Feature selection is an important topic in data mining, especially for high-dimensional datasets. Filtering techniques in particular have received much attention, but detailed comparisons of their performance are lacking. This work considers nine filtering techniques: three filters based on classifier performance metrics and six commonly used filters. All nine are compared and contrasted using five different microarray expression datasets. In addition, given that these datasets exhibit an imbalance between the number of positive and negative examples, the use of sampling techniques in the context of feature selection is also examined.
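One way to combine sampling with filter-based feature selection, in the spirit of what the paper examines, is to rank features on an undersampled copy of the data. A hedged sketch follows; the specific filter shown (scikit-learn's mutual-information score) is illustrative, not necessarily one of the nine filters from the study.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_after_undersampling(X, y, seed=0):
    """Undersample to balanced classes, then rank features by a
    filter score computed on the balanced sample."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes])
    scores = mutual_info_classif(X[keep], y[keep], random_state=0)
    return np.argsort(scores)[::-1]   # feature indices, best first
```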


Information Reuse and Integration | 2012

A review of the stability of feature selection techniques for bioinformatics data

Wael Awada; Taghi M. Khoshgoftaar; David J. Dittman; Randall Wald; Amri Napolitano

Feature selection is an important step in data mining and is used in various domains including genetics, medicine, and bioinformatics. Choosing the important features (genes) is essential for the discovery of new knowledge hidden within the genetic code as well as for the identification of important biomarkers. Although feature selection methods can help sort through large numbers of genes based on their relevance to the problem at hand, the results generated tend to be unstable and thus cannot be reproduced in other experiments. Research interest in the stability of feature ranking methods has therefore grown: researchers have produced experimental designs for testing the stability of feature selection, new metrics for measuring stability, and new techniques designed to improve the stability of the feature selection process. In this paper, we introduce the role of stability in feature selection with DNA microarray data. We list various ways of improving feature ranking stability and discuss feature selection techniques, specifically explaining ensemble feature ranking and presenting various ensemble feature ranking aggregation methods. Finally, we discuss experimental procedures, such as dataset perturbation, fixed-overlap partitioning, and cross-validation, that help researchers analyze and measure the stability of feature ranking methods. Throughout this work, we survey current research in the field and discuss possible avenues for continuing such research.
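As one concrete way to measure the stability discussed here: rank features on perturbed copies of the data and compute the overlap of the top-k lists. The sketch below uses bootstrap subsamples, a mutual-information ranker, and mean pairwise Jaccard overlap; all three choices are assumptions for illustration (the paper surveys several perturbation schemes and stability metrics).

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import mutual_info_classif

def topk_stability(X, y, k=50, n_runs=10, seed=0):
    """Mean pairwise Jaccard overlap of top-k feature lists across
    bootstrap subsamples; 1.0 = perfectly stable ranking."""
    rng = np.random.default_rng(seed)
    tops = []
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=len(y), replace=True)   # bootstrap
        scores = mutual_info_classif(X[idx], y[idx], random_state=0)
        tops.append(set(np.argsort(scores)[::-1][:k]))
    return np.mean([len(a & b) / len(a | b)
                    for a, b in combinations(tops, 2)])
```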


International Conference on Pattern Recognition | 2008

RUSBoost: Improving classification performance when training data is skewed

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Constructing classification models using skewed training data can be a challenging task. We present RUSBoost, a new algorithm for alleviating the problem of class imbalance. RUSBoost combines data sampling and boosting, providing a simple and efficient method for improving classification performance when training data is imbalanced. In addition to performing favorably when compared to SMOTEBoost (another hybrid sampling/boosting algorithm), RUSBoost is computationally less expensive than SMOTEBoost and results in significantly shorter model training times. This combination of simplicity, speed and performance makes RUSBoost an excellent technique for learning from imbalanced data.
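For readers who want to try the algorithm, the imbalanced-learn library ships a RUSBoostClassifier. The snippet below shows typical usage on a purely synthetic, illustrative dataset; it is not the authors' original implementation.

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# A skewed synthetic problem: roughly 5% positive examples.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```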


International Conference on Machine Learning and Applications | 2007

Learning with limited minority class data

Taghi M. Khoshgoftaar; C. Seiffert; J. Van Hulse; Amri Napolitano; Andres Folleco

For both single probability estimation trees (PETs) and ensembles of such trees, commonly employed class probability estimates correct the observed relative class frequencies in each leaf to avoid anomalies caused by small sample sizes. The effect of such corrections in random forests of PETs is investigated, and the use of the relative class frequency is compared to two corrected estimates, the Laplace estimate and the m-estimate. An experiment with 34 datasets from the UCI repository shows that estimating class probabilities using the relative class frequency clearly outperforms both the Laplace estimate and the m-estimate with respect to accuracy, area under the ROC curve (AUC), and Brier score. Hence, in contrast to what is commonly employed for PETs and ensembles of PETs, these results strongly suggest that a non-corrected probability estimate should be used in random forests of PETs. The experiment further shows that learning random forests of PETs using the relative class frequency significantly outperforms learning random forests of classification trees (i.e., trees for which only an unweighted vote on the most probable class is counted) with respect to both accuracy and AUC, but that the latter is clearly ahead of the former with respect to Brier score.
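The three leaf-probability estimates compared in this abstract differ only in their smoothing terms. For a leaf containing n training examples, k of which belong to the class in question, with C classes, class prior p, and smoothing strength m, the standard formulas are:

```python
def relative_frequency(k, n):
    """Raw, uncorrected estimate: k class members among n leaf examples."""
    return k / n

def laplace_estimate(k, n, C):
    """Laplace correction: add one pseudo-count per class."""
    return (k + 1) / (n + C)

def m_estimate(k, n, p, m):
    """m-estimate: shrink the raw estimate towards the class prior p
    with strength m (m pseudo-examples distributed according to p)."""
    return (k + m * p) / (n + m)
```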


Information Reuse and Integration | 2011

A comparative evaluation of feature ranking methods for high dimensional bioinformatics data

Jason Van Hulse; Taghi M. Khoshgoftaar; Amri Napolitano

Feature selection is an important component of data mining analysis with high dimensional data. Reducing the number of features in the dataset can have numerous positive implications, such as eliminating redundant or irrelevant features, decreasing development time and improving the performance of classification models. In this work, four filter-based feature selection techniques are compared using a wide variety of bioinformatics datasets. The first three filters, χ2, Relief-F and Information Gain, are widely used techniques that are well known to many researchers and practitioners. The fourth filter, recently proposed by our research group and denoted TBFS-AUC (i.e., Threshold-Based Feature Selection technique with the AUC metric), is compared to these three commonly-used techniques using three different classification performance metrics. The empirical results demonstrate the strong performance of our technique.
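A hedged reading of the TBFS-AUC idea (the exact formulation is in the authors' earlier work) is to use each feature's raw values as a ranking score for the class and take the resulting AUC, symmetrized because a feature may be inversely related to the class. The sketch below follows that reading; the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def tbfs_auc_scores(X, y):
    """Score each feature by the AUC obtained when its raw values are
    used directly as the classifier output (binary labels assumed)."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        scores[j] = max(auc, 1.0 - auc)   # ignore the feature's direction
    return scores                          # higher = more discriminative
```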


International Conference on Machine Learning and Applications | 2010

A Comparative Study of Ensemble Feature Selection Techniques for Software Defect Prediction

Huanjing Wang; Taghi M. Khoshgoftaar; Amri Napolitano

Feature selection has become an essential step in many data mining applications. Using a single feature subset selection method may yield a merely locally optimal feature subset. Ensembles of feature selection methods attempt to combine multiple feature selection methods instead of relying on a single one. We present a comprehensive empirical study examining 17 different ensembles of feature ranking techniques (rankers), including six commonly used feature ranking techniques, the signal-to-noise filter technique, and 11 threshold-based feature ranking techniques. This study utilized 16 real-world software measurement data sets of different sizes and built 13,600 classification models. Experimental results indicate that ensembles of very few rankers are very effective, and even better than ensembles of many or all rankers.
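Ensembles of rankers are typically combined by rank aggregation. A minimal sketch using mean rank across rankers is shown below; the two rankers plugged in are illustrative stand-ins, not the specific filters from the study.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.feature_selection import mutual_info_classif, f_classif

def ensemble_rank(X, y, rankers):
    """Average the per-feature ranks produced by each ranker.
    Higher score = better feature, so rank the negated scores."""
    ranks = [rankdata(-r(X, y)) for r in rankers]   # rank 1 = best feature
    return np.mean(ranks, axis=0)                    # lower = better overall

rankers = [
    lambda X, y: mutual_info_classif(X, y, random_state=0),
    lambda X, y: f_classif(X, y)[0],                 # ANOVA F-statistic
]
```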


IEEE Transactions on Neural Networks | 2010

Supervised Neural Network Modeling: An Empirical Investigation Into Learning From Imbalanced Data With Labeling Errors

Taghi M. Khoshgoftaar; Jason Van Hulse; Amri Napolitano

Neural network algorithms such as multilayer perceptrons (MLPs) and radial basis function networks (RBFNets) have been used to construct learners that exhibit strong predictive performance. Two data-related issues that can have a detrimental impact on supervised learning initiatives are class imbalance and labeling errors (or class noise). Imbalanced data can make it more difficult for the neural network learning algorithms to distinguish between examples of the various classes, and class noise can lead to the formulation of incorrect hypotheses. Both class imbalance and labeling errors are pervasive problems encountered in a wide variety of application domains. Many studies have been performed to investigate these problems in isolation, but few have focused on their combined effects. This study presents a comprehensive empirical investigation using neural network algorithms to learn from imbalanced data with labeling errors. In particular, the first component of our study investigates the impact of class noise and class imbalance on two common neural network learning algorithms, while the second component considers the ability of data sampling (which is commonly used to address the issue of class imbalance) to improve their performance. Our results, for which over two million models were trained and evaluated, show that conclusions drawn using the more commonly studied C4.5 classifier may not apply when using neural networks.
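The kind of experiment described, injecting class noise and then training a neural network, can be sketched as follows. Everything here (noise level, architecture, synthetic data) is an illustrative assumption, not the paper's protocol.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Inject 10% class noise into the training labels only: flip a random subset.
noisy = y_tr.copy()
flip = rng.choice(len(noisy), size=len(noisy) // 10, replace=False)
noisy[flip] = 1 - noisy[flip]

# Train on the corrupted labels, evaluate against the clean test labels.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                    random_state=0).fit(X_tr, noisy)
print("test AUC:", roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1]))
```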

Collaboration


Dive into Amri Napolitano's collaborations.

Top Co-Authors

Randall Wald, Florida Atlantic University
David J. Dittman, Florida Atlantic University
Kehan Gao, Florida Atlantic University
Jason Van Hulse, Florida Atlantic University
Huanjing Wang, Western Kentucky University
Alireza Fazelpour, Florida Atlantic University
C. Seiffert, Florida Atlantic University
Joseph D. Prusa, Florida Atlantic University