Andres Folleco
Florida Atlantic University
Publications
Featured research published by Andres Folleco.
Information Sciences | 2014
C. Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse; Andres Folleco
Data mining techniques are commonly used to construct models for identifying software modules that are most likely to contain faults. In doing so, an organization's limited resources can be intelligently allocated with the goal of detecting and correcting the greatest number of faults. However, there are two characteristics of software quality datasets that can negatively impact the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced. That is, most of a software system's faults can be found in a small percentage of software modules. Therefore, the number of fault-prone (fp) examples (program modules) in a software project dataset is much smaller than the number of not fault-prone (nfp) examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering a training dataset's distribution. A program module contains class noise if it is incorrectly labeled. While several studies have been performed to evaluate data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
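As a rough illustration of the kind of data sampling the abstract refers to, the sketch below rebalances a binary fp/nfp training set by random undersampling or random oversampling. The function names and the equal-size target ratio are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of random undersampling and oversampling to rebalance a
# fault-prone (fp, label 1) vs. not fault-prone (nfp, label 0) training set.
# The 50/50 target ratio is illustrative, not the paper's exact procedure.
import numpy as np

def random_undersample(X, y, minority_label=1, rng=None):
    """Discard majority-class examples until both classes are equal in size."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    keep_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, keep_majority])
    return X[keep], y[keep]

def random_oversample(X, y, minority_label=1, rng=None):
    """Duplicate minority-class examples until both classes are equal in size."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority_idx, size=majority_idx.size - minority_idx.size,
                       replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]
```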
international conference on machine learning and applications | 2007
Taghi M. Khoshgoftaar; C. Seiffert; J. Van Hulse; Amri Napolitano; Andres Folleco
For both single probability estimation trees (PETs) and ensembles of such trees, commonly employed class probability estimates correct the observed relative class frequencies in each leaf to avoid anomalies caused by small sample sizes. The effect of such corrections in random forests of PETs is investigated, and the use of the relative class frequency is compared to using two corrected estimates, the Laplace estimate and the m-estimate. An experiment with 34 datasets from the UCI repository shows that estimating class probabilities using relative class frequency clearly outperforms both using the Laplace estimate and the m-estimate with respect to accuracy, area under the ROC curve (AUC) and Brier score. Hence, in contrast to what is commonly employed for PETs and ensembles of PETs, these results strongly suggest that a non-corrected probability estimate should be used in random forests of PETs. The experiment further shows that learning random forests of PETs using relative class frequency significantly outperforms learning random forests of classification trees (i.e., trees for which only an unweighted vote on the most probable class is counted) with respect to both accuracy and AUC, but that the latter is clearly ahead of the former with respect to Brier score.
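The three leaf estimates being compared can be written down directly. The sketch below assumes a leaf with n_c examples of class c out of n total, k classes, a class prior p_c, and the common default m = 2 for the m-estimate; the parameter choices are illustrative.

```python
# Sketch of the three leaf probability estimates compared in the abstract.
def relative_frequency(n_c, n):
    return n_c / n

def laplace_estimate(n_c, n, k):
    # Adds one pseudo-count per class.
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_c, m=2.0):
    # Shrinks the observed frequency toward the class prior p_c.
    return (n_c + m * p_c) / (n + m)

# Example: a leaf with 3 positives out of 4 examples in a 2-class problem
# with a 10% positive prior.
print(relative_frequency(3, 4))   # 0.75
print(laplace_estimate(3, 4, 2))  # ~0.667
print(m_estimate(3, 4, 0.1))      # ~0.533
```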
world congress on computational intelligence | 2008
Andres Folleco; Taghi M. Khoshgoftaar; J. Van Hulse; Lofton A. Bullard
This study investigates the impact of increasing levels of simulated class noise on software quality classification. Class noise was injected into seven software engineering measurement datasets, and the performance of three learners (random forests, C4.5, and Naive Bayes) was analyzed. The random forest classifier was chosen for this study because of its strong performance relative to well-known and commonly used classifiers such as C4.5 and Naive Bayes, and because relatively little prior research in software quality classification has considered it. The experimental factors considered in this study were the level of class noise and the percentage of minority instances injected with noise. The empirical results demonstrate that the random forest obtained the best and most consistent classification performance in all experiments.
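A minimal sketch of how simulated class noise of this kind could be injected is shown below; the parameterization (an overall noise level plus the share of flipped instances drawn from the minority class) is an assumption, not the paper's procedure.

```python
# Sketch of simulated class noise injection: flip the labels of a chosen
# percentage of instances, drawing a chosen share of them from the minority
# class. Assumes binary 0/1 labels.
import numpy as np

def inject_class_noise(y, noise_level, minority_fraction, minority_label=1, rng=None):
    rng = np.random.default_rng(rng)
    y = y.copy()
    n_noisy = int(round(noise_level * y.size))
    n_minority = int(round(minority_fraction * n_noisy))
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    flip = np.concatenate([
        rng.choice(minority_idx, size=min(n_minority, minority_idx.size), replace=False),
        rng.choice(majority_idx, size=min(n_noisy - n_minority, majority_idx.size), replace=False),
    ])
    y[flip] = 1 - y[flip]   # flip binary labels
    return y
```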
information reuse and integration | 2006
Qiming Luo; Taghi M. Khoshgoftaar; Andres Folleco
Object classification is an important component of a complete visual surveillance system. In the context of coastline surveillance, we present an empirical study on classifying 402 instances of ship regions into 6 types based on their shape features. The ship regions were extracted from surveillance videos, and the 6 ship types as well as the ground truth classification labels were provided by human observers. The shape feature of each region was extracted using the MPEG-7 region-based shape descriptor. We applied the k-nearest-neighbor algorithm to classify ships based on the similarity of their shape features; the classification accuracy based on stratified ten-fold cross validation is about 91%. The proposed classification procedure, based on the MPEG-7 region-based shape descriptor and the k-nearest-neighbor algorithm, is robust to noise and imperfect object segmentation. It can also be applied to the classification of other rigid objects, such as airplanes and vehicles.
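The classification step described here maps naturally onto standard tooling. The sketch below assumes the MPEG-7 shape features have already been extracted into an array; the file names and the choice of k are hypothetical.

```python
# Sketch of the classification step only: k-nearest-neighbor on precomputed
# shape feature vectors with stratified ten-fold cross validation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

shape_features = np.load("shape_features.npy")   # hypothetical file: (n_regions, n_features)
ship_type = np.load("ship_type.npy")             # hypothetical file: ground-truth labels

knn = KNeighborsClassifier(n_neighbors=3)        # k is illustrative
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(knn, shape_features, ship_type, cv=cv)
print(f"mean accuracy: {scores.mean():.3f}")
```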
information reuse and integration | 2008
Andres Folleco; Taghi M. Khoshgoftaar; Jason Van Hulse; Lofton A. Bullard
Real-world datasets commonly contain noise that is distributed in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly affect the classification performance of learners. In this study, we identify learners with robust performance in the presence of low quality (noisy) measurement data. Noise was injected into five class imbalanced software engineering measurement datasets that were initially relatively free of noise. The experimental factors considered included the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. No other related studies were found that identify learners that are robust in the presence of low quality measurement data. Based on the results of this study, we recommend using the random forest learner for building classification models from noisy data.
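One plausible way to inject noise into both the independent and dependent variables is sketched below; the corruption model (uniform resampling within each attribute's observed range plus random label flips) is an assumption rather than the study's exact procedure.

```python
# Sketch of injecting noise into both independent and dependent variables.
import numpy as np

def corrupt(X, y, attr_noise=0.1, class_noise=0.1, rng=None):
    rng = np.random.default_rng(rng)
    X, y = X.copy(), y.copy()
    # Independent variables: replace randomly chosen cells with values drawn
    # uniformly from that attribute's observed range.
    mask = rng.random(X.shape) < attr_noise
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=X.shape)
    X[mask] = noise[mask]
    # Dependent variable: flip randomly chosen binary class labels.
    flip = rng.random(y.size) < class_noise
    y[flip] = 1 - y[flip]
    return X, y
```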
information reuse and integration | 2007
C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Andres Folleco
In the domain of software quality classification, data mining techniques are used to construct models (learners) for identifying software modules that are most likely to be fault-prone. The performance of these models, however, can be negatively affected by class imbalance and noise. Data sampling techniques have been proposed to alleviate the problem of class imbalance, but the impact of data quality on these techniques has not been adequately addressed. We examine the combined effects of noise and imbalance on classification performance when seven commonly used sampling techniques are applied to software quality measurement data. Our results show that some sampling techniques are more robust in the presence of noise than others. Further, the effect of noise on sampling techniques varies with the level of imbalance.
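The experimental layout, crossing noise levels with sampling techniques and scoring a learner on clean test data, might look roughly like the sketch below. The two samplers shown are stand-ins for the seven techniques the study compares, and the synthetic dataset is purely illustrative.

```python
# Sketch of the experiment grid: noise level x sampling technique, scored by AUC.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

def no_sampling(X, y, rng):
    return X, y

def random_undersample(X, y, rng):
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    keep = np.concatenate([pos, rng.choice(neg, size=pos.size, replace=False)])
    return X[keep], y[keep]

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rng = np.random.default_rng(0)

for noise in (0.0, 0.1, 0.2):                      # injected class noise levels
    y_noisy = y_tr.copy()
    flip = rng.random(y_noisy.size) < noise
    y_noisy[flip] = 1 - y_noisy[flip]
    for name, sampler in [("none", no_sampling), ("RUS", random_undersample)]:
        Xs, ys = sampler(X_tr, y_noisy, rng)
        model = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"noise={noise:.1f} sampler={name} AUC={auc:.3f}")
```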
international conference on machine learning and applications | 2008
Andres Folleco; Taghi M. Khoshgoftaar; Amri Napolitano
Erroneous attribute values can significantly impact learning from otherwise valuable data, and this impact can be exacerbated by class imbalanced training data. We investigate and compare the overall learning impact of sampling such data using four distinct performance metrics suitable for models built from binary class imbalanced data. Seven class imbalanced software engineering measurement datasets, initially relatively free of noise, were used. A novel noise injection procedure was applied to these datasets: we injected domain-realistic noise into the independent and dependent (class) attributes of randomly selected instances to simulate lower quality measurement data. Seven well-known data sampling techniques were used with the benchmark decision tree learner C4.5. No other related studies were found that comprehensively investigate learning by sampling low quality binary class imbalanced data containing both corrupted independent and dependent attributes. Two sampling techniques (random undersampling and Wilson's editing) with better and more robust learning performance were identified. In contrast, all metrics concurred in identifying the worst performing sampling technique (cluster-based oversampling).
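Wilson's editing, one of the two techniques singled out here, can be sketched as follows: drop training instances whose class disagrees with the vote of their k nearest neighbors. Restricting the removal to majority-class instances, as done below, is one common variant for imbalanced data and is an assumption on our part.

```python
# Sketch of Wilson's editing (edited nearest neighbor) for binary 0/1 labels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def wilsons_editing(X, y, k=3, majority_label=0):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the point itself
    neighbor_labels = y[idx[:, 1:]]
    neighbor_vote = (neighbor_labels.mean(axis=1) >= 0.5).astype(y.dtype)
    misclassified = neighbor_vote != y
    drop = misclassified & (y == majority_label)   # only edit the majority class
    return X[~drop], y[~drop]
```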
information reuse and integration | 2006
Taghi M. Khoshgoftaar; Andres Folleco; Jason Van Hulse; Lofton A. Bullard
The detrimental effects of noise in a dependent variable on the accuracy of software quality imputation techniques were studied. The imputation techniques used in this work were Bayesian multiple imputation, mean imputation, instance-based learning, regression imputation, and the REPTree decision tree. These techniques were used to obtain software quality imputations for a large military command, control, and communications system (CCCS) dataset. The underlying quality of the data was a significant factor affecting the accuracy of the imputation techniques. Multiple imputation and regression imputation were the top performers, while mean imputation was ineffective.
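Several of these imputation families have off-the-shelf counterparts; the sketch below applies mean, instance-based (k-nearest-neighbor), and regression-style imputation to a toy matrix with missing values. IterativeImputer is used here as a stand-in for regression imputation, and Bayesian multiple imputation is not shown.

```python
# Sketch of three imputation families on a toy matrix with NaNs.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)            # instance-based
regression_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(mean_imputed, knn_imputed, regression_imputed, sep="\n")
```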
international symposium on visual computing | 2007
Xiaoyuan Su; Taghi M. Khoshgoftaar; Xingquan Zhu; Andres Folleco
In order to address the challenges of occlusions and background variations, we propose a novel and effective rule-based multiple object tracking system for traffic surveillance using a collaborative background extraction algorithm. The collaborative background extraction algorithm collaboratively extracts a background from multiple independent extractions to remove spurious background pixels. The rule-based strategies are applied for thresholding, outlier removal, object consolidation, separating neighboring objects, and shadow removal. Empirical results show that our multiple object tracking system is highly accurate for traffic surveillance under occlusion conditions.
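Only the collaborative idea is sketched below: several independently extracted backgrounds are combined pixel by pixel (here with a median) so that spurious pixels surviving any single extraction are voted out, followed by a simple thresholding rule. The combination rule and threshold are illustrative, not the paper's.

```python
# Sketch of combining multiple independent background extractions.
import numpy as np

def collaborative_background(backgrounds):
    """backgrounds: list of H x W (or H x W x 3) arrays from independent
    background extraction runs; returns the pixel-wise median."""
    stack = np.stack(backgrounds, axis=0)
    return np.median(stack, axis=0)

def foreground_mask(frame, background, threshold=30):
    """Simple thresholding rule: mark pixels far from the background."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff.max(axis=-1) > threshold if diff.ndim == 3 else diff > threshold
```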
Archive | 2008
Andres Folleco; Taghi M. Khoshgoftaar; Jason Van Hulse
This study examines the impact of noise on the evaluation of software quality imputation techniques. The imputation procedures evaluated in this work include Bayesian multiple imputation, mean imputation, nearest neighbor imputation, regression imputation, and REPTree (decision tree) imputation. These techniques were used to impute missing software measurement data for a large military command, control, and communications system dataset (CCCS). A randomized three-way complete block design analysis of variance model using the average absolute error as the response variable was built to analyze the imputation results. Multiple pairwise comparisons using Fisher and Tukey-Kramer tests were conducted to demonstrate the performance differences amongst the significant experimental factors. The underlying quality of data was a significant factor affecting the accuracy of the imputation techniques. Bayesian multiple imputation and regression imputation were top performers, while mean imputation was ineffective.
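The analysis step could be reproduced along the following lines, assuming a long-format results table with one row per (technique, noise level, block) cell and the average absolute error as the response; the column and file names are hypothetical, and Tukey's HSD stands in for the Tukey-Kramer comparisons.

```python
# Sketch of the block-design ANOVA and pairwise comparisons.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

results = pd.read_csv("imputation_results.csv")   # hypothetical results table

model = smf.ols("aae ~ C(technique) + C(noise_level) + C(block)", data=results).fit()
print(anova_lm(model, typ=2))                     # significance of each factor

# Pairwise comparison of imputation techniques.
print(pairwise_tukeyhsd(results["aae"], results["technique"]))
```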