Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where J. Van Hulse is active.

Publication


Featured research published by J. Van Hulse.


systems man and cybernetics | 2010

RUSBoost: A Hybrid Approach to Alleviating Class Imbalance

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Class imbalance is a problem that is common to many application domains. When examples of one class in a training data set vastly outnumber examples of the other class(es), traditional data mining algorithms tend to create suboptimal classification models. Several techniques have been used to alleviate the problem of class imbalance, including data sampling and boosting. In this paper, we present a new hybrid sampling/boosting algorithm, called RUSBoost, for learning from skewed training data. This algorithm provides a simpler and faster alternative to SMOTEBoost, which is another algorithm that combines boosting and data sampling. This paper evaluates the performance of RUSBoost and SMOTEBoost, as well as their individual components (random undersampling, synthetic minority oversampling technique, and AdaBoost). We conduct experiments using 15 data sets from various application domains, four base learners, and four evaluation metrics. RUSBoost and SMOTEBoost both outperform the other procedures, and RUSBoost performs comparably to (and often better than) SMOTEBoost while being a simpler and faster technique. Given these experimental results, we highly recommend RUSBoost as an attractive alternative for improving the classification performance of learners built using imbalanced data.
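
To make the procedure concrete, here is a minimal sketch using the imbalanced-learn library's RUSBoostClassifier, which implements the random-undersampling-plus-AdaBoost combination described above; the synthetic dataset, split, and parameter values are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: RUSBoost via imbalanced-learn, which implements the
# random-undersampling + AdaBoost combination described above.
# The synthetic dataset and parameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.ensemble import RUSBoostClassifier

# Build a skewed binary dataset (roughly 5% minority class).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Each boosting round randomly undersamples the majority class
# before fitting the next weak learner.
clf = RUSBoostClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```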


international conference on tools with artificial intelligence | 2007

An Empirical Study of Learning from Imbalanced Data Using Random Forest

Taghi M. Khoshgoftaar; M. Golawala; J. Van Hulse

This paper discusses a comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka. RF is a relatively new learner, and to the best of our knowledge, only preliminary experimentation on the construction of random forest classifiers in the context of imbalanced data has been reported in previous work. Therefore, the contribution of this study is to provide an extensive empirical evaluation of RF learners built from imbalanced data. What should be the recommended default number of trees in the ensemble? What should the recommended value be for the number of attributes? How does the RF learner perform on imbalanced data when compared with other commonly-used learners? We address these and other related issues in this work.
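
The sweep below is a small sketch of the kind of experiment the abstract describes, using scikit-learn's RandomForestClassifier as a stand-in for the Weka implementation studied in the paper; the dataset and the candidate values for the number of trees and attributes per split are assumptions for illustration.

```python
# Sketch of a parameter sweep over ensemble size and attributes per
# split, the two factors the paper asks about, scored by AUC on an
# imbalanced synthetic dataset. Not the paper's Weka setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)

for n_trees in (10, 50, 100):
    for max_feats in ("sqrt", "log2", None):
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    max_features=max_feats,
                                    random_state=0)
        auc = cross_val_score(rf, X, y, scoring="roc_auc", cv=5).mean()
        print(f"trees={n_trees:3d} max_features={max_feats}: AUC={auc:.3f}")
```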


systems man and cybernetics | 2011

Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data

Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

This paper compares the performance of several boosting and bagging techniques in the context of learning from imbalanced and noisy binary-class data. Noise and class imbalance are two well-established data characteristics encountered in a wide range of data mining and machine learning initiatives. The learning algorithms studied in this paper, which include SMOTEBoost, RUSBoost, Exactly Balanced Bagging, and Roughly Balanced Bagging, combine boosting or bagging with data sampling to make them more effective when data are imbalanced. These techniques are evaluated in a comprehensive suite of experiments, for which nearly four million classification models were trained. All classifiers are assessed using seven different performance metrics, providing a complete perspective on the performance of these techniques, and results are tested for statistical significance via analysis-of-variance modeling. The experiments show that the bagging techniques generally outperform boosting, and hence in noisy data environments, bagging is the preferred method for handling class imbalance.
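
A rough sketch of such a comparison follows, assuming imbalanced-learn's BalancedBaggingClassifier as the closest readily available analogue of the balanced-bagging techniques named above, with label noise injected via make_classification's flip_y parameter; none of this reproduces the paper's actual experimental design.

```python
# Sketch contrasting a sampling-based bagging ensemble with a
# sampling-based boosting one on noisy, imbalanced synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedBaggingClassifier, RUSBoostClassifier

X, y = make_classification(n_samples=3000, weights=[0.92, 0.08],
                           flip_y=0.05,  # inject ~5% class noise
                           random_state=1)

for name, clf in [("balanced bagging", BalancedBaggingClassifier(random_state=1)),
                  ("RUSBoost", RUSBoostClassifier(random_state=1))]:
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC={auc:.3f}")
```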


systems man and cybernetics | 2009

Improving Software-Quality Predictions With Data Sampling and Boosting

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50,000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models.
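
The sketch below mirrors the shape of that comparison on synthetic data: a plain decision tree, a tree trained after random undersampling, and boosted trees. The dataset and the choice of sampler and learner are illustrative assumptions, not the paper's software-quality data or its full set of five sampling techniques.

```python
# Sketch of the paper's comparison on an imbalanced dataset: a plain
# decision tree vs. one trained on undersampled data vs. boosting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07],
                           random_state=7)

models = {
    "tree": DecisionTreeClassifier(random_state=7),
    "RUS + tree": make_pipeline(RandomUnderSampler(random_state=7),
                                DecisionTreeClassifier(random_state=7)),
    "AdaBoost": AdaBoostClassifier(random_state=7),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC={auc:.3f}")
```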


international conference on pattern recognition | 2008

RUSBoost: Improving classification performance when training data is skewed

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Constructing classification models using skewed training data can be a challenging task. We present RUSBoost, a new algorithm for alleviating the problem of class imbalance. RUSBoost combines data sampling and boosting, providing a simple and efficient method for improving classification performance when training data is imbalanced. In addition to performing favorably when compared to SMOTEBoost (another hybrid sampling/boosting algorithm), RUSBoost is computationally less expensive than SMOTEBoost and results in significantly shorter model training times. This combination of simplicity, speed and performance makes RUSBoost an excellent technique for learning from imbalanced data.
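
For readers who want to see the mechanics, here is a compact hand-rolled sketch of the RUSBoost loop: each boosting round undersamples the majority class before fitting a weak learner, then applies the usual AdaBoost weight update on the full training set. This is a simplified illustration, not the authors' reference implementation.

```python
# Hand-rolled RUSBoost sketch for labels in {-1, +1}. Simplified and
# illustrative only; a production version should use a tested library.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rusboost_fit(X, y, n_rounds=20, rng=np.random.default_rng(0)):
    n = len(y)
    w = np.full(n, 1.0 / n)  # example weights
    minority = min((-1, 1), key=lambda c: np.sum(y == c))
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Random undersampling: keep all minority examples plus an
        # equal-sized random subset of the majority class.
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y != minority)
        keep = np.concatenate([min_idx,
                               rng.choice(maj_idx, size=len(min_idx),
                                          replace=False)])
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X[keep], y[keep], sample_weight=w[keep])
        pred = stump.predict(X)
        # Weighted error and AdaBoost weight update on the FULL set.
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def rusboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for a, h in zip(learners, alphas))
    return np.sign(scores)
```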


international conference on machine learning and applications | 2007

Learning with limited minority class data

Taghi M. Khoshgoftaar; C. Seiffert; J. Van Hulse; Amri Napolitano; Andres Folleco

For both single probability estimation trees (PETs) and ensembles of such trees, commonly employed class probability estimates correct the observed relative class frequencies in each leaf to avoid anomalies caused by small sample sizes. The effect of such corrections in random forests of PETs is investigated, and the use of the relative class frequency is compared to using two corrected estimates, the Laplace estimate and the m-estimate. An experiment with 34 datasets from the UCI repository shows that estimating class probabilities using relative class frequency clearly outperforms both using the Laplace estimate and the m-estimate with respect to accuracy, area under the ROC curve (AUC) and Brier score. Hence, in contrast to what is commonly employed for PETs and ensembles of PETs, these results strongly suggest that a non-corrected probability estimate should be used in random forests of PETs. The experiment further shows that learning random forests of PETs using relative class frequency significantly outperforms learning random forests of classification trees (i.e., trees for which only an unweighted vote on the most probable class is counted) with respect to both accuracy and AUC, but that the latter is clearly ahead of the former with respect to Brier score.
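
The three leaf-probability estimates compared in this abstract have standard closed forms; the toy functions below state them for a leaf containing n examples, k of which belong to the class of interest (the function names and the m=2 default are illustrative choices, not the paper's).

```python
# The three leaf-probability estimates compared above, for a leaf with
# n examples of which k belong to the class of interest.
def relative_frequency(k, n):
    return k / n

def laplace(k, n, n_classes=2):
    # Adds one pseudo-count per class.
    return (k + 1) / (n + n_classes)

def m_estimate(k, n, prior, m=2):
    # Shrinks the leaf frequency toward the class prior.
    return (k + m * prior) / (n + m)

print(relative_frequency(3, 4), laplace(3, 4), m_estimate(3, 4, prior=0.1))
```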


international conference on tools with artificial intelligence | 2008

Resampling or Reweighting: A Comparison of Boosting Implementations

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Boosting has been shown to improve the performance of classifiers in many situations, including when data is imbalanced. There are, however, two possible implementations of boosting, and it is unclear which should be used. Boosting by reweighting is typically used, but can only be applied to base learners which are designed to handle example weights. On the other hand, boosting by resampling can be applied to any base learner. In this work, we empirically evaluate the differences between these two boosting implementations using imbalanced training data. Using 10 boosting algorithms, 4 learners and 15 datasets, we find that boosting by resampling performs as well as, or significantly better than, boosting by reweighting (which is often the default boosting implementation). We therefore conclude that in general, boosting by resampling is preferred over boosting by reweighting.
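
The difference between the two implementations can be stated in a few lines; the fragment below is an illustrative sketch, not the paper's code, and assumes the weight vector w is normalized to sum to one.

```python
# Sketch of the two boosting implementations compared above:
# reweighting passes the example weights to the base learner, while
# resampling draws a sample proportional to those weights and fits
# the learner unweighted.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

def fit_round(X, y, w, by_resampling):
    # w must be normalized so that it sums to 1.
    if by_resampling:
        # Works with ANY base learner, weight-aware or not.
        idx = rng.choice(len(y), size=len(y), p=w)
        return DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    # Requires a base learner that accepts sample_weight.
    return DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

# Demo on a toy imbalanced dataset.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
w = np.full(len(y), 1.0 / len(y))
for flag in (True, False):
    stump = fit_round(X, y, w, by_resampling=flag)
    print("resampling" if flag else "reweighting", stump.score(X, y))
```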


information reuse and integration | 2005

Empirical case studies in attribute noise detection

Taghi M. Khoshgoftaar; J. Van Hulse

The quality of data is an important issue in any domain-specific data mining and knowledge discovery initiative. The validity of solutions produced by data-driven algorithms can be diminished if the data being analyzed are of low quality. Data quality is often characterized by the amount of noise present in a given dataset, which can include noisy attributes or labeling errors. Hence, tools for improving the quality of data are important to the data mining analyst. We present a comprehensive empirical investigation of our new and innovative technique for ranking attributes in a given dataset from most to least noisy. Upon identifying the noisy attributes, specific treatments can be applied depending on how the data are to be used. In a classification setting, for example, if the class label is determined to contain the most noise, processes to cleanse this important attribute may be undertaken. Independent variables or predictors that have a low correlation to the class attribute and appear noisy may be eliminated from the analysis. Several case studies using both real-world and synthetic datasets are presented in this study. The noise detection performance is evaluated by injecting noise into multiple attributes at different noise levels. The empirical results demonstrate conclusively that our technique provides a very accurate and useful ranking of noisy attributes in a given dataset.
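
As a hedged illustration of the evaluation protocol described (injecting noise into chosen attributes at chosen levels), here is a toy noise injector; the paper's own attribute-ranking technique is not reproduced, and the function name and corruption scheme are assumptions.

```python
# Toy attribute-noise injector in the spirit of the evaluation above:
# replace a chosen fraction of values in selected columns with random
# draws from each column's observed range.
import numpy as np
from sklearn.datasets import make_classification

def inject_attribute_noise(X, columns, level, rng=np.random.default_rng(0)):
    X = X.copy()
    n = X.shape[0]
    for c in columns:
        hit = rng.random(n) < level        # which rows to corrupt
        lo, hi = X[:, c].min(), X[:, c].max()
        X[hit, c] = rng.uniform(lo, hi, size=hit.sum())
    return X

X, _ = make_classification(n_samples=1000, random_state=0)
X_noisy = inject_attribute_noise(X, columns=[0, 3], level=0.2)
```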


international conference on data mining | 2008

A Comparative Study of Data Sampling and Cost Sensitive Learning

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Two common challenges data mining and machine learning practitioners face in many application domains are unequal classification costs and class imbalance. Most traditional data mining techniques attempt to maximize overall accuracy rather than minimize cost. When data is imbalanced, such techniques result in models that highly favor the overrepresented class, which typically carries a lower cost of misclassification. Two techniques that have been used to address both of these issues are cost sensitive learning and data sampling. In this work, we investigate the performance of two cost sensitive learning techniques and four data sampling techniques for minimizing classification costs when data is imbalanced. We present a comprehensive suite of experiments, utilizing 15 datasets with 10 cost ratios, which have been carefully designed to ensure conclusive, significant and reliable results.
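
A minimal sketch of the two families being compared, assuming an illustrative 10:1 cost ratio: cost-sensitive learning encodes the ratio as class weights on the learner, while data sampling rebalances the training data instead. The dataset, learner, and cost ratio are assumptions, not the paper's setup.

```python
# Cost-sensitive learning vs. data sampling under a 10:1 cost ratio.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=3)
cost_ratio = 10  # minority misclassifications cost 10x more

models = {
    "cost-sensitive": DecisionTreeClassifier(
        class_weight={0: 1, 1: cost_ratio}, random_state=3),
    "oversampled": make_pipeline(
        RandomOverSampler(sampling_strategy=1.0, random_state=3),
        DecisionTreeClassifier(random_state=3)),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC={auc:.3f}")
```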


world congress on computational intelligence | 2008

Software quality modeling: The impact of class noise on the random forest classifier

Andres Folleco; Taghi M. Khoshgoftaar; J. Van Hulse; Lofton A. Bullard

This study investigates the impact of increasing levels of simulated class noise on software quality classification. Class noise was injected into seven software engineering measurement datasets, and the performance of three learners, random forests, C4.5, and Naive Bayes, was analyzed. The random forest classifier was utilized for this study because of its strong performance relative to well-known and commonly-used classifiers such as C4.5 and Naive Bayes. Further, relatively little prior research in software quality classification has considered the random forest classifier. The experimental factors considered in this study were the level of class noise and the percent of minority instances injected with noise. The empirical results demonstrate that the random forest obtained the best and most consistent classification performance in all experiments.
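
The experimental factors named above (class-noise level and the fraction of minority instances corrupted) can be sketched as follows on synthetic data; this illustrates the design only, not the paper's software engineering datasets.

```python
# Flip the labels of a chosen fraction of minority-class instances,
# then measure how the random forest degrades as noise increases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=4)

for noise_level in (0.0, 0.1, 0.3):
    y_noisy = y.copy()
    minority = np.flatnonzero(y == 1)
    flip = rng.choice(minority, size=int(noise_level * len(minority)),
                      replace=False)
    y_noisy[flip] = 0  # inject class noise into minority instances
    auc = cross_val_score(RandomForestClassifier(random_state=4),
                          X, y_noisy, scoring="roc_auc", cv=5).mean()
    print(f"noise={noise_level:.0%}: AUC={auc:.3f}")
```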

Collaboration


Dive into J. Van Hulse's collaborations.

Top Co-Authors

C. Seiffert, Florida Atlantic University
Amri Napolitano, Florida Atlantic University
Andres Folleco, Florida Atlantic University
L. Zhao, Florida Atlantic University
Lofton A. Bullard, Florida Atlantic University
M. Golawala, Florida Atlantic University