Ahmad Abu Shanab
Florida Atlantic University
Publications
Featured research published by Ahmad Abu Shanab.
Information Reuse and Integration | 2012
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Amri Napolitano
Feature selection is an important preprocessing step when learning from bioinformatics datasets. Since these datasets often have high dimensionality (a large number of features), selecting the most important ones both improves performance and reduces computation time. In addition, when the features in question are genes (as is the case for microarray datasets), knowing the important genes is useful on its own. Although many studies have examined feature selection in the context of classification performance, few analyze techniques in terms of stability: the ability of a technique to produce the same results (list of genes) regardless of changes to the dataset. In this study, we test the stability of eighteen feature rankers across four high-dimensional cancer-gene datasets. Because these datasets are also imbalanced (fewer positive-class instances than negative-class instances), we employ six versions of data sampling (three techniques with two class ratios). Finally, we inject artificial class noise to better evaluate how the rankers perform on realistic datasets, which can be prone to noise. The results demonstrate that, among the rankers, the PRC- and Deviance-based Threshold-Based Feature Selection techniques, along with Signal-to-Noise, show the best stability on average. The results also demonstrate that, among the sampling techniques investigated, Random Oversampling and the Synthetic Minority Oversampling Technique (both set to a 50:50 class ratio) show the best performance on average.
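To make the stability measure concrete, here is a minimal sketch (not the authors' code) of the protocol: rank features on several noise-injected copies of a dataset and compare the resulting top-k gene lists. scikit-learn's f_classif stands in for one of the eighteen rankers, and Jaccard overlap of top-k lists stands in for the paper's consistency measure; both are assumptions.

```python
# Minimal sketch of ranker-stability measurement: rank features on noisy
# copies of a dataset, then compare the resulting top-k lists. f_classif and
# Jaccard overlap are illustrative stand-ins, not the paper's exact setup.
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def top_k_features(X, y, k):
    """Rank features by ANOVA F-score and return the indices of the top k."""
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])


def ranker_stability(X, y, k=50, n_runs=10, noise_rate=0.1, seed=0):
    """Average pairwise Jaccard overlap of top-k lists across noisy copies."""
    rng = np.random.default_rng(seed)
    lists = []
    for _ in range(n_runs):
        y_noisy = y.copy()
        flip = rng.random(len(y)) < noise_rate  # inject artificial class noise
        y_noisy[flip] = 1 - y_noisy[flip]
        lists.append(top_k_features(X, y_noisy, k))
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(lists, 2)]
    return float(np.mean(overlaps))


# Imbalanced, high-dimensional synthetic data in the spirit of the study.
X, y = make_classification(n_samples=100, n_features=1000, n_informative=20,
                           weights=[0.8, 0.2], random_state=0)
print(f"stability: {ranker_stability(X, y):.3f}")
```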
Information Reuse and Integration | 2011
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse
Two of the most challenging problems in data mining are working with imbalanced datasets and with datasets that have a large number of attributes. In this study we compare three different approaches for handling both class imbalance and high dimensionality simultaneously. The first approach consists of sampling followed by feature selection, with the training data being built using the selected features and the original (unsampled) data. The second approach is similar, except that it uses the sampled data (and selected features) to build the training data. In the third approach, feature selection takes place before sampling, and the training data is based on the sampled data. To compare these three approaches, we use seven groups of datasets covering different application domains, employ nine feature rankers from three different families, and generate artificial class noise to better simulate real-world datasets. The results differ from those of an earlier work and show that the first and third approaches perform, on average, better than the second approach.
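The three orderings can be sketched as follows, assuming scikit-learn's SelectKBest as the feature ranker and imbalanced-learn's RandomOverSampler as the sampler (illustrative choices, not the nine rankers or the samplers actually evaluated):

```python
# Sketch of the three orderings of sampling and feature selection compared
# in the paper. The sampler and ranker are illustrative stand-ins.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
sampler = RandomOverSampler(random_state=0)
X_s, y_s = sampler.fit_resample(X, y)

# Features chosen from the sampled data (used by approaches 1 and 2).
feats_sampled = SelectKBest(f_classif, k=20).fit(X_s, y_s).get_support()
# Features chosen from the original data (used by approach 3).
feats_original = SelectKBest(f_classif, k=20).fit(X, y).get_support()

# Approach 1: selected features from sampled data + ORIGINAL instances.
X_train1, y_train1 = X[:, feats_sampled], y
# Approach 2: selected features from sampled data + SAMPLED instances.
X_train2, y_train2 = X_s[:, feats_sampled], y_s
# Approach 3: feature selection first, then sampling on the reduced data.
X_train3, y_train3 = sampler.fit_resample(X[:, feats_original], y)
```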
International Journal of Business Intelligence and Data Mining | 2012
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse
Two problems often encountered in machine learning are class imbalance and high dimensionality. In this paper we compare three different approaches for addressing both problems simultaneously, by applying both data sampling and feature selection. With the first two approaches, sampling is followed by feature selection. In the first approach, the features are selected based on the sampled data, and then the unsampled data is used with just the selected features. The second approach is similar, but the sampled data is used. Finally, with the third approach, feature selection is performed prior to sampling. To compare the approaches, we use seven datasets from different domains, employ nine feature rankers from three different families, apply three sampling techniques, and inject class noise to better simulate real-world datasets. The results show that the second and third approaches are both very good, with the third approach showing a slight (but not statistically significant) lead.
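The third approach (feature selection before sampling) fits naturally into a single pipeline; imbalanced-learn's Pipeline accepts samplers as steps, so sampling is applied only when the model is fit and never to evaluation folds. SMOTE, SelectKBest, and naive Bayes below are illustrative stand-ins for the techniques in the study:

```python
# Sketch of approach three packaged as one leakage-safe pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),  # feature selection first
    ("smote", SMOTE(random_state=0)),          # then sampling (fit-time only)
    ("clf", GaussianNB()),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```

Packaging the steps this way is one reasonable design choice for such comparisons, since it guarantees the sampler never touches the held-out data.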
International Conference on Machine Learning and Applications | 2011
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald
High dimensionality is one of the major problems in data mining, occurring when there is a large abundance of attributes. One common technique used to alleviate high dimensionality is feature selection, the process of selecting the most relevant attributes and removing irrelevant and redundant ones. Much research has been done toward evaluating the performance of classifiers before and after feature selection, but little work has been done examining how sensitive the selected feature subsets are to changes (additions/deletions) in the dataset. In this study we evaluate the robustness of six commonly used feature selection techniques, investigating the impact of data sampling and class noise on the stability of feature selection. All experiments are carried out with six commonly used feature rankers on four groups of datasets from the biology domain. We employ three sampling techniques, and generate artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, Gain Ratio shows the least stability on average. Additional tests using our feature rankers for building classification models also show that a feature ranker's stability is not an indicator of its performance in classification.
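For reference, Gain Ratio normalizes information gain by the intrinsic entropy of the split. Below is a from-scratch sketch for discrete-valued features; the study presumably used an existing toolkit implementation, which would also handle discretization of numeric features:

```python
# From-scratch sketch of the Gain Ratio measure for one discrete feature.
# Discretization of numeric features is omitted here.
import numpy as np


def entropy(labels):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalized by split info."""
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond_entropy = sum(w * entropy(labels[feature == v])
                       for v, w in zip(values, weights))
    info_gain = entropy(labels) - cond_entropy
    split_info = entropy(feature)  # intrinsic information of the split
    return info_gain / split_info if split_info > 0 else 0.0


f = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 0, 1, 1, 1])
print(f"gain ratio: {gain_ratio(f, y):.3f}")
```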
Bioinformatics and Biomedicine | 2012
Randall Wald; Taghi M. Khoshgoftaar; Ahmad Abu Shanab
Many biological datasets exhibit high dimensionality, a large abundance of attributes (genes) per instance (sample). This problem is often solved using feature selection, which works by selecting the most relevant attributes and removing irrelevant and redundant attributes. Although feature selection techniques are often evaluated based on the performance of classification models (e.g., algorithms designed to distinguish between multiple classes of instances, such as cancerous vs. noncancerous) built using the selected features, another important criterion which is often neglected is stability, the degree of agreement among a feature selection technique's outputs when there are changes to the dataset. More stable feature selection techniques will give the same features even if aspects of the data change. In this study we consider two different approaches for evaluating the stability of feature selection techniques, with each approach consisting of noise injection followed by feature ranking. The two approaches differ in that the first approach compares the features selected from the noisy datasets with the features selected from the original (clean) dataset, while the second approach performs pairwise comparisons among the results from the noisy datasets. To evaluate these two approaches, we use four biological datasets and employ six commonly-used feature rankers. We draw two primary conclusions from our experiments: First, the rankers show different levels of stability in the face of noise. In particular, the ReliefF ranker has significantly greater stability than the other rankers. Also, we found that both approaches gave the same results in terms of stability patterns, although the first approach had greater stability overall. Additionally, because the first approach is significantly less computationally expensive, future studies may employ this faster approach to obtain the same results.
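The two protocols reduce to different comparison sets over the same ranked lists. A minimal sketch, assuming Jaccard overlap of top-k feature sets as the similarity measure (the paper's exact measure may differ):

```python
# Sketch of the two stability-evaluation protocols: compare each noisy
# ranking to the clean one (approach 1), or compare noisy rankings pairwise
# (approach 2). Jaccard overlap of top-k sets is an illustrative measure.
from itertools import combinations

import numpy as np


def jaccard(a, b):
    return len(a & b) / len(a | b)


def clean_vs_noisy(clean_topk, noisy_topks):
    """Approach 1: average similarity between the clean list and each noisy list."""
    return float(np.mean([jaccard(clean_topk, s) for s in noisy_topks]))


def pairwise_noisy(noisy_topks):
    """Approach 2: average pairwise similarity among the noisy lists only."""
    return float(np.mean([jaccard(a, b)
                          for a, b in combinations(noisy_topks, 2)]))


clean = {0, 1, 2, 3, 4}
noisy = [{0, 1, 2, 3, 9}, {0, 1, 2, 8, 9}, {0, 1, 2, 3, 7}]
print(clean_vs_noisy(clean, noisy), pairwise_noisy(noisy))
```

Note the cost asymmetry: for n noisy datasets, the first protocol needs only n comparisons against the clean list, while the second needs n(n-1)/2 pairwise comparisons, which matches the computational advantage the authors report.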
Information Reuse and Integration | 2014
Randall Wald; Taghi M. Khoshgoftaar; Ahmad Abu Shanab
Many bioinformatics datasets suffer from noise, making it difficult to build reliable models. These datasets can also exhibit class imbalance (many more examples of the negative class than the positive class), which will also affect classification performance. It is not known how these two problems intersect: no previous study has considered to what extent the noise level (total quantity of noise) and noise distribution (amount of noise in each class) affect performance when considered at the same time. To explore this question, we injected artificial class noise into twelve clean bioinformatics datasets with varying levels of class imbalance (all of which were relatively easy to learn from), varying both the level and distribution of the noise. We discovered that when the number of noisy instances is less than or equal to 40% of the total number of minority-class instances, the resulting noisy datasets (regardless of which classes suffered from noise injection) are nearly as easy to build models from as the original, clean data. However, with greater levels of noise injection, the distribution does matter, and in particular it matters in proportion to the imbalance of the original (clean) dataset. If the original dataset was mostly balanced, injecting noise into the minority class will not have much more effect than injecting into the majority class, but for highly imbalanced datasets, injecting into the minority class will give results much worse than those from injecting into the majority class.
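A sketch of this noise-injection scheme as described above: the total number of flipped labels (noise level) is a fraction of the minority-class count, and a distribution parameter sets what share of the flips lands in each class. The exact parameterization in the paper may differ from this sketch:

```python
# Sketch of controlled class-noise injection with a separate noise level
# (fraction of the minority-class count) and noise distribution (share of
# flips hitting the minority class). Parameter names are illustrative.
import numpy as np


def inject_class_noise(y, level=0.4, minority_share=0.5, seed=0):
    """Flip `level * n_minority` labels; `minority_share` of flips hit the minority class."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    minority = int(np.argmin(np.bincount(y)))
    n_noisy = int(level * (y == minority).sum())
    n_min = int(round(minority_share * n_noisy))  # flips in the minority class
    n_maj = n_noisy - n_min                       # flips in the majority class
    min_idx = rng.choice(np.flatnonzero(y == minority), n_min, replace=False)
    maj_idx = rng.choice(np.flatnonzero(y != minority), n_maj, replace=False)
    y[np.concatenate([min_idx, maj_idx])] ^= 1    # flip binary labels
    return y


y = np.array([0] * 90 + [1] * 10)
# All of the noise placed in the minority class, at the 40% threshold.
print(np.bincount(inject_class_noise(y, level=0.4, minority_share=1.0)))
```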
Information Reuse and Integration | 2014
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Amri Napolitano
One of the main characteristics of bioinformatics datasets is noise. Noise refers to incorrect or missing values in a dataset and has a detrimental effect on classification. In this study we evaluate the robustness of six classification algorithms and ten filter-based feature selection techniques, specifically studying how the different techniques are impacted by particularly challenging datasets in order to find the techniques that are least sensitive to class noise. To investigate the robustness of the classification algorithms and feature selection techniques, we injected artificial noise (varying both the noise level and the noise distribution) into 12 relatively noise-free bioinformatics datasets, creating a spectrum of noisy datasets with three levels of learning difficulty (Easy, Moderate, and Hard). We then used ten feature rankers from three different families, along with six classification techniques, to build predictive models. We found that the Random Forest 100 learner is the least sensitive to class noise, and thus is a good candidate for classification across all rankers and learning difficulty levels. Logistic Regression, on the other hand, gave the worst performance across all rankers and learning difficulty levels. Additionally, we found that some rankers were successful at ameliorating the difficulty of hard datasets for all learners other than Logistic Regression, meaning that with these rankers the datasets act as though they are less noisy.
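The robustness comparison can be sketched as follows: train several learners on increasingly noisy labels and track cross-validated AUC. The two learners and the uniform noise scheme below are illustrative; the study's full design also varies the feature rankers and the noise distribution:

```python
# Sketch of classifier robustness to class noise: measure how cross-validated
# AUC degrades as more labels are flipped. Learners are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
learners = {
    "RF100": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
for level in (0.0, 0.1, 0.2, 0.3):
    y_noisy = y.copy()
    flip = rng.random(len(y)) < level  # uniform class-noise injection
    y_noisy[flip] ^= 1
    for name, clf in learners.items():
        auc = cross_val_score(clf, X, y_noisy, cv=5, scoring="roc_auc").mean()
        print(f"noise={level:.1f} {name}: AUC={auc:.3f}")
```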
Information Reuse and Integration | 2015
Alireza Fazelpour; Taghi M. Khoshgoftaar; David J. Dittman; Ahmad Abu Shanab
Noise, which refers to erroneous or missing data, is a prominent challenge found in many bioinformatics datasets. The presence of noise in gene expression datasets has adverse effects on machine-learning techniques, such as supervised classification algorithms and feature selection techniques. Additionally, identifying and quantifying noise are challenging tasks, and a proper mechanism for managing noise is required to improve the performance of classifiers and feature selection methods. In this study, our motivation is to investigate the effects of class noise on the classification performance of various learners using multiple derived datasets with varying degrees of data quality and class imbalance. Class imbalance is another challenging characteristic that occurs when one class has many more instances than the other class(es). To this end, we conducted experiments using a filter-based subset selection method applied to multiple derived datasets, generated by injecting artificial class noise in a controlled manner to create three levels of data quality: High-Quality, Average-Quality, and Low-Quality. Our results, along with statistical analysis, show that Random Forest outperforms the other learners without exception across all levels of balance and data quality. Therefore, we recommend Random Forest as a noise-tolerant and robust classifier when dealing with bioinformatics datasets of varying quality.
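The abstract does not name the specific filter-based subset selection method; a CFS-style greedy merit search is a common choice in this line of work, so the sketch below uses that as an assumption, favoring features correlated with the class but uncorrelated with each other:

```python
# Sketch of a CFS-style filter-based subset selection: greedily grow the
# subset that maximizes a merit score rewarding feature-class correlation
# and penalizing feature-feature correlation. An illustrative stand-in for
# the (unnamed) subset method used in the study.
import numpy as np
from sklearn.datasets import make_classification


def merit(X, y, subset):
    """CFS merit: k*rcf / sqrt(k + k*(k-1)*rff) over Pearson correlations."""
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)


def cfs_forward(X, y, max_features=10):
    """Greedy forward search over the merit score."""
    selected, remaining = [], list(range(X.shape[1]))
    best = -np.inf
    while remaining and len(selected) < max_features:
        scores = {j: merit(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # stop when no candidate improves the merit
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected


X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
print(cfs_forward(X, y))
```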
The Florida AI Research Society | 2012
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald
Bioinformatics and Bioengineering | 2014
Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald