Alireza Fazelpour
Florida Atlantic University
Publication
Featured research published by Alireza Fazelpour.
international conference on machine learning and applications | 2012
Taghi M. Khoshgoftaar; David J. Dittman; Randall Wald; Alireza Fazelpour
Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection are known not only to improve computation time, but also to improve experimental results by removing redundant and irrelevant features or genes from consideration in subsequent analysis. Univariate feature selection techniques in particular are well suited to the high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) due to their intuitive output (a ranked list of features or genes) and their relatively small computational cost compared to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. All seven share the use of first order statistical measures such as the mean and standard deviation, although this is the first work to relate them to one another and compare their performance. To examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique creates feature subsets that are diverse compared to those of the other members of the family. However, when we look at classification, we find that, with one exception, the techniques produce good classification results and perform similarly to one another. Our recommendation is to use the rankers Signal-to-Noise and SAM for the best classification results and to avoid Fold Change Ratio, as it is consistently the worst performer of the seven rankers.
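As a rough illustration of how one member of the FOS family operates, here is a minimal Python sketch of a Signal-to-Noise ranker using the commonly cited form |mean1 - mean0| / (std1 + std0); the function names and the handling of zero variance are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def signal_to_noise(X, y):
    """Score each feature of a two-class dataset by Signal-to-Noise:
    |mean_1 - mean_0| / (std_1 + std_0). Higher scores mean the
    class-conditional means are well separated relative to their spread."""
    pos, neg = X[y == 1], X[y == 0]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0)
    return num / np.where(den == 0, np.finfo(float).eps, den)  # guard zero spread

def top_k_features(X, y, k):
    """Indices of the k highest-ranked features (a ranked list, truncated)."""
    return np.argsort(signal_to_noise(X, y))[::-1][:k]

# Example on simulated microarray-like data: 80 samples, 5,000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))
y = rng.integers(0, 2, size=80)
X[y == 1, :10] += 2.0            # make the first ten genes informative
selected = top_k_features(X, y, k=50)
```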
information reuse and integration | 2013
Taghi M. Khoshgoftaar; Alireza Fazelpour; Huanjing Wang; Randall Wald
With the proliferation of high-dimensional datasets across many application domains in recent years, feature selection has become an important data mining task due to its ability to improve both classification performance and computational efficiency. The chosen feature subset matters not only for its ability to improve classification performance, but also because in some domains, knowing the most important features is an end unto itself. In this latter case, one important property of a feature selection method is stability, which refers to the insensitivity (robustness) of the selected features to small changes in the training dataset. In this survey paper, we discuss the problem of stability, its importance, and various stability measures used to evaluate feature subsets. We place special focus on the stability of subset evaluation approaches (whether the subsets are selected through filter-based or wrapper-based subset selection techniques) as opposed to feature ranker stability, since subset evaluation stability leads to challenges that have received less research attention. We also discuss one domain where subset evaluation (and its stability) is particularly important, but which has previously received relatively little attention for subset-based feature selection: Big Data originating from bioinformatics.
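The survey covers a range of stability measures; as one concrete and widely used example (chosen here for illustration, not necessarily the measure the paper emphasizes), the Kuncheva consistency index compares equal-sized feature subsets selected from perturbed versions of the training data, correcting for chance overlap:

```python
import numpy as np

def kuncheva_index(a, b, n_features):
    """Kuncheva consistency index for two feature subsets of equal size k:
    I(A, B) = (r*n - k^2) / (k*(n - k)), where r = |A intersect B|.
    Scores near 1 mean near-identical subsets; near 0, chance-level overlap."""
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k and 0 < k < n_features
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))

def subset_stability(subsets, n_features):
    """Average pairwise consistency over subsets selected from perturbed
    (e.g., bootstrapped) training sets; one way to quantify stability."""
    pairs = [kuncheva_index(subsets[i], subsets[j], n_features)
             for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return float(np.mean(pairs))
```

Note that this index requires equal-sized subsets, which hints at one of the challenges the paper raises: subset evaluation approaches produce subsets of varying size, so measuring their stability is harder than measuring the stability of fixed-size ranked lists.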
information reuse and integration | 2013
Randall Wald; Taghi M. Khoshgoftaar; Alireza Fazelpour; David J. Dittman
Many bioinformatics datasets share certain problems: they have class imbalance (one class with many more instances than the remaining class(es)), or are difficult to learn from (that is, to build accurate models with). Much research has investigated these two problems, or even considered both at once. However, hidden dependencies can exist between them: in a given collection of datasets, the highly imbalanced datasets may be particularly difficult or particularly easy to learn from, so conclusions based on the level of class imbalance may actually reflect the difficulty of learning. We present a case study with twenty-six bioinformatics datasets which exhibits this dependency and highlights how it can lead to misleading conclusions regarding the absolute and relative performance of learners and feature rankers across balance levels.
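A simple way to probe for this dependency in one's own dataset collection (a sketch under stated assumptions, not the paper's protocol; the baseline learner and the correlation test are illustrative choices) is to measure each dataset's imbalance and a baseline learnability score, then test whether the two are correlated:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def imbalance_vs_difficulty(datasets):
    """For each (X, y) pair, record the minority-class fraction and a
    cross-validated baseline AUC. A strong correlation suggests that
    comparisons across balance levels are confounded with difficulty."""
    minority_frac, auc = [], []
    for X, y in datasets:
        counts = np.bincount(y)
        minority_frac.append(counts.min() / counts.sum())
        auc.append(cross_val_score(GaussianNB(), X, y,
                                   cv=5, scoring="roc_auc").mean())
    return spearmanr(minority_frac, auc)  # (rho, p-value)
```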
information reuse and integration | 2014
Taghi M. Khoshgoftaar; Alireza Fazelpour; David J. Dittman; Amri Napolitano
Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle the problem of class imbalance, and the problem of excessive features in the dataset may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques simultaneously to tackle both challenges and build effective classification models. In particular, we ask whether the order of these techniques and the choice of unsampled or sampled data for building classification models make a difference. We conducted an empirical study on seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three data-sampling approaches: data sampling followed by feature selection using the unsampled data (DS-FS-UnSam) and selected features; data sampling followed by feature selection using the sampled data (DS-FS-Sam) and selected features; and feature selection followed by data sampling (FS-DS) using sampled data and selected features. We used Random Undersampling (RUS) to achieve minority:majority class ratios of 35:65 and 50:50. The experimental results show statistically significant differences among the three data-sampling approaches only at the 50:50 class ratio, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the DS-FS-UnSam approach for this class ratio. At the 35:65 class ratio, DS-FS-Sam was most frequently the top-performing approach, and although it was not statistically significantly better than the others, we would generally recommend it for this ratio (specific choices of learner and ranker may vary). Overall, the optimal approach depends on the choice of class ratio.
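The three orderings are easy to mix up; the sketch below spells them out for a 50:50 target using a toy univariate ranker and Naive Bayes (all helper names, the ranker, and the learner are illustrative assumptions, not the paper's code):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def rus(X, y, min_part, maj_part, rng):
    """Random Undersampling: discard majority-class instances until the
    minority:majority ratio is min_part:maj_part (e.g., 35:65 or 50:50)."""
    minority = np.argmin(np.bincount(y))
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    keep = rng.choice(maj_idx, int(len(min_idx) * maj_part / min_part), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def rank(X, y, k):
    """Toy filter: top-k features by absolute class-mean difference."""
    return np.argsort(-np.abs(X[y == 1].mean(0) - X[y == 0].mean(0)))[:k]

def ds_fs_unsam(X, y, k, rng):   # sample, but rank on the UNsampled data
    Xs, ys = rus(X, y, 50, 50, rng)
    f = rank(X, y, k)
    return GaussianNB().fit(Xs[:, f], ys), f

def ds_fs_sam(X, y, k, rng):     # sample, then rank on the sampled data
    Xs, ys = rus(X, y, 50, 50, rng)
    f = rank(Xs, ys, k)
    return GaussianNB().fit(Xs[:, f], ys), f

def fs_ds(X, y, k, rng):         # rank on the original data, then sample
    f = rank(X, y, k)
    Xs, ys = rus(X, y, 50, 50, rng)
    return GaussianNB().fit(Xs[:, f], ys), f
```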
bioinformatics and bioengineering | 2014
Taghi M. Khoshgoftaar; Alireza Fazelpour; David J. Dittman; Amri Napolitano
In the domain of bioinformatics, two common problems encountered when analyzing real-world datasets are class imbalance and high dimensionality. Boosting is a technique that can be used to improve classification performance, even in the presence of class imbalance. In addition, data sampling and feature selection are two important preprocessing techniques used to counter the adverse effects of both challenges collectively. In this study, we examine whether the inclusion of boosting, alongside joint deployment of feature selection and data sampling techniques, affects the classification performance of inductive models. To this end, we used two approaches: filter-based feature selection followed by either data sampling (denoted FS-DS) or a hybrid data sampling and boosting technique entitled RUSBoost (denoted FRB), which integrates random undersampling within the boosting process. We conducted an extensive experimental study using six high-dimensional and imbalanced bioinformatics datasets along with three learners and four feature subset sizes. Our results show that the improvement in classification performance due to boosting depends on the choice of learner used to build the model. We recommend FRB because it outperforms FS-DS in nearly all scenarios. Additionally, our ANOVA analysis shows that FRB is statistically distinguishable from FS-DS when using the LR learner. To our knowledge, this is the first study to investigate the effects of boosting along with combined feature selection and data sampling on the classification performance of inductive models in the domain of bioinformatics.
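A hedged sketch of the FRB pipeline: the paper's own implementation is not available here, so imbalanced-learn's RUSBoostClassifier stands in for the RUSBoost algorithm, and the filter is a toy class-mean-difference ranker:

```python
import numpy as np
from imblearn.ensemble import RUSBoostClassifier  # pip install imbalanced-learn

def frb(X, y, k, n_estimators=50, seed=0):
    """Filter-based feature selection followed by RUSBoost (FRB):
    select the top-k features, then boost with random undersampling
    applied inside each boosting round."""
    scores = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0))  # toy ranker
    feats = np.argsort(scores)[::-1][:k]
    model = RUSBoostClassifier(n_estimators=n_estimators, random_state=seed)
    return model.fit(X[:, feats], y), feats
```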
bioinformatics and bioengineering | 2014
David J. Dittman; Taghi M. Khoshgoftaar; Amri Napolitano; Alireza Fazelpour
Bioinformatics datasets have historically been difficult to work with. However, machine learning offers a potentially effective tool to combat such problems: ensemble learning. Ensemble learning generates a series of models and combines their results to make a single decision. This process has the benefit of harnessing the power of multiple models, but carries the overhead of having to compute those models. Thus, we must ask whether the benefits outweigh the costs. In this study, we seek to determine whether the ensemble learning technique Select-Bagging improves classification results over feature selection on the training dataset followed by classification (denoted FS-Classifier in this work) on a series of balanced bioinformatics datasets. We test the two approaches with two filter-based feature rankers, four feature subset sizes, and the Naïve Bayes classifier. Our results show that Select-Bagging clearly outperforms FS-Classifier in nearly all scenarios. Subsequent statistical analysis shows that Select-Bagging's performance gain over FS-Classifier is statistically significant. Therefore, we can state that the inclusion of Select-Bagging benefits the classification performance of models built on high-dimensional, balanced bioinformatics datasets and should be implemented. To our knowledge, this is the first study to examine the effectiveness of bagging in conjunction with internal feature selection for balanced bioinformatics datasets.
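As a sketch of the Select-Bagging idea (feature selection performed inside each bootstrap iteration rather than once up front), assuming a toy ranker and scikit-learn's Naive Bayes; this illustrates the structure, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def select_bagging(X, y, k, n_bags=10, seed=0):
    """Each bag: bootstrap the training data, run feature selection on
    THAT bootstrap, fit Naive Bayes on the selected features; predictions
    average the members' class probabilities."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_bags):
        idx = rng.choice(len(y), size=len(y), replace=True)   # bootstrap sample
        Xb, yb = X[idx], y[idx]
        scores = np.abs(Xb[yb == 1].mean(0) - Xb[yb == 0].mean(0))  # toy ranker
        feats = np.argsort(scores)[::-1][:k]
        members.append((GaussianNB().fit(Xb[:, feats], yb), feats))

    def predict_proba(X_new):
        return np.mean([m.predict_proba(X_new[:, f]) for m, f in members], axis=0)
    return predict_proba
```

The contrast with FS-Classifier is that the feature subset is re-selected per bag, so the ensemble also captures the variability of the selection step itself.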
information reuse and integration | 2015
Taghi M. Khoshgoftaar; Alireza Fazelpour; David J. Dittman; Amri Napolitano
Class imbalance is a significant challenge that practitioners in the field of bioinformatics face on a daily basis. It occurs when the number of instances of one class is much greater than the number of instances of the other class(es), and it has adverse effects on the performance of classification models built on such skewed data. Random Forest, a robust classifier, has been utilized effectively to deal with the challenging characteristics of imbalanced bioinformatics datasets. In this study, we seek to answer the question: do alterations to the bootstrapping process within Random Forest improve its classification performance? We performed an experimental study using Random Forest with four bootstrapping approaches, including two novel approaches, across 15 imbalanced bioinformatics datasets. Our results demonstrate that two of the bootstrapping approaches, including one of our proposed approaches, outperform the others; however, the difference is not statistically significant. We conclude that Random Forest is a robust classifier, able to handle the challenge of class imbalance, and can be slightly improved by altering the bootstrapping process. To the best of our knowledge, no previous work has studied the effects of multiple bootstrapping processes on the performance of Random Forest in the domain of bioinformatics. In addition, we proposed and implemented the two innovative bootstrapping approaches evaluated in this paper.
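The abstract does not specify the two novel bootstrapping approaches, so the sketch below shows a generic balanced-bootstrap alteration (each tree trains on a 50:50 bootstrap) purely to illustrate where, inside a Random-Forest-style ensemble, the bootstrapping step can be modified:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap_forest(X, y, n_trees=100, seed=0):
    """Random-Forest-style ensemble with an altered bootstrap: each tree
    draws, with replacement, an equal number of minority and majority
    instances, so every tree sees a 50:50 class distribution."""
    rng = np.random.default_rng(seed)
    minority = np.argmin(np.bincount(y))
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    trees = []
    for _ in range(n_trees):
        idx = np.concatenate([rng.choice(min_idx, len(min_idx), replace=True),
                              rng.choice(maj_idx, len(min_idx), replace=True)])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))

    def predict_proba(X_new):
        return np.mean([t.predict_proba(X_new) for t in trees], axis=0)
    return predict_proba
```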
information reuse and integration | 2015
Brian Heredia; Taghi M. Khoshgoftaar; Alireza Fazelpour; David J. Dittman
Choosing an appropriate cancer treatment is potentially the most important task in the treatment of a cancer patient. If it were possible to identify the best option for a patient (or at minimum to remove options that will not help the patient), then the general prognosis of the patient improves. However, this task is made far more difficult by characteristics such as the high dimensionality found in many gene expression datasets. In this study, we seek to identify the classifiers and feature selection techniques best suited for predicting a breast cancer patient's response to a cancer treatment. To determine this, we collected a group of five high-dimensional breast cancer patient response datasets and used four classifiers and three feature selection techniques along with four feature subset sizes. Our results show that the 5-Nearest Neighbor classifier and the Signal-to-Noise feature selection technique are most frequently the top performers, and statistical analysis confirms this finding. Thus, we recommend the use of 5-Nearest Neighbor and Signal-to-Noise for breast cancer patient response data. To our knowledge, this is the first study that focuses on the classification process for patient response data in breast cancer.
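The recommended combination is straightforward to assemble; here is a hedged scikit-learn sketch (the subset size k=25 and the zero-variance guard are illustrative assumptions, since the abstract does not list the subset sizes used):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def snr_scores(X, y):
    """Signal-to-Noise per feature: |mean_1 - mean_0| / (std_1 + std_0)."""
    pos, neg = X[y == 1], X[y == 0]
    den = pos.std(axis=0) + neg.std(axis=0)
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / np.where(den == 0, 1e-12, den)

# Signal-to-Noise ranking feeding a 5-Nearest Neighbor classifier.
model = make_pipeline(SelectKBest(score_func=snr_scores, k=25),
                      KNeighborsClassifier(n_neighbors=5))
# Usage: model.fit(X_train, y_train); model.predict(X_test)
```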
information reuse and integration | 2013
Randall Wald; Taghi M. Khoshgoftaar; Alireza Fazelpour
A major challenge facing data-mining practitioners in the field of bioinformatics is class imbalance, which occurs when instances of one class (called the majority class) vastly outnumber instances of the other (minority) classes. This can result in models biased towards the majority class (minority-class instances predicted as belonging to the majority class). Data sampling, a process which changes the dataset by removing or adding instances to improve the class balance, can be used to improve the performance of such models on imbalanced data. However, it is not clear what target balance level should be used with data sampling, or what influence class imbalance alone has on classification performance (compared to other issues such as the difficulty of learning from the data and dataset size). To resolve this, we propose the Balance-Aware Subsampling technique, which allows researchers to directly compare different balance levels of a dataset while keeping all other factors (such as dataset size and the actual dataset in question) constant. Thus, any changes in performance can be attributed solely to the chosen balance level. We demonstrate this technique using six datasets from the field of bioinformatics, and we also consider three different subsample sizes (that is, the size of the dataset used to build a model) so we can observe the effect of this parameter on classification performance. Our results show that within each level of class imbalance, the average AUC value increases as the subsample size increases. The key exception is the 20:80 (minority:majority) balance level, for which the average AUC value decreases as the subsample size increases from 80 to 120. We also find that within each subsample size, the average AUC value increases as the minority distribution increases, although this does not completely hold for subsample size 40 (in which case the Naïve Bayes and Random Forest learners perform better at the 35:65 balance level than at 50:50), and in general there is no significant improvement between the 35:65 and 50:50 balance levels. Overall, by using Balance-Aware Subsampling, we are able to directly observe how class imbalance affects performance, isolated from all other factors.
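A minimal sketch of the core mechanism, assuming a binary dataset with integer labels; the paper's full protocol (how many subsamples are drawn, how AUC values are aggregated) is not reproduced here:

```python
import numpy as np

def balance_aware_subsample(X, y, n_total, min_part, maj_part, rng):
    """Draw a subsample of fixed size n_total at an exact minority:majority
    ratio (e.g., 20:80, 35:65, 50:50). Holding n_total constant while
    varying the ratio isolates balance level from dataset size."""
    n_min = round(n_total * min_part / (min_part + maj_part))
    n_maj = n_total - n_min
    minority = np.argmin(np.bincount(y))
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    idx = np.concatenate([rng.choice(min_idx, n_min, replace=False),
                          rng.choice(maj_idx, n_maj, replace=False)])
    return X[idx], y[idx]

# Example: same subsample size (120), two balance levels to compare.
# rng = np.random.default_rng(0)
# X35, y35 = balance_aware_subsample(X, y, 120, 35, 65, rng)
# X50, y50 = balance_aware_subsample(X, y, 120, 50, 50, rng)
```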
information reuse and integration | 2016
Alireza Fazelpour; Taghi M. Khoshgoftaar; David J. Dittman; Amri Napolitano
Bagging ensemble techniques have been utilized effectively by practitioners in the field of bioinformatics to alleviate the problem of class imbalance and to improve the performance of classification models. However, many previous works have used bagging with only a single, arbitrary number of iterations. In this study, we ask: what is the impact of altering the number of iterations (the ensemble size) on the classification performance of bagging classifiers? To answer this question, we conducted an empirical study using four choices for the number of iterations (10, 20, 50, and 100) within the bagging algorithm, across 15 imbalanced bioinformatics datasets. Our results indicate that 50 iterations performs slightly better than all other choices without exception, but the difference in performance is not statistically significant. Thus, we recommend bagging with 10 iterations because it achieves quality classification results, additional iterations do not significantly improve performance, and a smaller number of iterations is computationally less costly. The unique contribution of this work is to examine the effects of the number of iterations on the classification performance of bagging classifiers in the context of imbalanced datasets in the bioinformatics field.
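Reproducing the comparison with off-the-shelf tools is simple; the sketch below uses scikit-learn's BaggingClassifier with its default decision-tree base learner (the abstract does not name the paper's learners, so that choice is an assumption):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# X, y: an imbalanced, high-dimensional dataset loaded elsewhere.
for n in (10, 20, 50, 100):                 # iteration counts from the study
    bag = BaggingClassifier(n_estimators=n, random_state=0)
    auc = cross_val_score(bag, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{n:>3} iterations: mean AUC = {auc:.3f}")
```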