Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David J. Dittman is active.

Publication


Featured research published by David J. Dittman.


Information Reuse and Integration | 2012

A review of the stability of feature selection techniques for bioinformatics data

Wael Awada; Taghi M. Khoshgoftaar; David J. Dittman; Randall Wald; Amri Napolitano

Feature selection is an important step in data mining and is used in various domains including genetics, medicine, and bioinformatics. Choosing the important features (genes) is essential for the discovery of new knowledge hidden within the genetic code as well as the identification of important biomarkers. Although feature selection methods can help sort through large numbers of genes based on their relevance to the problem at hand, the results generated tend to be unstable and thus cannot be reproduced in other experiments. Consequently, research interest in the stability of feature ranking methods has grown: researchers have produced experimental designs for testing the stability of feature selection, new metrics for measuring stability, and new techniques designed to improve the stability of the feature selection process. In this paper, we introduce the role of stability in feature selection with DNA microarray data. We list various ways of improving feature ranking stability, and discuss feature selection techniques, specifically explaining ensemble feature ranking and presenting various ensemble feature ranking aggregation methods. Finally, we discuss experimental procedures such as dataset perturbation, fixed overlap partitioning, and cross-validation procedures that help researchers analyze and measure the stability of feature ranking methods. Throughout this work, we investigate current research in the field and discuss possible avenues for continuing such research efforts.
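One of the stability metrics this kind of survey covers can be illustrated with a short sketch: mean pairwise Jaccard similarity between the top-k gene subsets selected on perturbed versions of one dataset. The gene names and the choice of Jaccard as the metric below are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stability(subsets):
    """Mean pairwise Jaccard similarity across the selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Top-3 gene subsets chosen on three perturbed versions of one dataset
runs = [["BRCA1", "TP53", "EGFR"],
        ["BRCA1", "TP53", "MYC"],
        ["BRCA1", "EGFR", "MYC"]]
print(stability(runs))
```

A perfectly stable ranker would score 1.0; here each pair of runs shares two of three genes, so the score is 0.5.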


International Conference on Machine Learning and Applications | 2010

Comparative Analysis of DNA Microarray Data through the Use of Feature Selection Techniques

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse

One of today’s most important scientific research topics is discovering the genetic links between cancers. This paper presents the results of a comparison of three different cancers (breast, colon, and lung) based on the results of feature selection techniques applied to a dataset created from DNA microarray data consisting of samples from all three cancers. The data was run through a set of eighteen feature rankers which ordered the genes by importance with respect to a targeted cancer. This process was repeated three times, each time with a different target cancer. The rankings were then compared, keeping each feature ranker static while varying the cancers being compared. The cancers were evaluated both in pairs and all together for matching genes. The results of the comparison show a large correlation between the two known hereditary cancers, breast and colon, and little correlation between lung cancer and the other cancers. This is the first study to apply eighteen different feature rankers in a bioinformatics case study, eleven of which were recently proposed and implemented by our research team.


Bioinformatics and Biomedicine | 2011

Random forest: A reliable tool for patient response prediction

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Amri Napolitano

The goal of classification is to reliably identify instances that are members of the class of interest. This is especially important for predicting patient response to drugs. However, with high-dimensional datasets, classification is both complicated and enhanced by the feature selection process. When designing a classification experiment, there are a number of decisions which need to be made in order to maximize performance. These decisions are especially difficult for researchers in fields where data mining is not the focus, such as patient response prediction. It would be easier for such researchers if these decisions were either made for them or reduced in scope, by using a learner which minimizes their impact. We propose that Random Forest, a popular ensemble learner, can serve this role. We performed an experiment involving nineteen different feature selection rankers (eleven of which were proposed and implemented by our research team) to thoroughly test both the Random Forest learner and five other learners. Our research shows that, as long as a large enough number of features is used, the results of using Random Forest are favorable regardless of the choice of feature selection strategy, showing that Random Forest is a suitable choice for patient response prediction researchers who do not wish to choose from among a myriad of feature selection approaches.
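The Random Forest idea the paper relies on (bootstrap-sampled trees, each trained on a random feature subset, voting by majority) can be sketched in miniature. For brevity the "trees" below are single-split stumps, and the dataset, tree count, and split search are illustrative assumptions rather than the paper's actual experimental setup.

```python
import random

def fit_stump(X, y, feats):
    """Exhaustively pick the (feature, threshold, polarity) split with the
    best training accuracy over the candidate features."""
    best = (-1.0, feats[0], 0.0, 1)
    for f in feats:
        for t in sorted({row[f] for row in X}):
            for pol in (1, 0):  # which side of the split predicts class 1
                pred = [pol if row[f] > t else 1 - pol for row in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if acc > best[0]:
                    best = (acc, f, t, pol)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(d), max(1, int(d ** 0.5)))  # random feature subset
        stumps.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return stumps

def predict(stumps, row):
    """Majority vote over the ensemble's per-stump predictions."""
    votes = sum(pol if row[f] > t else 1 - pol for f, t, pol in stumps)
    return int(2 * votes >= len(stumps))

# Toy "expression" data: class 0 has low feature 0 / high feature 1, class 1 the reverse
X = [[0, 9], [1, 8], [2, 7], [8, 1], [9, 2], [10, 3]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [-1, 12]), predict(forest, [12, -1]))
```

The point the abstract makes carries over even to this toy: because each member trains on a different sample and feature subset, the majority vote is robust to any single bad split.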


International Conference on Machine Learning and Applications | 2012

First Order Statistics Based Feature Selection: A Diverse and Powerful Family of Feature Selection Techniques

Taghi M. Khoshgoftaar; David J. Dittman; Randall Wald; Alireza Fazelpour

Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection have been known not only to improve computation time, but to improve the results of experiments by removing the redundant and irrelevant features or genes from consideration in subsequent analysis. Univariate feature selection techniques in particular are well suited for the extremely high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) due to their intuitive output (a ranked list of features or genes) and their relatively small computational cost compared to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. All seven share the trait of using first order statistical measures such as mean and standard deviation, although this is the first work to relate them to one another and compare their performance with one another. In order to examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique will create diverse feature subsets when compared to the other members of the family. However, when we look at classification, we find that, with one exception, the techniques produce good classification results and have similar performance to each other. Our recommendation is to use the rankers Signal-to-Noise and SAM for the best classification results and to avoid Fold Change Ratio, as it is consistently the worst performer of the seven rankers.
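Signal-to-Noise, one of the first-order-statistics rankers recommended above, reduces to a few lines: the absolute difference of class means divided by the sum of class standard deviations. The toy expression data and gene names below are illustrative only.

```python
from statistics import mean, pstdev

def signal_to_noise(values, labels):
    """|difference of class means| / (sum of class standard deviations)."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))

def rank_genes(data, labels):
    """data maps gene -> expression values; returns genes best-first by S2N."""
    return sorted(data, key=lambda g: signal_to_noise(data[g], labels), reverse=True)

# Toy expression values for two genes over four samples (class labels 0,0,1,1)
data = {"g1": [1, 2, 5, 6], "g2": [3, 1, 4, 2]}
labels = [0, 0, 1, 1]
print(rank_genes(data, labels))
```

Because the score uses only means and standard deviations, it is computed in a single pass per gene, which is why this family scales so well to microarray dimensionality.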


Bioinformatics and Biomedicine | 2011

Stability Analysis of Feature Ranking Techniques on Biological Datasets

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Huanjing Wang

One major problem faced when analyzing DNA microarrays is their high dimensionality (large number of features). Therefore, feature selection is a necessary step when using these datasets. However, the addition or removal of instances can alter the subsets chosen by a feature selection technique. The ideal situation is to choose a feature selection technique that is robust (stable) to changes in the number of instances, with selected features changing little even when instances are added or removed. In this study we test the stability of nineteen feature selection techniques across twenty-six datasets with varying levels of class imbalance. Our results show that the best choice of technique depends on the class balance of the datasets. The top performers are Deviance for balanced datasets, Signal-to-Noise for slightly imbalanced datasets, and AUC for imbalanced datasets. SVM-RFE was the least stable feature selection technique across the board, while other poor performers include Gain Ratio, Gini Index, Probability Ratio, and Power. We also found that enough changes to the dataset can make any feature selection technique unstable, and that using more features increases the stability of most feature selection techniques. Most intriguing was our finding that the more imbalanced a dataset is, the more stable the feature subsets built for that dataset will be. Overall, we conclude that stability is an important aspect of feature ranking which must be taken into account when planning a feature selection strategy or when adding or removing instances from a dataset.


Information Reuse and Integration | 2012

An extensive comparison of feature ranking aggregation techniques in bioinformatics

Randall Wald; Taghi M. Khoshgoftaar; David J. Dittman; Wael Awada; Amri Napolitano

Univariate feature rankers have been frequently used to order genes (features) in terms of their importance to a given bioinformatics challenge. Unfortunately, the resulting feature subsets tend to differ when applied to related (but distinct) datasets, or when applied to datasets which have been varied or corrupted in some fashion. As a result, a research focus has recently been on methods to measure or improve the stability of these feature subsets. One such method is called rank aggregation. Rank aggregation is the process of combining the information from several ranked lists (or in this case ordered gene lists) into a single more stable list. While there has been work on the creation of these methods, very little work has gone into comparing the lists generated by these techniques. Such a comparison allows for grouping the techniques into families, both for understanding how the families affect rank aggregation and for using less-computationally-expensive members of a given family. This paper is an extensive study on nine rank aggregation techniques across twenty-six bioinformatics datasets. Our results show that certain aggregation techniques are very similar to each other, while others are quite unique in that they are not similar to the other techniques. Additionally, it was found that as the size of the feature subset increases, the similarity between the techniques increases. To our knowledge this is the first study which examines this many rank aggregation techniques within the domain of bioinformatics.


Archive | 2011

Feature Selection Algorithms for Mining High Dimensional DNA Microarray Data

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse

The World Health Organization identified cancer as the second largest contributor to death worldwide, surpassed only by cardiovascular disease. The death count for cancer in 2002 was 7.1 million and is expected to rise to 11.5 million annually by 2030 [17]. In 2009, the International Conference on Machine Learning and Applications, or ICMLA, proposed a challenge regarding gene expression profiles in human cancers. The goal of the challenge was the “identification of functional clusters of genes from gene expression profiles in three major cancers: breast, colon and lung.” The identification of these clusters may further our understanding of cancer and open up new avenues of research.


International Conference on Machine Learning and Applications | 2013

Simplifying the Utilization of Machine Learning Techniques for Bioinformatics

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Amri Napolitano

The domain of bioinformatics has a number of challenges, such as handling datasets which exhibit extreme levels of high dimensionality (large number of features per sample) and datasets which are particularly difficult to work with. These datasets contain many pieces of data (features) which are irrelevant and redundant to the problem being studied, which makes analysis quite difficult. However, techniques from the domain of machine learning and data mining are well suited to combating these difficulties. Techniques like feature selection (choosing an optimal subset of features for subsequent analysis by removing irrelevant or redundant features) and classifiers (used to build inductive models in order to classify unknown instances) can assist researchers in working with such difficult datasets. Unfortunately, many practitioners of bioinformatics do not have the machine learning knowledge to choose the correct techniques in order to achieve good classification results. If the choices could be simplified or predetermined, then it would be easier to apply the techniques. This study is a comprehensive analysis of machine learning techniques on twenty-five bioinformatics datasets using six classifiers and twenty-four feature rankers. We analyzed the factors at each of four feature subset sizes, chosen for being large enough to be effective in creating inductive models but small enough to be of use for further research. Our results show that Random Forest with 100 trees is the top-performing classifier and that the choice of feature ranker is of little importance as long as feature selection occurs. Statistical analysis confirms our results. By choosing these parameters, machine learning techniques become more accessible to bioinformatics.


International Conference on Machine Learning and Applications | 2012

Comparing Two New Gene Selection Ensemble Approaches with the Commonly-Used Approach

David J. Dittman; Taghi M. Khoshgoftaar; Randall Wald; Amri Napolitano

Ensemble feature selection has recently become a topic of interest for researchers, especially in the area of bioinformatics. The benefits of ensemble feature selection include increased feature (gene) subset stability and usefulness as well as comparable (or better) classification performance compared to using a single feature selection method. However, existing work on ensemble feature selection has concentrated on data diversity (using a single feature selection method on multiple datasets or sampled data from a single dataset), neglecting two other potential sources of diversity. We present these two new approaches for gene selection, functional diversity (using multiple feature selection techniques on a single dataset) and hybrid (a combination of data and functional diversity). To demonstrate the value of these new approaches, we measure the similarity between the feature subsets created by each of the three approaches across twenty-six datasets and ten feature selection techniques (or an ensemble of these techniques as appropriate). We also compare the classification performance of models built using each of the three ensembles. Our results show that the similarity between the functional diversity and hybrid approaches is much higher than the similarity between either of those and data diversity, with the distinction between data diversity and our new approaches being particularly strong for hard-to-learn datasets. In addition to having the highest similarity, functional and hybrid diversity generally show greater classification performance than data diversity, especially when selecting small feature subsets. These results demonstrate both that these new approaches can provide a different feature subset than the existing approach and that the resulting feature subset is potentially of interest to researchers. To our knowledge, no previous study has explored these new approaches to ensemble feature selection within the domain of bioinformatics.
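The functional-diversity idea can be sketched with two deliberately different toy rankers applied to one dataset: one scores genes by raw mean shift between classes, the other scales that shift by within-class spread, so they disagree on which gene is best. The scorers and data below are illustrative stand-ins, not the ten techniques used in the paper.

```python
from statistics import mean, pstdev

def split(vals, labels):
    """Partition one gene's values by class label."""
    pos = [v for v, l in zip(vals, labels) if l == 1]
    neg = [v for v, l in zip(vals, labels) if l == 0]
    return pos, neg

def mean_diff(vals, labels):
    """Score a gene by the absolute difference of class means."""
    pos, neg = split(vals, labels)
    return abs(mean(pos) - mean(neg))

def t_like(vals, labels):
    """Score a gene by mean difference scaled by class spread."""
    pos, neg = split(vals, labels)
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))

def rank(score, data, labels):
    """Order genes best-first under the given scoring function."""
    return sorted(data, key=lambda g: score(data[g], labels), reverse=True)

# Toy expression data: four samples (class labels 0,0,1,1) for three genes
data = {"g1": [0, 10, 20, 30],       # big mean shift, big spread
        "g2": [1.0, 1.2, 2.0, 2.2],  # small shift, tiny spread
        "g3": [5, 5.5, 5.4, 6]}      # weak under both scores
labels = [0, 0, 1, 1]
rankings = [rank(s, data, labels) for s in (mean_diff, t_like)]
print(rankings[0], rankings[1])  # the two rankers disagree on the top gene
```

These disagreeing lists are the raw material the functional-diversity ensemble aggregates; a data-diversity ensemble would instead rerun one scorer on resampled data, and the hybrid approach combines both sources.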


International Conference on Machine Learning and Applications | 2012

Mean Aggregation versus Robust Rank Aggregation for Ensemble Gene Selection

Randall Wald; Taghi M. Khoshgoftaar; David J. Dittman

Feature (gene) selection is an important preprocessing step for performing data mining on large-scale bioinformatics datasets. However, one known concern is that feature selection can sometimes give very different results when applied to very similar datasets. Ensemble gene selection is a promising new approach which may help resolve this concern, producing more stable gene lists and better classification results. Ensemble selection consists of multiple runs of feature ranking which are then combined into a single ranking for each feature. However, one of the most critical decisions when performing ensemble gene selection is which aggregation technique to use for combining the resulting ranked feature lists from the multiple runs of feature ranking into a single decision for each gene. This paper is an in-depth comparison between two aggregation techniques: Mean Aggregation (a simple and commonly-used technique) and Robust Rank Aggregation (a recently proposed aggregation technique designed specifically for bioinformatics). Our results show that in general Mean Aggregation will outperform (or at least match) Robust Rank Aggregation in terms of classification performance, while being significantly simpler to implement and perform. These results allow us to recommend, with reasonable confidence, the use of Mean Aggregation over Robust Rank Aggregation.
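The simplicity the paper credits to Mean Aggregation is easy to see in a sketch: each gene's final score is just its average position across the input ranked lists. The three ranked lists and gene names below are made up for illustration.

```python
def mean_aggregate(ranked_lists):
    """Combine ranked gene lists by average list position (Mean Aggregation).
    Assumes every list contains the same genes."""
    genes = ranked_lists[0]
    avg = {g: sum(lst.index(g) for lst in ranked_lists) / len(ranked_lists)
           for g in genes}
    return sorted(genes, key=avg.get)  # lowest average position ranks first

# Three ranked lists, e.g. from three runs of a feature ranker
lists = [["TP53", "BRCA1", "EGFR", "MYC"],
         ["BRCA1", "TP53", "MYC", "EGFR"],
         ["TP53", "MYC", "BRCA1", "EGFR"]]
print(mean_aggregate(lists))
```

Robust Rank Aggregation, by contrast, fits a statistical model of how unlikely each gene's positions are under random ordering, which is what makes it significantly more involved to implement.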

Collaboration


Dive into David J. Dittman's collaborations.

Top Co-Authors

- Amri Napolitano (Florida Atlantic University)
- Randall Wald (Florida Atlantic University)
- Alireza Fazelpour (Florida Atlantic University)
- Wael Awada (Florida Atlantic University)
- Joseph D. Prusa (Florida Atlantic University)
- Jason Van Hulse (Florida Atlantic University)
- Ahmad Abu Shanab (Florida Atlantic University)
- Brian Heredia (Florida Atlantic University)