Publication


Featured research published by Gregory Ditzler.


IEEE Transactions on Knowledge and Data Engineering | 2013

Incremental Learning of Concept Drift from Streaming Imbalanced Data

Gregory Ditzler; Robi Polikar

Learning in nonstationary environments, also known as learning concept drift, is concerned with learning from data whose statistical characteristics change over time. Concept drift is further complicated if the data set is class imbalanced. While these two issues have been addressed independently, their joint treatment has been mostly underexplored. We describe two ensemble-based approaches for learning concept drift from imbalanced data. Our first approach is a logical combination of our previously introduced Learn++.NSE algorithm for concept drift with the well-established SMOTE for learning from imbalanced data. Our second approach makes two major modifications to the Learn++.NSE-SMOTE integration: replacing SMOTE with a subensemble that makes strategic use of minority class data, and replacing Learn++.NSE and its class-independent error weighting mechanism with a penalty constraint that forces the algorithm to balance accuracy on all classes. The primary novelty of this approach is in determining the voting weights for combining ensemble members, based on each classifier's time- and imbalance-adjusted accuracy on current and past environments. Favorable results in comparison to other methods indicate that both approaches can address this challenging problem, each with its own specific areas of strength. We also release all experimental data as a resource and benchmark for future research.
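
To make the weighting idea concrete, here is a minimal sketch of time-decayed, class-balance-aware majority voting. The exponential decay scheme and the helper names are illustrative assumptions, not the paper's exact Learn++ update rule.

```python
# Minimal sketch: combine ensemble members with weights derived from
# time-decayed, class-balanced accuracies on past batches. Assumes each
# classifier in `classifiers` is already fitted; `acc_history[i]` holds
# classifier i's balanced accuracies on batches 0..t (most recent last).
import numpy as np

def ensemble_vote(classifiers, acc_history, X, decay=0.5):
    weights = []
    for accs in acc_history:
        # Exponentially down-weight accuracy measured on older environments.
        w = sum(a * decay ** (len(accs) - 1 - k) for k, a in enumerate(accs))
        weights.append(max(w, 1e-12))
    weights = np.array(weights)

    preds = np.array([clf.predict(X) for clf in classifiers])  # (n_clf, n)
    classes = np.unique(preds)
    # Each sample's label is the class with the largest total member weight.
    scores = np.array([[weights[preds[:, j] == c].sum() for c in classes]
                       for j in range(preds.shape[1])])
    return classes[scores.argmax(axis=1)]
```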


IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE) | 2011

Hellinger distance based drift detection for nonstationary environments

Gregory Ditzler; Robi Polikar

Most machine learning algorithms, including many online learners, assume that the data distribution to be learned is fixed. There are many real-world problems, however, where the distribution of the data changes as a function of time. Changes in nonstationary data distributions can significantly reduce the generalization ability of the learning algorithm on new or field data if the algorithm is not equipped to track such changes. When the stationary data distribution assumption does not hold, the learner must take appropriate actions to ensure that the new/relevant information is learned. On the other hand, data distributions do not necessarily change continuously, necessitating the ability to monitor the distribution and detect when a significant change has occurred. In this work, we propose and analyze a feature-based drift detection method using the Hellinger distance to detect gradual or abrupt changes in the distribution.
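
A minimal sketch of the feature-wise Hellinger test follows; the equal-width binning and the fixed threshold are simplifying assumptions (the paper derives an adaptive threshold from past distance differences).

```python
# Minimal sketch: flag drift when the mean per-feature Hellinger distance
# between a reference batch and the current batch exceeds a threshold.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def detect_drift(reference, current, n_bins=10, threshold=0.1):
    dists = []
    for f in range(reference.shape[1]):
        # Shared bin edges so both histograms are directly comparable.
        lo = min(reference[:, f].min(), current[:, f].min())
        hi = max(reference[:, f].max(), current[:, f].max())
        hi = max(hi, lo + 1e-12)          # guard against constant features
        edges = np.linspace(lo, hi, n_bins + 1)
        p, _ = np.histogram(reference[:, f], bins=edges)
        q, _ = np.histogram(current[:, f], bins=edges)
        dists.append(hellinger(p / p.sum(), q / q.sum()))
    return float(np.mean(dists)) > threshold
```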


International Conference on Pattern Recognition | 2010

An Incremental Learning Algorithm for Non-stationary Environments and Class Imbalance

Gregory Ditzler; Robi Polikar; Nitesh V. Chawla

Learning in a non-stationary environment and in the presence of class imbalance has been receiving more recognition from the computational intelligence community, but little work has been done to create an algorithm or framework that can handle both issues simultaneously. We have recently introduced a new member of the Learn++ family of algorithms, Learn++.NSE, which is designed to track non-stationary environments. However, this algorithm does not work well when there is class imbalance, as it was not designed to handle this problem. On the other hand, SMOTE, a popular algorithm that can handle class imbalance, is not designed to learn in nonstationary environments because it is a method of oversampling the data. In this work we describe and present preliminary results for integrating SMOTE and Learn++.NSE to create an algorithm that is robust to learning in a non-stationary environment and under class imbalance.
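
A minimal sketch of this batch-wise integration, assuming the imbalanced-learn package's SMOTE and a decision-tree base learner (the base learner choice is an assumption); the Learn++.NSE weighted-combination step is omitted here, see the voting sketch earlier on this page.

```python
# Minimal sketch: each incoming batch is SMOTE-oversampled before a new
# base learner is trained and appended to the ensemble. Requires
# imbalanced-learn (pip install imbalanced-learn); note SMOTE needs a
# handful of minority samples per batch for its nearest-neighbor step.
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

ensemble = []

def process_batch(X_batch, y_batch):
    X_res, y_res = SMOTE().fit_resample(X_batch, y_batch)
    clf = DecisionTreeClassifier(max_depth=5).fit(X_res, y_res)
    ensemble.append(clf)
```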


International Symposium on Neural Networks | 2011

Semi-supervised learning in nonstationary environments

Gregory Ditzler; Robi Polikar

Learning in nonstationary environments, also called learning concept drift, has been receiving increasing attention due to the growing number of applications that generate data with drifting distributions. These applications are usually associated with streaming data, either online or in batches, and concept drift algorithms are trained to detect and track the drifting concepts. While concept drift itself is a significantly more complex problem than the traditional machine learning paradigm of data coming from a fixed distribution, the problem is further complicated when obtaining labeled data is expensive and training must rely, in part, on unlabeled data. Independently of concept drift research, semi-supervised approaches have been developed for learning from (limited) labeled and (abundant) unlabeled data; however, such approaches have been largely absent from the concept drift literature. In this contribution, we describe an ensemble-of-classifiers approach that takes advantage of both labeled and unlabeled data in addressing concept drift: available labeled data are used to generate classifiers, whose voting weights are determined based on the distances between Gaussian mixture model components trained on both labeled and unlabeled data in a drifting environment.
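
A minimal sketch of the voting-weight idea, under the simplifying assumption that classifier relevance is scored by the distance between optimally matched Gaussian mixture component means; the paper's actual distance computation may differ.

```python
# Minimal sketch: weight each stored classifier by how close the Gaussian
# mixture fitted on its training batch is to a mixture fitted on the
# current, largely unlabeled batch.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

def gmm_similarity(gmm_a, gmm_b):
    # Pairwise distances between component means, optimally matched.
    cost = np.linalg.norm(gmm_a.means_[:, None, :] - gmm_b.means_[None, :, :],
                          axis=2)
    rows, cols = linear_sum_assignment(cost)
    return 1.0 / (cost[rows, cols].sum() + 1e-12)

def voting_weights(past_gmms, X_current, n_components=3):
    gmm_now = GaussianMixture(n_components=n_components).fit(X_current)
    w = np.array([gmm_similarity(g, gmm_now) for g in past_gmms])
    return w / w.sum()   # normalized voting weights, one per classifier
```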


International Symposium on Neural Networks | 2010

An ensemble based incremental learning framework for concept drift and class imbalance

Gregory Ditzler; Robi Polikar

We have recently introduced an incremental learning algorithm, Learn++.NSE, designed to learn in nonstationary environments, which has been shown to provide an attractive solution to a number of concept drift problems under different drift scenarios. However, Learn++.NSE relies on error to weight the classifiers in the ensemble on the most recent data. For balanced class distributions this approach works very well, but when faced with imbalanced data, error is no longer an acceptable measure of performance. On the other hand, the well-established SMOTE algorithm can address the class imbalance issue; however, it cannot learn in nonstationary environments. While there is some literature on learning in nonstationary environments and on imbalanced data separately, the combined problem of learning from imbalanced data coming from nonstationary environments is underexplored. Therefore, in this work we propose two modified frameworks for an algorithm that can be used to incrementally learn from imbalanced data coming from a nonstationary environment.
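
A quick numeric illustration of why raw error fails here: under heavy imbalance, a classifier that ignores the minority class still scores well on accuracy.

```python
# Under 95/5 imbalance, always predicting the majority class looks strong
# on raw accuracy but sits at chance level on balanced accuracy; error
# alone is therefore a poor ensemble-weighting signal.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)        # majority-class-only predictor

print(accuracy_score(y_true, y_pred))            # 0.95
print(balanced_accuracy_score(y_true, y_pred))   # 0.50
```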


International Conference on Multiple Classifier Systems | 2010

Incremental learning of new classes in unbalanced datasets: Learn++.UDNC

Gregory Ditzler; Michael D. Muhlbaier; Robi Polikar

We have previously described an incremental learning algorithm, Learn++.NC, for learning from new datasets that may include new concept classes without accessing previously seen data. We now propose an extension, Learn++.UDNC, that allows the algorithm to incrementally learn new concept classes from unbalanced datasets. We describe the algorithm in detail, and provide experimental results on two separate representative scenarios (on synthetic as well as real-world data), along with comparisons to other approaches for incremental learning and/or unbalanced datasets.


IEEE Transactions on Nanobioscience | 2015

Multi-Layer and Recursive Neural Networks for Metagenomic Classification

Gregory Ditzler; Robi Polikar; Gail Rosen

Recent advances in machine learning, specifically in deep learning with neural networks, have made a profound impact on fields such as natural language processing, image classification, and language modeling; however, the feasibility and potential benefits of these approaches for metagenomic data analysis have been largely under-explored. Deep learning exploits many layers that learn nonlinear feature representations, typically in an unsupervised fashion, and recent results have shown outstanding generalization performance on previously unseen data. Furthermore, some deep learning methods can also represent the structure in a data set. Consequently, deep learning and neural networks may prove to be an appropriate approach for metagenomic data. To determine whether such approaches are indeed appropriate for metagenomics, we experiment with two deep learning methods: i) a deep belief network, and ii) a recursive neural network, the latter of which provides a tree representing the structure of the data. We compare these approaches to the standard multi-layer perceptron, which has been well established in the machine learning community as a powerful prediction algorithm, though its presence is largely missing from the metagenomics literature. We find that traditional neural networks can be quite powerful classifiers on metagenomic data compared to baseline methods, such as random forests. On the other hand, while the deep learning approaches did not improve classification accuracy, they do provide the ability to learn hierarchical representations of a data set that standard classification methods do not allow. Our goal in this effort is not to determine the best algorithm in terms of accuracy, as that depends on the specific application, but rather to highlight the benefits and drawbacks of each of the approaches we discuss and to provide insight on how they can be improved for predictive metagenomic analysis.
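
For flavor, a baseline of the kind compared in the paper: a standard multi-layer perceptron on abundance-style features. The random placeholder data stand in for real OTU or k-mer abundance vectors; architecture and sizes are illustrative assumptions.

```python
# Minimal baseline sketch: a standard MLP evaluated by cross-validation.
# The random data are placeholders for sample-by-feature abundance tables.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 500))       # 200 samples, 500 abundance features
y = rng.integers(0, 2, 200)      # binary phenotype labels

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
print(cross_val_score(mlp, X, y, cv=5).mean())
```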


IEEE Transactions on Neural Networks | 2015

A Bootstrap Based Neyman-Pearson Test for Identifying Variable Importance

Gregory Ditzler; Robi Polikar; Gail Rosen

Selection of the most informative features, leading to a small loss on future data, is arguably one of the most important steps in classification, data analysis, and model selection. Several feature selection (FS) algorithms are available; however, due to noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining whether a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.
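
A minimal sketch of the bootstrap test: rerun a base selector on bootstrap replicates, count selections per feature, and compare each count against the binomial null in which any k of d features are picked at random. Using mutual information as the base selector here is an assumption; the test wraps any FS algorithm.

```python
# Minimal sketch: a feature is declared relevant if it is selected by the
# base FS routine more often across bootstraps than the binomial null
# (selection probability k/d per run) can plausibly explain.
import numpy as np
from scipy.stats import binom
from sklearn.feature_selection import mutual_info_classif

def relevant_features(X, y, k=10, n_boot=100, alpha=0.01):
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_boot):
        idx = np.random.randint(0, n, n)            # bootstrap resample
        mi = mutual_info_classif(X[idx], y[idx])
        counts[np.argsort(mi)[-k:]] += 1            # top-k features selected
    crit = binom.ppf(1 - alpha, n_boot, k / d)      # null critical value
    return np.flatnonzero(counts > crit)
```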


BMC Bioinformatics | 2015

Fizzy: Feature subset selection for metagenomics

Gregory Ditzler; J. Calvin Morrison; Yemin Lan; Gail Rosen

Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Feature subset selection, a sub-field of machine learning, can also provide unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome.

Results: We have developed a new Python command-line tool for microbial ecologists, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets.

Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.
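
A hedged sketch of the kind of pipeline Fizzy supports, assuming the biom-format package for I/O and a mutual-information ranker; the file names are placeholders, and the tool's actual command-line interface is documented in the repository linked above.

```python
# Sketch: load a BIOM abundance table and rank OTUs by mutual information
# with the phenotype label. Requires biom-format (pip install biom-format).
import numpy as np
import biom
from sklearn.feature_selection import mutual_info_classif

table = biom.load_table("study.biom")            # observations x samples
X = table.matrix_data.toarray().T                # samples x OTUs
y = np.loadtxt("labels.txt", dtype=int)          # one label per sample

mi = mutual_info_classif(X, y)
top = np.argsort(mi)[::-1][:25]                  # 25 highest-MI OTUs
print(np.asarray(table.ids(axis="observation"))[top])
```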


International Conference on Bioinformatics | 2012

Information theoretic feature selection for high dimensional metagenomic data

Gregory Ditzler; Gail Rosen; Robi Polikar

Extremely high dimensional data sets are common in genomic classification scenarios, but they are particularly prevalent in metagenomic studies that represent samples as abundances of taxonomic units. Furthermore, the data dimensionality is typically much larger than the number of observations collected for each instance, a phenomenon known as the curse of dimensionality, which is a particularly challenging problem for most machine learning algorithms. The biologists collecting and analyzing data need efficient methods to determine relationships between classes in a data set and the variables that are capable of differentiating between multiple groups in a study. The most common methods of metagenomic data analysis are those characterized by α- and β-diversity tests; however, neither of these tests allows scientists to identify the organisms that are most responsible for differentiating between different categories in a study. In this paper, we present an analysis of information-theoretic feature selection methods for improving classification accuracy with metagenomic data.
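
One representative criterion from this family, sketched below: a greedy max-relevance, min-redundancy (mRMR) selector over discretized abundance features. The paper evaluates several information-theoretic criteria; this particular formulation is shown for illustration only.

```python
# Minimal greedy mRMR sketch over discrete (e.g., binned abundance)
# features: repeatedly add the feature whose relevance to the label,
# minus its mean redundancy with already-selected features, is largest.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X_disc, y, k):
    d = X_disc.shape[1]
    relevance = np.array([mutual_info_score(X_disc[:, j], y) for j in range(d)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        candidates = [j for j in range(d) if j not in selected]
        scores = [relevance[j]
                  - np.mean([mutual_info_score(X_disc[:, j], X_disc[:, s])
                             for s in selected])
                  for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected
```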

Collaboration


Dive into Gregory Ditzler's collaborations. Top co-authors and their affiliations:

Yemin Lan

University of Pennsylvania


Hassan M. Fathallah-Shaykh

University of Alabama at Birmingham


Heng Liu

University of Arizona
