
Publication


Featured research published by Gustavo E. A. P. A. Batista.


SIGKDD Explorations | 2004

A study of the behavior of several methods for balancing machine learning training data

Gustavo E. A. P. A. Batista; Ronaldo C. Prati; Maria Carolina Monard

There are several aspects that might influence the performance achieved by existing learning systems. One of these aspects is class imbalance, in which examples in the training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is often found in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, combining a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from the original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, among the investigated over-sampling methods.
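A minimal NumPy sketch of the combination the abstract describes: SMOTE to synthesize minority examples, followed by Tomek-link removal to clean the class boundary. The function names, the binary 0/1 label convention, and the parameter defaults are illustrative, not taken from the paper.

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    # SMOTE: create synthetic minority examples by interpolating between
    # a random minority example and one of its k nearest minority neighbours.
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

def remove_tomek_links(X, y, majority_label):
    # A Tomek link is a pair of mutual 1-nearest-neighbours with different
    # labels; dropping the majority-class member cleans the class boundary.
    keep = np.ones(len(X), dtype=bool)
    for a in range(len(X)):
        d = np.linalg.norm(X - X[a], axis=1)
        d[a] = np.inf
        b = int(np.argmin(d))
        d2 = np.linalg.norm(X - X[b], axis=1)
        d2[b] = np.inf
        if int(np.argmin(d2)) == a and y[a] != y[b]:
            keep[a if y[a] == majority_label else b] = False
    return X[keep], y[keep]
```

The `imbalanced-learn` library provides production implementations of both combinations (`SMOTETomek`, `SMOTEENN`).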


Applied Artificial Intelligence | 2003

An Analysis of Four Missing Data Treatment Methods for Supervised Learning

Gustavo E. A. P. A. Batista; Maria Carolina Monard

One relevant problem in data quality is missing data. Despite the frequent occurrence and the relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. However, missing data should be treated carefully; otherwise, bias might be introduced into the induced knowledge. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with some plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method broadly used to treat missing values.
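A self-contained NumPy sketch of the kNN-imputation idea: fill each missing value with the mean of that feature over the k complete rows closest to the incomplete row. This is a simplification for illustration, not the paper's exact procedure; scikit-learn's `KNNImputer` offers a ready-made version.

```python
import numpy as np

def knn_impute(X, k=3):
    # Replace each NaN with the mean of that feature over the k complete
    # rows closest to the incomplete row (distance over observed columns).
    X = X.astype(float).copy()
    miss = np.isnan(X)
    complete = np.where(~miss.any(axis=1))[0]
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        # distances to fully observed rows, using only observed columns
        d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
        nn = complete[np.argsort(d)[:k]]
        X[i, miss[i]] = X[nn][:, miss[i]].mean(axis=0)
    return X
```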


Knowledge Discovery and Data Mining | 2012

Searching and mining trillions of time series subsequences under dynamic time warping

Thanawin Rakthanmanon; Bilson J. L. Campana; Abdullah Mueen; Gustavo E. A. P. A. Batista; M. Brandon Westover; Qiang Zhu; Jesin Zakaria; Eamonn J. Keogh

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
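Fast exact DTW search of this kind relies on cheap lower bounds that prune candidates before the expensive DTW computation. Below is a sketch of LB_Keogh, a classic pruning tool in this line of work (the paper cascades several bounds; this is prior art it builds on, not one of its four new ideas), together with a banded DTW for comparison. Squared local costs and the Sakoe-Chiba band width `r` are conventions chosen here for illustration.

```python
import numpy as np

def lb_keogh(q, c, r):
    # LB_Keogh: build the min/max envelope of candidate c within warping
    # window r; query points outside the envelope contribute squared error.
    # Guaranteed to lower-bound banded DTW, so it can prune candidates.
    total = 0.0
    for i in range(len(q)):
        window = c[max(0, i - r):i + r + 1]
        lo, hi = window.min(), window.max()
        if q[i] > hi:
            total += (q[i] - hi) ** 2
        elif q[i] < lo:
            total += (q[i] - lo) ** 2
    return total

def dtw(a, b, r):
    # Dynamic time warping with a Sakoe-Chiba band of half-width r,
    # squared local costs (no square root, matching the bound above).
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In a search loop, a candidate is fully evaluated with `dtw` only when its `lb_keogh` value is below the best-so-far distance.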


Mexican International Conference on Artificial Intelligence | 2004

Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

Several works point out class imbalance as an obstacle to applying machine learning algorithms to real world domains. However, in some cases, learning algorithms perform well on several imbalanced domains. Thus, it does not seem fair to directly attribute the loss of performance of learning algorithms to class imbalance. In this work, we develop a systematic study aiming to question whether class imbalances are truly to blame for the loss of performance of learning systems, or whether class imbalances are not a problem by themselves. Our experiments suggest that the problem is not directly caused by class imbalances, but is also related to the degree of overlapping among the classes.


Data Mining and Knowledge Discovery | 2014

CID: an efficient complexity-invariant distance for time series

Gustavo E. A. P. A. Batista; Eamonn J. Keogh; Oben M. Tataw; Vinícius Mourão Alves de Souza

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used. In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by "suggesting" to the clustering algorithm that subjectively similar, but complex objects belong in a sparser and larger diameter cluster than is truly warranted. We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangle inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
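The published CID measure is Euclidean distance scaled by a complexity correction factor, where a series' complexity estimate is the length of the series viewed as a line. A sketch following that definition (the small epsilon guard against flat, zero-complexity series is an implementation detail added here):

```python
import numpy as np

def complexity(t):
    # Complexity estimate CE(T): "stretched length" of the series,
    # sqrt of the sum of squared differences of consecutive points.
    return np.sqrt(np.sum(np.diff(t) ** 2))

def cid(q, c):
    # Complexity-Invariant Distance: Euclidean distance times the
    # correction factor max(CE)/min(CE), which is always >= 1, so
    # pairs of differently complex series are pushed further apart.
    ed = np.linalg.norm(q - c)
    ce_q, ce_c = complexity(q), complexity(c)
    return ed * max(ce_q, ce_c) / max(min(ce_q, ce_c), 1e-12)
```

When the two series have equal complexity the correction factor is 1 and CID reduces to plain Euclidean distance.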


IEEE Transactions on Knowledge and Data Engineering | 2011

A Survey on Graphical Methods for Classification Predictive Performance Evaluation

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

Predictive performance evaluation is a fundamental issue in the design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as error rate, although quite convenient due to their simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. Due to this, various graphical performance evaluation methods are increasingly drawing the attention of the machine learning, data mining, and pattern recognition communities. The main advantage of these methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.
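The best-known of these graphical methods is the ROC curve. A minimal sketch of how its points, and the AUC scalar summary derived from it, are computed from classifier scores (binary 0/1 labels assumed, ties ignored for simplicity):

```python
import numpy as np

def roc_curve_points(scores, labels):
    # Sort by decreasing score and sweep the decision threshold: the
    # cumulative positive/negative counts give the (FPR, TPR) points.
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    P = labels.sum()
    N = len(labels) - P
    tpr = np.concatenate(([0.0], np.cumsum(labels) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / N))
    return fpr, tpr

def auc(fpr, tpr):
    # Area under the ROC curve via the trapezoidal rule.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```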


Mexican International Conference on Artificial Intelligence | 2000

Applying One-Sided Selection to Unbalanced Datasets

Gustavo E. A. P. A. Batista; André Carlos Ponce Leon Ferreira de Carvalho; Maria Carolina Monard

Several aspects may influence the performance achieved by a classifier created by a Machine Learning system. One of these aspects is related to the difference between the number of examples belonging to each class. When the difference is large, the learning system may have difficulty learning the concept related to the minority class. In this work, we discuss some methods to decrease the number of examples belonging to the majority class, in order to improve performance on the minority class. We also propose the use of the VDM metric in order to improve the performance of the classification techniques. Experiments on a real world dataset confirm the effectiveness of the proposed methods.
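The VDM (Value Difference Metric) mentioned above measures the distance between two values of a nominal attribute by how differently the values distribute over the classes. A simplified single-attribute sketch (the exponent `q=2` and the function shape are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def vdm(attr, y, v1, v2, q=2):
    # Value Difference Metric for one nominal attribute: two values are
    # close when they induce similar class-conditional distributions.
    d = 0.0
    for c in np.unique(y):
        p1 = np.mean(y[attr == v1] == c)   # P(class=c | attr=v1)
        p2 = np.mean(y[attr == v2] == c)   # P(class=c | attr=v2)
        d += abs(p1 - p2) ** q
    return d
```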


Intelligent Data Analysis | 2005

Balancing strategies and class overlapping

Gustavo E. A. P. A. Batista; Ronaldo C. Prati; Maria Carolina Monard

Several studies have pointed out that class imbalance is a bottleneck in the performance achieved by standard supervised learning systems. However, a complete understanding of how this problem affects the performance of learning is still lacking. In previous work we identified that performance degradation is not solely caused by class imbalances, but is also related to the degree of class overlapping. In this work, we take our research a step further by investigating sampling strategies which aim to balance the training set. Our results show that these sampling strategies usually lead to a performance improvement for highly imbalanced data sets having highly overlapped classes. In addition, over-sampling methods seem to outperform under-sampling methods.


Knowledge Discovery and Data Mining | 2013

DTW-D: time series semi-supervised learning from a single example

Yanping Chen; Bing Hu; Eamonn J. Keogh; Gustavo E. A. P. A. Batista

Classification of time series data is an important problem with applications in virtually every scientific endeavor. The large research community working on time series classification has typically used the UCR Archive to test their algorithms. In this work we argue that the availability of this resource has isolated much of the research community from the following reality: labeled time series data is often very difficult to obtain. The obvious solution to this problem is the application of semi-supervised learning; however, as we shall show, direct applications of off-the-shelf semi-supervised learning algorithms do not typically work well for time series. In this work we explain why semi-supervised learning algorithms typically fail for time series problems, and we introduce a simple but very effective fix. We demonstrate our ideas on diverse real world problems.
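The fix the paper introduces, DTW-D, rescales the DTW distance by the Euclidean distance between the same pair, so that pairs which warp well stand out. A compact illustrative sketch, using squared local costs for both distances so the ratio is well defined (the epsilon guard and the unwindowed DTW are simplifications made here):

```python
import numpy as np

def dtw(a, b):
    # Plain dynamic time warping with squared local costs.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_d(a, b, eps=1e-12):
    # DTW-D: DTW divided by (squared) Euclidean distance. For equal-length
    # series the diagonal path shows DTW <= ED^2, so the ratio is in [0, 1];
    # small values mean warping explains most of the apparent difference.
    return dtw(a, b) / (np.linalg.norm(a - b) ** 2 + eps)
```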


Brazilian Symposium on Artificial Intelligence | 2004

Learning with Class Skews and Small Disjuncts

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

One of the main objectives of a Machine Learning (ML) system is to induce a classifier that minimizes classification errors. Two relevant topics in ML are understanding which domain characteristics and which inducer limitations might cause an increase in misclassification. In this sense, this work analyzes two important issues that might influence the performance of ML systems: class imbalance and error-prone small disjuncts. Our main objective is to investigate how these two important aspects are related to each other. Aiming to overcome both problems, we analyzed the behavior of two over-sampling methods we have proposed, namely Smote + Tomek links and Smote + ENN. Our results suggest that these methods are effective for dealing with class imbalance and, in some cases, might help in ruling out some undesirable disjuncts. However, in some cases a simpler method, random over-sampling, provides comparable results requiring less computational resources.

Collaboration


Dive into Gustavo E. A. P. A. Batista's collaboration.

Top Co-Authors

Diego Furtado Silva (Spanish National Research Council)

Ronaldo C. Prati (Universidade Federal do ABC)

Rafael Giusti (University of São Paulo)