
Publication


Featured research published by Ronaldo C. Prati.


SIGKDD Explorations | 2004

A study of the behavior of several methods for balancing machine learning training data

Gustavo E. A. P. A. Batista; Ronaldo C. Prati; Maria Carolina Monard

There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is class imbalance, in which examples in the training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is often found in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem on thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, combining a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from the original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, among the investigated over-sampling methods.
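The Random over-sampling baseline discussed in the abstract can be sketched in a few lines of plain Python. The function below duplicates randomly chosen minority-class examples until both classes have the same size; the function name and interface are illustrative, not the authors' implementation.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority-class examples until the
    classes are balanced (a sketch of Random over-sampling)."""
    rng = random.Random(seed)
    minority = [(x, c) for x, c in zip(X, y) if c == minority_label]
    majority = [(x, c) for x, c in zip(X, y) if c != minority_label]
    resampled = list(minority)
    # Keep sampling minority examples with replacement until balanced.
    while len(resampled) < len(majority):
        resampled.append(rng.choice(minority))
    combined = majority + resampled
    rng.shuffle(combined)
    return [x for x, _ in combined], [c for _, c in combined]
```

SMOTE-based variants such as Smote + Tomek and Smote + ENN replace the plain duplication step with synthetic-example generation followed by data cleaning, which is how the abstract's better-defined class clusters are obtained.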


Mexican International Conference on Artificial Intelligence | 2004

Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

Several works point out class imbalance as an obstacle to applying machine learning algorithms to real-world domains. However, in some cases, learning algorithms perform well on several imbalanced domains. Thus, it does not seem fair to directly attribute the loss of performance of learning algorithms to class imbalance. In this work, we develop a systematic study aiming to question whether class imbalances are truly to blame for the loss of performance of learning systems, or whether they are not a problem by themselves. Our experiments suggest that the problem is not directly caused by class imbalances, but is also related to the degree of overlapping among the classes.


IEEE Transactions on Knowledge and Data Engineering | 2011

A Survey on Graphical Methods for Classification Predictive Performance Evaluation

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

Predictive performance evaluation is a fundamental issue in the design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as the error rate, although quite convenient due to their simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. For this reason, various graphical performance evaluation methods are increasingly drawing the attention of the machine learning, data mining, and pattern recognition communities. The main advantage of such methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.
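As an illustration of the kind of scalar summary the survey contrasts with graphical methods, the AUC can be computed directly as a rank statistic: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. This is a minimal, self-contained sketch, not a method from the paper.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example outranks
    a random negative one (ties count half); equivalent to the area
    under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that this pairwise formulation makes the AUC's insensitivity to class proportions explicit, which is why it is favored over error rate on imbalanced data.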


Intelligent Data Analysis | 2005

Balancing strategies and class overlapping

Gustavo E. A. P. A. Batista; Ronaldo C. Prati; Maria Carolina Monard

Several studies have pointed out that class imbalance is a bottleneck in the performance achieved by standard supervised learning systems. However, a complete understanding of how this problem affects the performance of learning is still lacking. In previous work we identified that performance degradation is not solely caused by class imbalances, but is also related to the degree of class overlapping. In this work, we take our research a step further by investigating sampling strategies which aim to balance the training set. Our results show that these sampling strategies usually lead to a performance improvement for highly imbalanced data sets with highly overlapped classes. In addition, over-sampling methods seem to outperform under-sampling methods.


Brazilian Symposium on Artificial Intelligence | 2004

Learning with Class Skews and Small Disjuncts

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

One of the main objectives of a Machine Learning (ML) system is to induce a classifier that minimizes classification errors. Two relevant topics in ML are understanding which domain characteristics and which inducer limitations might cause an increase in misclassification. In this sense, this work analyzes two important issues that might influence the performance of ML systems: class imbalance and error-prone small disjuncts. Our main objective is to investigate how these two aspects are related to each other. Aiming to overcome both problems, we analyzed the behavior of two over-sampling methods we have proposed, namely Smote + Tomek links and Smote + ENN. Our results suggest that these methods are effective for dealing with class imbalance and, in some cases, might help in ruling out some undesirable disjuncts. However, in some cases a simpler method, Random over-sampling, provides comparable results while requiring fewer computational resources.


International Symposium on Neural Networks | 2012

Combining feature ranking algorithms through rank aggregation

Ronaldo C. Prati

The problem of combining multiple feature rankings into a more robust ranking is investigated. A general framework for ensemble feature ranking is proposed, alongside four instantiations of this framework using different rank aggregation methods. An empirical evaluation using 39 UCI datasets, three different learning algorithms and three different performance measures enables us to reach a compelling conclusion: ensemble feature ranking does improve the quality of feature rankings. Furthermore, one of the proposed methods achieved results statistically significantly better than the others.
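Borda count is one common rank-aggregation scheme; the paper evaluates four aggregation methods, and whether Borda is among them is not stated here, so treat this as a generic illustration of how several feature rankings can be fused into one.

```python
def borda_aggregate(rankings):
    """Combine feature rankings (best first) by Borda count: a feature
    in position p of an n-item ranking earns n - p points. Ties are
    broken alphabetically for determinism."""
    n = len(rankings[0])
    points = {}
    for ranking in rankings:
        for p, feature in enumerate(ranking):
            points[feature] = points.get(feature, 0) + (n - p)
    return sorted(points, key=lambda f: (-points[f], f))
```

For example, aggregating the rankings [a, b, c], [a, c, b] and [b, a, c] gives a the most points and c the fewest, so the ensemble ranking is [a, b, c].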


International Conference on Hybrid Intelligent Systems | 2005

Constructing ensembles of symbolic classifiers

Flávia Cristina Bernardini; Maria Carolina Monard; Ronaldo C. Prati

Learning algorithms are an integral part of the data mining (DM) process. However, DM deals with large amounts of data, and most learning algorithms do not scale to massive datasets. A technique often used to ease this problem is data sampling combined with the construction of ensembles of classifiers. Several methods to construct such ensembles have been proposed; however, these methods often lack an explanation facility. This paper proposes methods to construct ensembles of symbolic classifiers, which can be further explored in order to explain their decisions to the user. These methods were implemented in the ELE system, also described in this work. Experimental results on two out of three datasets show improvement over all base classifiers. Moreover, according to the obtained results, methods based on single-rule classification might be used to improve the explanation facility of ensembles.
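The simplest way to combine the base classifiers of an ensemble is a majority vote over their predictions. The sketch below is a generic combination rule, not necessarily the one implemented in the ELE system; with symbolic (rule-based) base classifiers, the fired rules behind each vote can also be surfaced to the user as an explanation.

```python
from collections import Counter

def ensemble_predict(classifiers, example):
    """Majority vote over the base classifiers' predictions;
    a generic ensemble-combination sketch."""
    votes = [clf(example) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]
```
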


IEEE Latin America Transactions | 2008

Evaluating Classifiers Using ROC Curves

Ronaldo C. Prati; Gustavo E. A. P. A. Batista; Maria Carolina Monard

ROC charts have recently been introduced as a powerful tool for the evaluation of learning systems. Although ROC charts are conceptually simple, there are some common misconceptions and pitfalls in their practical use. This work surveys ROC analysis, highlighting the advantages of its use in machine learning and data mining, and tries to clarify several erroneous interpretations related to its use.
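An ROC chart is traced by sweeping the decision threshold over a classifier's scores and recording the resulting (false positive rate, true positive rate) pairs. The following minimal sketch illustrates that construction; it does not group tied scores, a simplification a production version should fix.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and record
    (false positive rate, true positive rate) pairs."""
    P = sum(1 for y in labels if y == 1)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Each example, taken in order of decreasing score, moves the
    # curve one step up (true positive) or right (false positive).
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

A curve passing through (0.0, 1.0) corresponds to a threshold that separates the classes perfectly, which is one of the readings the survey's clarifications concern.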


Archive | 2018

Learning from Imbalanced Data Sets

Alberto Fernández; Salvador García; Mikel Galar; Ronaldo C. Prati; Bartosz Krawczyk; Francisco Herrera

Nowadays, the availability of large volumes of data and the widespread use of tools for extracting knowledge from it have become very common, especially in large corporations. This fact has transformed data analysis, orienting it towards certain specialized techniques included under the umbrella of Data Science. In summary, Data Science can be considered a discipline for discovering new and significant relationships, patterns and trends in the examination of large amounts of data. Therefore, Data Science techniques pursue the automatic discovery of the knowledge contained in the information stored in large databases. These techniques aim to uncover patterns, profiles and trends through the analysis of data using technologies such as clustering, classification, predictive analysis and association mining, among others. For this reason, we are witnessing the development of multiple software solutions for the treatment of data, integrating many Data Science algorithms. In order to better understand the nature of Data Science, this chapter is organized as follows. Sections 1.2 and 1.3 define the Data Science terms and workflow. Then, in Sect. 1.4 the standard problems in Data Science are introduced. Section 1.5 describes some standard data mining algorithms. Finally, in Sect. 1.6 some of the non-standard problems in Data Science are mentioned.


Brazilian Conference on Intelligent Systems | 2013

Complex Network Measures for Data Set Characterization

Gleison Morais; Ronaldo C. Prati

This paper investigates the adoption of measures used to evaluate complex network properties in the characterization of the complexity of data sets in machine learning applications. These measures are obtained from a graph-based representation of a data set. A graph representation has several interesting properties, as it can encode local neighborhood relations as well as global characteristics of the data. The measures are evaluated in a meta-learning framework, where the objective is to predict, on a pairwise comparison basis, which classifier will perform better on a given task, based on the complexity measures. Results were compared to traditional data set complexity characterization metrics, showing the competitiveness of the proposed measures derived from the graph representation.
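To make the idea concrete, the sketch below builds one possible graph representation of a data set (an epsilon-neighborhood graph; the paper's exact construction is not stated here) and computes the average local clustering coefficient, a standard complex-network measure.

```python
import itertools
import math

def eps_graph(points, eps):
    """Connect every pair of points within Euclidean distance eps."""
    adj = {i: set() for i in range(len(points))}
    for i, j in itertools.combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def avg_clustering(adj):
    """Mean local clustering coefficient: for each node, the fraction
    of its neighbour pairs that are themselves connected."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in itertools.combinations(sorted(nbrs), 2)
                        if b in adj[a])
            total += 2.0 * links / (k * (k - 1))
    return total / len(adj)
```

In a meta-learning setting, measures like this one become features describing a data set, from which a model predicts which classifier of a pair will perform better.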

Collaboration


Dive into Ronaldo C. Prati's collaboration.

Top Co-Authors

Mikel Galar
Universidad Pública de Navarra

Bartosz Krawczyk
Virginia Commonwealth University