Rita P. Ribeiro | Researchain

Publication

Featured research published by Rita P. Ribeiro.

ACM Computing Surveys | 2016

A Survey of Predictive Modeling on Imbalanced Domains

Paula Branco; Luís Torgo; Rita P. Ribeiro

Many real-world data-mining applications involve obtaining predictive models using datasets with strongly imbalanced distributions of the target variable. Frequently, the least-common values of this target variable are associated with events that are highly relevant for end users (e.g., fraud detection, unusual returns on stock markets, anticipation of catastrophes). Moreover, the events may have different costs and benefits, which, when associated with the rarity of some of them in the available training data, create serious problems for predictive modeling techniques. This article presents a survey of existing techniques for handling these important applications of predictive analytics. Although most of the existing work addresses classification tasks (nominal target variables), we also describe methods designed to handle similar problems within regression tasks (numeric target variables). In this survey, we discuss the main challenges raised by imbalanced domains, propose a definition of the problem, describe the main approaches to these tasks, propose a taxonomy of the methods, summarize the conclusions of existing comparative studies as well as some theoretical analyses of some methods, and refer to some related problems within predictive modeling.

European Conference on Principles of Data Mining and Knowledge Discovery | 2007

Utility-Based Regression

Luís Torgo; Rita P. Ribeiro

Cost-sensitive learning is a key technique for addressing many real-world data mining applications. Most existing research has focused on classification problems. In this paper we propose a framework for evaluating regression models in applications with non-uniform costs and benefits across the domain of the continuous target variable. Namely, we describe two metrics for assessing the costs and benefits of the predictions of any model given a set of test cases. We illustrate the use of our metrics in the context of a specific type of application where non-uniform costs are required: the prediction of rare extreme values of a continuous target variable. Our experiments provide clear evidence of the utility of the proposed framework for evaluating the merits of any model in this class of regression domains.

Portuguese Conference on Artificial Intelligence | 2013

SMOTE for Regression

Luís Torgo; Rita P. Ribeiro; Bernhard Pfahringer; Paula Branco

Several real-world prediction problems involve forecasting rare values of a target variable. When this variable is nominal, we have a problem of class imbalance that has already been studied thoroughly within machine learning. For regression tasks, where the target variable is continuous, few works address this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of task. Namely, we propose to address such tasks with sampling approaches. These approaches change the distribution of the given training data set to reduce the imbalance between the rare target cases and the most frequent ones. We present a modification of the well-known SMOTE algorithm that allows its use on these regression tasks. In an extensive set of experiments we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed SmoteR method can be used with any existing regression algorithm, turning it into a general tool for addressing problems of forecasting rare extreme values of a continuous target variable.
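To make the idea of SMOTE-style over-sampling for a continuous target more concrete, here is a minimal sketch. It is a simplified illustration and not the SmoteR procedure from the paper: the function name, the `is_rare` predicate, the neighbour count `k`, and the choice to interpolate the target with the same fraction as the features are all assumptions made for this example.

```python
import math
import random


def smote_like_oversample(cases, is_rare, n_new, k=3, seed=0):
    """Generate n_new synthetic cases by interpolating between rare cases.

    cases:   list of (features, target) pairs; features is a list of floats.
    is_rare: hypothetical predicate on the target that marks rare cases.
    """
    rng = random.Random(seed)
    rare = [c for c in cases if is_rare(c[1])]
    if len(rare) < 2:
        raise ValueError("need at least two rare cases to interpolate")

    synthetic = []
    for _ in range(n_new):
        # Pick a rare seed case at random.
        x_a, y_a = rng.choice(rare)
        # Find its k nearest neighbours among the other rare cases.
        others = [c for c in rare if c[0] is not x_a]
        others.sort(key=lambda c: math.dist(c[0], x_a))
        x_b, y_b = rng.choice(others[:k])
        # Interpolate features and target with the same random fraction t.
        t = rng.random()
        new_x = [a + t * (b - a) for a, b in zip(x_a, x_b)]
        new_y = y_a + t * (y_b - y_a)
        synthetic.append((new_x, new_y))
    return synthetic
```

The synthetic cases would be appended to the training set before fitting any regression algorithm, which is what makes a resampling approach independent of the learner.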

Expert Systems | 2015

Resampling strategies for regression

Luís Torgo; Paula Branco; Rita P. Ribeiro; Bernhard Pfahringer

Several real-world prediction problems involve forecasting rare values of a target variable. When this variable is nominal, we have a problem of class imbalance that has been thoroughly studied within machine learning. For regression tasks, where the target variable is continuous, few works address this type of problem. Still, important applications involve forecasting rare extreme values of a continuous target variable. This paper describes a contribution to this type of task. Namely, we propose to address such tasks with resampling approaches that change the distribution of the given data set to reduce the imbalance between the rare target cases and the most frequent ones. We present modifications of two well-known resampling strategies for classification tasks: under-sampling and the synthetic minority over-sampling technique (SMOTE). These modifications allow the use of these strategies on regression tasks where the goal is to forecast rare extreme values of the target variable. In an extensive set of experiments, we provide empirical evidence for the superiority of our proposals for these particular regression tasks. The proposed resampling methods can be used with any existing regression algorithm, which means that they are general tools for addressing problems of forecasting rare extreme values of a continuous target variable.

Discovery Science | 2009

Precision and Recall for Regression

Luís Torgo; Rita P. Ribeiro

Cost-sensitive prediction is a key task in many real-world applications. Most existing research in this area deals with classification problems. This paper addresses a related regression problem: the prediction of rare extreme values of a continuous variable. These values are often regarded as outliers and removed from subsequent analysis. However, for many applications (e.g., in finance, meteorology, biology) these are the key values that we want to accurately predict. Any learning method obtains models by optimizing some preference criteria. In this paper we propose new evaluation criteria that are more adequate for these applications. We describe a generalization for regression of the concepts of precision and recall often used in classification. Using these new evaluation metrics we are able to focus the evaluation of predictive models on the cases that really matter for these applications. Our experiments indicate the advantages of the use of these new measures when comparing predictive models in the context of our target applications.
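As a rough intuition for why precision and recall can be useful in regression, the sketch below uses a plain threshold to turn numeric values into "rare event" flags and then computes the classification-style scores. This thresholding is an assumption made for illustration only; it is not the generalization proposed in the paper, and the function name and `threshold` parameter are hypothetical.

```python
def event_precision_recall(y_true, y_pred, threshold):
    """Precision and recall of 'extreme value' events defined by a threshold.

    A case counts as an event when its value is at or above the threshold.
    """
    true_events = [y >= threshold for y in y_true]
    pred_events = [y >= threshold for y in y_pred]
    # Events that were both predicted and actually observed.
    hits = sum(t and p for t, p in zip(true_events, pred_events))
    precision = hits / sum(pred_events) if any(pred_events) else 0.0
    recall = hits / sum(true_events) if any(true_events) else 0.0
    return precision, recall
```

Unlike an average error such as mean squared error, both scores ignore how well the model does on the common, non-extreme cases.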

Archive | 2003

On the Road to Knowledge

Peter A. Flach; Hendrik Blockeel; Thomas Gärtner; Marko Grobelnik; Branko Kavsek; Martin Kejkula; Darek Krzywania; Nada Lavrač; Peter Ljubic; Dunja Mladenic; Steve Moyle; Stefan Raeymaekers; Jan Rauch; Simon Rawles; Rita P. Ribeiro; Gert Sclep; Jan Struyf; Ljupčo Todorovski; Luís Torgo; Dietrich Wettschereck; Shaomin Wu

In this chapter we describe our experience with mining a large multi-relational database of traffic accident reports. We applied a range of data mining techniques to this dataset, including text mining, clustering of time series, subgroup discovery, multi-relational data mining, and association rule learning. We also describe a collaborative data mining challenge on part of the dataset.

Portuguese Conference on Artificial Intelligence | 2003

Predicting Harmful Algae Blooms

Rita P. Ribeiro; Luís Torgo

In several applications the main interest lies in predicting rare and extreme values. This is the case for the prediction of harmful algae blooms. Though rare, these blooms have a strong impact on river life forms and water quality, and constitute a serious ecological problem. In this paper, we describe a data mining method whose main goal is to accurately predict this kind of rare extreme value. We propose a new splitting criterion for regression trees that enables the induction of trees achieving these goals. We analyze the results obtained with our method on this application domain and compare them to those obtained with standard regression trees. We conclude that this new method achieves better results in terms of the evaluation statistics that are relevant for this kind of application.

Discovery Science | 2014

Failure Prediction – An Application in the Railway Industry

Pedro Mota Pereira; Rita P. Ribeiro; João Gama

Machine or system failures have a high impact at both technical and economic levels. Most modern equipment has logging systems that allow us to collect a diversity of data regarding its operation and health. Using data mining models for novelty detection enables us to explore those datasets, building classification systems that can detect and issue an alert when a failure starts evolving, rather than letting it develop unnoticed up to breakdown. In the present case we use a failure detection system to predict train door breakdowns before they happen, using data from their logging system. We study three methods for failure detection: outlier detection, novelty detection and a supervised SVM. Given the problem's features, namely the possibility of a passenger interrupting the movement of a door, the three predictors are prone to false alarms. The main contribution of this work is the use of a low-pass filter to process the output of the predictors, leading to a strong reduction in the false alarm rate.
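The effect of low-pass filtering a predictor's output can be sketched as follows. This is an illustration under stated assumptions, not the filter used in the paper: a first-order exponential smoother stands in for the low-pass filter, and the `alpha` and `threshold` values are arbitrary choices for the example.

```python
def low_pass(signal, alpha=0.3):
    """First-order low-pass filter (exponential smoothing) of a sequence."""
    smoothed = []
    state = 0.0
    for value in signal:
        # Blend the new value with the previous filtered state.
        state = alpha * value + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


def filtered_alarms(raw_flags, alpha=0.3, threshold=0.5):
    """Raise an alarm only when the smoothed predictor output is high enough."""
    return [s >= threshold for s in low_pass(raw_flags, alpha)]
```

With these settings a single isolated flag, such as one caused by a passenger holding a door, never reaches the threshold, while a run of consecutive flags does. The price is a short delay before a genuine failure triggers the alarm.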

Discovery Science | 2006

Rule-Based Prediction of Rare Extreme Values

Rita P. Ribeiro; Luís Torgo

This paper describes a rule learning method that obtains models biased towards a particular class of regression tasks. These tasks are distinguished by their main goal: to be accurate at predicting rare extreme values of the continuous target variable. Many real-world applications from scientific areas like ecology, meteorology and finance share this objective. Most existing approaches to regression problems search for the model parameters that optimize a given average error estimator (e.g., mean squared error). This means that they are biased towards achieving good performance on the most common cases. The motivation for our work is the claim that being accurate at a small set of rare cases requires different error metrics. Moreover, given the nature and relevance of this type of application, an interpretable model is usually of key importance to domain experts, as predicting these rare events is normally associated with costly decisions. Our proposed system (R-PREV) obtains a set of interpretable regression rules derived from a set of bagged regression trees, using evaluation metrics that bias the resulting models to accurately predict rare extreme values. We provide an experimental evaluation of our method confirming the advantages of our proposal in terms of accuracy in predicting rare extreme values.

Archive | 2016

Hierarchical Time Series Forecast in Electrical Grids

Vania Gomes de Almeida; Rita P. Ribeiro; João Gama

Hierarchical time series forecasting is a topic of first-order importance. Indeed, there are several applications where time series can be naturally disaggregated in a hierarchical structure using attributes such as geographical location, product type, etc. Power networks face interesting problems related to their transition to computer-aided grids. Data can be naturally disaggregated in a hierarchical structure, and it is possible to look at both single and aggregated points along the grid. In this work, we applied different hierarchical forecasting methods to such data. Three approaches are compared: two common ones, bottom-up and top-down, and a third based on the hierarchical structure of the data, the optimal regression combination. The evaluation considers short-term forecasting (24 h ahead). Additionally, we discuss the importance of the degree of correlation among series for improving forecasting accuracy. Our results demonstrate that the hierarchical approach outperforms the bottom-up approach at intermediate and high levels. At lower levels, it performs better in less homogeneous substations, i.e., substations linked to different types of customers. Additionally, its performance is comparable to the top-down approach at top levels. This approach proved to be an interesting tool for hierarchical data analysis. It achieves good performance at top levels, as the top-down approach does, while also capturing series dynamics at bottom levels, as the bottom-up approach does.
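The two common baselines named above can be sketched on a two-level hierarchy. This is a minimal illustration under stated assumptions: the `naive_forecast` placeholder (repeat the last value), the use of historical proportions for top-down disaggregation, and the substation names in the usage note are all hypothetical and not taken from the work. The optimal regression combination is not shown.

```python
def naive_forecast(series):
    """Placeholder forecaster (an assumption): repeat the last observed value."""
    return series[-1]


def bottom_up(bottom_series):
    """Forecast each bottom-level series, then sum the forecasts for the total."""
    bottom = {name: naive_forecast(s) for name, s in bottom_series.items()}
    return bottom, sum(bottom.values())


def top_down(bottom_series):
    """Forecast the aggregated series, then split it by historical proportions."""
    length = len(next(iter(bottom_series.values())))
    total_series = [
        sum(s[t] for s in bottom_series.values()) for t in range(length)
    ]
    total = naive_forecast(total_series)
    history_sum = sum(total_series)
    # Each series receives the share of the total it held historically.
    bottom = {
        name: total * sum(s) / history_sum for name, s in bottom_series.items()
    }
    return bottom, total
```

For example, calling both functions on `{"substation_a": [10.0, 12.0, 14.0], "substation_b": [30.0, 28.0, 26.0]}` yields forecasts that add up consistently across levels in each case, but the bottom-up version follows each substation's own recent dynamics while the top-down version only reflects its average historical share.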
