Jean Paul Barddal
Pontifícia Universidade Católica do Paraná
Publication
Featured research published by Jean Paul Barddal.
ACM Computing Surveys | 2017
Heitor Murilo Gomes; Jean Paul Barddal; Fabrício Enembreck; Albert Bifet
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
Machine Learning | 2017
Heitor Murilo Gomes; Albert Bifet; Jesse Read; Jean Paul Barddal; Fabrício Enembreck; Bernhard Pfahringer; Geoff Holmes; Talel Abdessalem
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
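The test-then-train (prequential) evaluation mentioned above can be illustrated with a minimal sketch: each instance is first used to test the model and only afterwards to train it. The `MajorityClassLearner` class and the `prequential_accuracy` function are hypothetical names introduced for illustration; they are not ARF and are not taken from the paper or from any stream-learning library.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (an assumption for illustration, not ARF)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training instance arrives, no prediction is possible.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each instance first, then train on it (test-then-train)."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test step
            correct += 1
        learner.learn(x, y)          # train step
        total += 1
    return correct / total if total else 0.0


stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because every instance is tested before it is learned, no separate hold-out set is needed. A delayed labelling evaluation would postpone the `learn` call until the label becomes available; that variant is not sketched here.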
Journal of Systems and Software | 2017
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck; Bernhard Pfahringer
This paper provides insights into a nearly neglected type of drift: feature drifts. Existing works on feature drift detection and adaptation are surveyed. Existing works are empirically analyzed, showing how challenging this problem is. This paper provides insights into future directions for research into feature drift detection and adaptation. Data stream mining is a fast-growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area, as even naive approaches produced gains in accuracy while reducing resource usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
International Conference on Tools with Artificial Intelligence | 2015
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck
Mining data streams is of the utmost importance because streams appear in many real-world situations, such as sensor networks, stock market analysis, and computer network intrusion detection systems. Data streams are, by definition, potentially unbounded sequences of data that arrive intermittently at rapid rates. Extracting useful knowledge from data streams embeds virtually all problems from conventional data mining, with the addition of single-pass, real-time processing within limited time and memory space. Additionally, due to their ephemeral nature, streams are expected to undergo changes in their data distribution, called concept drifts. In this work, we focus on one specific kind of concept drift that has not been extensively addressed in the literature, namely feature drift. A feature drift happens when changes occur in the set of features, such that a subset of features becomes, or ceases to be, relevant to the learning problem. Specifically, changes in the relevance of features directly imply modifications in the decision boundary to be learned; thus, the learner must detect and adapt to them. Timely detection of and recovery from feature drifts is a challenging task that can be modeled as a dynamic feature selection problem. In this paper we survey existing work on dynamic feature selection for data streams that acts either implicitly or explicitly. We conclude that there is a need for future research in this area, which we highlight as future research directions.
ACM Symposium on Applied Computing | 2015
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally. On top of that, due to the inherent evolving nature of data streams, it is expected that these algorithms manage to quickly adapt to both concept drifts and the appearance and disappearance of clusters. Nevertheless, many of the developed two-step algorithms are only capable of finding hyper-spherical clusters and are highly dependent on parametrization. In this paper we introduce SNCStream, a one-step online clustering algorithm based on Social Networks Theory, which uses homophily to find non-hyper-spherical clusters. Our empirical studies show that SNCStream is able to surpass density-based algorithms in cluster quality and requires a feasible amount of resources (time and memory) when compared to other algorithms.
ACM Symposium on Applied Computing | 2014
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck
In this paper, we present a new ensemble method, the Scale-free Network Classifier (SFNClassifier), which is conceived as a dynamically sized scale-free network. In Data Stream Mining, ensemble-based approaches have been proposed to enhance accuracy and allow fast recovery from concept drift. However, these approaches are based on update and polling heuristics that do not yield good accuracy in arbitrary domains and do not explicitly represent the similarity between classifiers. Representing the ensemble as a network allows us to extract centrality metrics, which are used to perform a weighted majority vote, where the weight of a classifier is proportional to its centrality value. Based on empirical studies, we conclude that SFNClassifier has accuracy comparable to that of other ensemble learners and outperforms the other methods in processing time.
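The centrality-weighted majority vote described above can be sketched as follows. This is an illustration under stated assumptions, not the SFNClassifier implementation: degree centrality is assumed as the centrality metric (the abstract does not say which one is used), and `degree_centrality`, `weighted_vote`, the node names and the labels are all hypothetical.

```python
from collections import defaultdict


def degree_centrality(nodes, edges):
    """Normalized degree of each node in an undirected network."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    denom = max(len(nodes) - 1, 1)
    return {node: d / denom for node, d in degree.items()}


def weighted_vote(predictions, weights):
    """Return the label with the largest sum of classifier weights."""
    totals = defaultdict(float)
    for classifier, label in predictions.items():
        totals[label] += weights[classifier]
    return max(totals, key=totals.get)


# Hypothetical ensemble: classifier "a" is the hub of a star-shaped network.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")]
weights = degree_centrality(nodes, edges)

predictions = {"a": "spam", "b": "ham", "c": "ham", "d": "ham", "e": "spam"}
winner = weighted_vote(predictions, weights)
```

In this toy network the hub has weight 1.0 and each leaf has weight 0.25, so "spam" wins with 1.25 against 0.75 even though an unweighted majority vote would pick "ham".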
European Conference on Machine Learning | 2016
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck; Bernhard Pfahringer; Albert Bifet
The ubiquity of data streams has been encouraging the development of new incremental and adaptive learning algorithms. Data stream learners must be fast, memory-bounded, but mainly, tailored to adapt to possible changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a so far nearly neglected type of drift: feature drifts. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the learning task. In this paper we (i) provide insights into how the relevance of features can be tracked as a stream progresses according to information-theoretical Symmetrical Uncertainty; and (ii) show how it can be used to boost two learning schemes: Naive Bayesian and k-Nearest Neighbor. Furthermore, we investigate the usage of these two new dynamically weighted learners as prediction models in the leaves of the Hoeffding Adaptive Tree classifier. Results show improvements in accuracy (an average of 10.69 % for k-Nearest Neighbor, 6.23 % for Naive Bayes and 4.42 % for Hoeffding Adaptive Trees) in both synthetic and real-world datasets at the expense of a bounded increase in both memory consumption and processing time.
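As a rough illustration of the relevance measure named above, Symmetrical Uncertainty between a discrete feature X and the class Y is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which lies in [0, 1]. The sketch below computes it over a finite batch of values. Applying it to a window of recent stream instances is an assumption made here for illustration; the abstract does not describe how the paper tracks it incrementally, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y))."""
    h_x = entropy(xs)
    h_y = entropy(ys)
    if h_x + h_y == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


labels = [0, 0, 1, 1]
relevant_feature = [0, 0, 1, 1]    # determines the label completely
irrelevant_feature = [0, 1, 0, 1]  # independent of the label

su_relevant = symmetrical_uncertainty(relevant_feature, labels)
su_irrelevant = symmetrical_uncertainty(irrelevant_feature, labels)
```

A feature whose score falls from near 1 toward 0 across successive windows would be one that has ceased to be relevant, which is the situation the abstract calls a feature drift.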
Information Systems | 2016
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck; Jean-Paul A. Barthès
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that streams undergo both concept drifts and evolutions, which must be taken into account by the clustering algorithm, allowing incremental clustering updates. In this paper we present the Social Network Clusterer Stream+ (SNCStream+). SNCStream+ tackles the data stream clustering problem as a network formation and evolution problem, where instances and micro-clusters form clusters based on homophily. Our proposal has its parameters analyzed and is evaluated on a broad set of problems against literature baselines. Results show that SNCStream+ achieves superior clustering quality (CMM), and feasible processing time and memory space usage when compared to the original SNCStream and other proposals in the literature.
International Conference on Neural Information Processing | 2015
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck
Learning from data streams requires efficient algorithms capable of updating a model as new instances arrive. Data streams are by definition unbounded sequences of data that are possibly non-stationary, i.e., they may undergo changes in data distribution, a phenomenon named concept drift. Concept drifts force streaming learning algorithms to detect and adapt to such changes in order to maintain feasible accuracy over time. Nonetheless, most works presented in the literature do not account for a specific kind of drift: feature drift. Feature drifts occur whenever the relevance of an arbitrary attribute changes over time, also impacting the concept to be learned. In this paper we (i) verify the occurrence of feature drift in a publicly available dataset, (ii) present a synthetic data stream generator capable of performing feature drifts and (iii) analyze the impact of this type of drift on stream learning algorithms, highlighting that there is room and a need for dynamic feature selection strategies for data streams.
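To make the idea of a synthetic stream with feature drift concrete, here is a minimal sketch. It is not the generator presented in the paper: the function `feature_drift_stream`, its parameters, and the rule that the label simply copies one binary feature are all assumptions chosen for illustration.

```python
import random


def feature_drift_stream(n_instances, drift_at, n_features=3, seed=42):
    """Yield (features, label) pairs whose relevant feature changes once.

    Before position `drift_at` the label copies feature 0; from that
    position on it copies feature 1. All other features are noise.
    """
    rng = random.Random(seed)
    for t in range(n_instances):
        x = [rng.randint(0, 1) for _ in range(n_features)]
        relevant = 0 if t < drift_at else 1
        yield x, x[relevant]


data = list(feature_drift_stream(n_instances=200, drift_at=100))
before, after = data[:100], data[100:]
```

A learner that keeps relying on feature 0 after instance 100 would drop to roughly chance accuracy on this stream, which is the kind of impact the abstract sets out to analyze.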
International Conference on Neural Information Processing | 2015
Jean Paul Barddal; Heitor Murilo Gomes; Fabrício Enembreck
Data stream mining is an active area of research that poses challenging research problems. In recent years, a variety of data stream clustering algorithms have been proposed to perform unsupervised learning using a two-step framework. Additionally, dealing with non-stationary, unbounded data streams requires the development of algorithms capable of performing fast and incremental clustering, addressing time and memory limitations without jeopardizing clustering quality. In this paper we present CNDenStream, a one-step data stream clustering algorithm capable of finding non-hyper-spherical clusters which, in contrast to other data stream clustering algorithms, is able to maintain updated clusters after the arrival of each instance by using a complex network construction and evolution model based on homophily. Empirical studies show that CNDenStream is able to surpass other algorithms in clustering quality and requires a feasible amount of resources when compared to other algorithms presented in the literature.