Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where João Gama is active.

Publication


Featured research published by João Gama.


Brazilian Symposium on Artificial Intelligence | 2004

Learning with Drift Detection

João Gama; Pedro Medas; Gladys Castillo; Pedro Pereira Rodrigues

Most of the work in machine learning assumes that examples are generated at random according to some stationary probability distribution. In this work we study the problem of learning when the distribution that generates the examples changes over time. We present a method for detecting changes in the probability distribution of examples. The idea behind the drift detection method is to control the online error rate of the algorithm. The training examples are presented in sequence. When a new training example is available, it is classified using the current model. Statistical theory guarantees that while the distribution is stationary the error will decrease; when the distribution changes, the error will increase. The method controls the trace of the online error of the algorithm. For the current context we define a warning level and a drift level. A new context is declared if, in a sequence of examples, the error increases, reaching the warning level at example k_w and the drift level at example k_d. This is an indication of a change in the distribution of the examples. The algorithm then learns a new model using only the examples seen since k_w. The method was tested on a set of eight artificial datasets and a real-world dataset, using three learning algorithms: a perceptron, a neural network, and a decision tree. The experimental results show good performance in detecting drift and in learning the new concept. We also observe that the method is independent of the learning algorithm.
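
A minimal sketch of the error-rate monitoring idea described in the abstract, assuming a stream of 0/1 prediction outcomes; the two- and three-deviation thresholds and all names here are illustrative rather than taken verbatim from the paper. The detector remembers the lowest error rate (and its standard deviation) observed in the current context and signals a warning or a drift when the current rate degrades past those levels.

    import math

    class DriftDetector:
        """Sketch of an error-rate drift monitor: track the online error rate,
        remember its minimum, and raise warning/drift signals when it degrades."""

        def __init__(self, warning_sigma=2.0, drift_sigma=3.0):
            self.warning_sigma = warning_sigma
            self.drift_sigma = drift_sigma
            self.reset()

        def reset(self):
            self.n = 0          # examples seen in the current context
            self.errors = 0     # misclassifications in the current context
            self.p_min = float("inf")
            self.s_min = float("inf")

        def update(self, misclassified):
            """Feed one prediction outcome (True if the model erred)."""
            self.n += 1
            self.errors += int(misclassified)
            p = self.errors / self.n                   # online error rate
            s = math.sqrt(p * (1.0 - p) / self.n)      # its standard deviation
            if p + s < self.p_min + self.s_min:
                self.p_min, self.s_min = p, s          # best state seen so far
            if p + s > self.p_min + self.drift_sigma * self.s_min:
                return "drift"
            if p + s > self.p_min + self.warning_sigma * self.s_min:
                return "warning"
            return "in-control"

On a drift signal, the caller would retrain the model on the examples buffered since the warning was raised (example k_w) and call reset() to start a new context.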


ACM Computing Surveys | 2014

A survey on concept drift adaptation

João Gama; Indrė Žliobaitė; Albert Bifet; Mykola Pechenizkiy; Abdelhamid Bouchachia

Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this article we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss the evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to concept drift adaptation for researchers, industry analysts, and practitioners.


Intelligent Data Analysis | 2009

Knowledge discovery from data streams

João Gama; Auroop R. Ganguly; Olufemi A. Omitaomu; Ranga Raju Vatsavai; Mohamed Medhat Gaber

Since the beginning of the Internet age and the increased use of ubiquitous computing devices, the large volume and continuous flow of distributed data have imposed new constraints on the design of learning algorithms. Exploring how to extract knowledge structures from evolving and time-changing data, Knowledge Discovery from Data Streams presents a coherent overview of state-of-the-art research in learning from data streams. The book covers the fundamentals that are imperative to understanding data streams and describes important applications, such as TCP/IP traffic, GPS data, sensor networks, and customer click streams. It also addresses several challenges of data mining in the future, when stream mining will be at the core of many applications. These challenges involve designing useful and efficient data mining solutions applicable to real-world problems. In the appendix, the author includes examples of publicly available software and online datasets. This practical, up-to-date book focuses on the new requirements of the next generation of data mining. Although the concepts presented in the text are mainly about data streams, they are also valid for different areas of machine learning and data mining.


Archive | 2005

Knowledge Discovery in Databases: PKDD 2005

Alípio Mário Jorge; Luís Torgo; Pavel Brazdil; Rui Camacho; João Gama

Invited Talks.- Data Analysis in the Life Sciences - Sparking Ideas -.- Machine Learning for Natural Language Processing (and Vice Versa?).- Statistical Relational Learning: An Inductive Logic Programming Perspective.- Recent Advances in Mining Time Series Data.- Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce.- Data Streams and Data Synopses for Massive Data Sets.- Long Papers.- k-Anonymous Patterns.- Interestingness is Not a Dichotomy: Introducing Softness in Constrained Pattern Mining.- Generating Dynamic Higher-Order Markov Models in Web Usage Mining.- Tree 2 - Decision Trees for Tree Structured Data.- Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results.- Cluster Aggregate Inequality and Multi-level Hierarchical Clustering.- Ensembles of Balanced Nested Dichotomies for Multi-class Problems.- Protein Sequence Pattern Mining with Constraints.- An Adaptive Nearest Neighbor Classification Algorithm for Data Streams.- Support Vector Random Fields for Spatial Classification.- Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication.- A Correspondence Between Maximal Complete Bipartite Subgraphs and Closed Patterns.- Improving Generalization by Data Categorization.- Mining Model Trees from Spatial Data.- Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification.- Mining Paraphrases from Self-anchored Web Sentence Fragments.- M2SP: Mining Sequential Patterns Among Several Dimensions.- A Systematic Comparison of Feature-Rich Probabilistic Classifiers for NER Tasks.- Knowledge Discovery from User Preferences in Conversational Recommendation.- Unsupervised Discretization Using Tree-Based Density Estimation.- Weighted Average Pointwise Mutual Information for Feature Selection in Text Categorization.- Non-stationary Environment Compensation Using Sequential EM Algorithm for Robust Speech Recognition.- Hybrid Cost-Sensitive Decision Tree.- Characterization of Novel HIV Drug Resistance Mutations Using Clustering, Multidimensional Scaling and SVM-Based Feature Ranking.- Object Identification with Attribute-Mediated Dependences.- Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids.- Using Inductive Logic Programming for Predicting Protein-Protein Interactions from Multiple Genomic Data.- ISOLLE: Locally Linear Embedding with Geodesic Distance.- Active Sampling for Knowledge Discovery from Biomedical Data.- A Multi-metric Index for Euclidean and Periodic Matching.- Fast Burst Correlation of Financial Data.- A Propositional Approach to Textual Case Indexing.- A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston.- Efficient Classification from Multiple Heterogeneous Databases.- A Probabilistic Clustering-Projection Model for Discrete Data.- Short Papers.- Collaborative Filtering on Data Streams.- The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-Based FIM Algorithms.- Community Mining from Multi-relational Networks.- Evaluating the Correlation Between Objective Rule Interestingness Measures and Real Human Interest.- A Kernel Based Method for Discovering Market Segments in Beef Meat.- Corpus-Based Neural Network Method for Explaining Unknown Words by WordNet Senses.- Segment and Combine Approach for Non-parametric Time-Series Classification.- Producing Accurate Interpretable Clusters from High-Dimensional Data.- Stress-Testing Hoeffding Trees.- Rank Measures for Ordering.- 
Dynamic Ensemble Re-Construction for Better Ranking.- Frequency-Based Separation of Climate Signals.- Efficient Processing of Ranked Queries with Sweeping Selection.- Feature Extraction from Mass Spectra for Classification of Pathological States.- Numbers in Multi-relational Data Mining.- Testing Theories in Particle Physics Using Maximum Likelihood and Adaptive Bin Allocation.- Improved Naive Bayes for Extremely Skewed Misclassification Costs.- Clustering and Prediction of Mobile User Routes from Cellular Data.- Elastic Partial Matching of Time Series.- An Entropy-Based Approach for Generating Multi-dimensional Sequential Patterns.- Visual Terrain Analysis of High-Dimensional Datasets.- An Auto-stopped Hierarchical Clustering Algorithm for Analyzing 3D Model Database.- A Comparison Between Block CEM and Two-Way CEM Algorithms to Cluster a Contingency Table.- An Imbalanced Data Rule Learner.- Improvements in the Data Partitioning Approach for Frequent Itemsets Mining.- On-Line Adaptive Filtering of Web Pages.- A Bi-clustering Framework for Categorical Data.- Privacy-Preserving Collaborative Filtering on Vertically Partitioned Data.- Indexed Bit Map (IBM) for Mining Frequent Sequences.- STochFS: A Framework for Combining Feature Selection Outcomes Through a Stochastic Process.- Speeding Up Logistic Model Tree Induction.- A Random Method for Quantifying Changing Distributions in Data Streams.- Deriving Class Association Rules Based on Levelwise Subspace Clustering.- An Incremental Algorithm for Mining Generators Representation.- Hybrid Technique for Artificial Neural Network Architecture and Weight Optimization.


Knowledge Discovery and Data Mining | 2003

Accurate decision trees for mining high-speed data streams

João Gama; Ricardo Rocha; Pedro Medas

In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. In this paper we extend the VFDT system in two directions: the ability to deal with continuous data and the use of more powerful classification techniques at tree leaves. The proposed system, VFDTc, can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is its ability to obtain performance similar to that of a standard decision tree algorithm even for medium-sized datasets; this is relevant because of the any-time property. We study the behaviour of VFDTc on different problems and demonstrate its utility on large and medium-sized datasets. A bias-variance analysis shows that VFDTc, in comparison to C4.5, is able to reduce the variance component.
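
The split decision underlying VFDT-style learners rests on the Hoeffding bound; the sketch below, whose parameter names and tie-breaking constant are illustrative rather than taken from the paper, shows how a leaf decides that it has seen enough examples to commit to its best split attribute.

    import math

    def hoeffding_bound(value_range, delta, n):
        """Epsilon such that, with probability 1 - delta, the true mean of a
        random variable with the given range lies within epsilon of the mean
        observed over n independent examples."""
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(best_gain, second_best_gain, n_examples,
                     value_range=1.0, delta=1e-6, tie_threshold=0.05):
        """Commit to the best split attribute once its observed advantage over
        the runner-up exceeds the Hoeffding bound, or once the bound is so
        small that the choice no longer matters (tie breaking). For information
        gain, value_range would be log2(number of classes)."""
        eps = hoeffding_bound(value_range, delta, n_examples)
        return (best_gain - second_best_gain > eps) or (eps < tie_threshold)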


Machine Learning | 2000

Cascade Generalization

João Gama; Pavel Brazdil

Using multiple classifiers to increase learning accuracy is an active research area. In this paper we present two related methods for merging classifiers. The first method, Cascade Generalization, couples classifiers loosely; it belongs to the family of stacking algorithms. The basic idea of Cascade Generalization is to use the set of classifiers sequentially, at each step extending the original data by inserting new attributes. The new attributes are derived from the class-probability distribution given by a base classifier. This constructive step extends the representational language for the high-level classifiers, relaxing their bias. The second method exploits tight coupling of classifiers by applying Cascade Generalization locally: at each iteration of a divide-and-conquer algorithm, the instance space is reconstructed by adding new attributes, each representing the probability that an example belongs to a class, as given by a base classifier. We have implemented three Local Generalization Algorithms. The first merges a linear discriminant with a decision tree, the second merges a naive Bayes with a decision tree, and the third merges a linear discriminant and a naive Bayes with a decision tree. All the algorithms show an increase in performance when compared with the corresponding single models. Cascade also outperforms other methods for combining classifiers, such as Stacked Generalization, and competes well against Boosting at statistically significant confidence levels.
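
A rough sketch of the loosely coupled variant using scikit-learn estimators; the particular combination of a naive Bayes base learner feeding a decision tree is just one illustrative choice, and the helper names are assumptions. The base classifier's class-probability distribution is appended to each example before the next-level classifier is trained on the extended data.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def cascade_extend(base_model, X):
        """Append the base classifier's class-probability distribution to each
        example, extending the representation seen by the next-level learner."""
        return np.hstack([X, base_model.predict_proba(X)])

    def fit_cascade(X_train, y_train):
        """Two-level cascade: naive Bayes probabilities feed a decision tree."""
        base = GaussianNB().fit(X_train, y_train)
        top = DecisionTreeClassifier().fit(cascade_extend(base, X_train), y_train)
        return base, top

    def predict_cascade(base, top, X):
        return top.predict(cascade_extend(base, X))

The tightly coupled, local variant would apply the same attribute extension inside each node of a divide-and-conquer learner rather than once over the whole dataset.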


Knowledge Discovery and Data Mining | 2009

Issues in evaluation of stream learning algorithms

João Gama; Raquel Sebastião; Pedro Pereira Rodrigues

Learning from data streams is a research area of increasing importance, and several stream learning algorithms have been developed. Most of them learn decision models that continuously evolve over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet conveniently addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. There are no gold standards for assessing performance in non-stationary environments. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of predictive sequential methods for error estimation - the prequential error. The prequential error allows us to monitor the evolution of the performance of models that evolve over time; nevertheless, it is known to be a pessimistic estimator in comparison to holdout estimates. To obtain more reliable estimators we need some forgetting mechanism. Two viable alternatives are sliding windows and fading factors. We observe that the prequential error converges to a holdout estimator when estimated over a sliding window or using fading factors. We present illustrative examples of the use of prequential error estimators, using fading factors, for the tasks of: i) assessing the performance of a learning algorithm; ii) comparing learning algorithms; iii) hypothesis testing using the McNemar test; and iv) change detection using the Page-Hinkley test. In these tasks, the prequential error estimated using fading factors provides reliable estimates. In comparison to sliding windows, fading factors are faster and memoryless, a requirement for streaming applications. This paper is a contribution to the discussion of good practices for performance assessment when learning dynamic models that evolve over time.
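
The fading-factor estimator defended here admits a constant-memory implementation; the sketch below assumes a 0/1 loss per example, and the default value of the fading factor is illustrative.

    class FadingPrequentialError:
        """Prequential (test-then-train) error with a fading factor: recent
        losses weigh more than old ones, giving a memoryless, O(1) estimator
        suitable for streams. alpha close to 1 forgets slowly."""

        def __init__(self, alpha=0.999):
            self.alpha = alpha
            self.s = 0.0   # faded sum of losses
            self.n = 0.0   # faded count of examples

        def update(self, loss):
            """loss is e.g. 0/1 for classification; returns the current estimate."""
            self.s = loss + self.alpha * self.s
            self.n = 1.0 + self.alpha * self.n
            return self.s / self.n

Each arriving example is first used to test the model (producing the loss fed to update) and only then to train it, which is what makes the estimate prequential.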


ACM Computing Surveys | 2013

Data stream clustering: A survey

Jonathan de Andrade Silva; Elaine R. Faria; Rodrigo C. Barros; Eduardo R. Hruschka; André Carlos Ponce Leon Ferreira de Carvalho; João Gama

Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories is also provided to help researchers and practitioners. Finally, some important issues and open questions that can be the subject of future research are discussed.


IEEE Transactions on Knowledge and Data Engineering | 2008

Hierarchical Clustering of Time-Series Data Streams

Pedro Pereira Rodrigues; João Gama; João Pedro Pedroso

This paper presents and analyzes an incremental system for clustering streaming time series. The Online Divisive-Agglomerative Clustering (ODAC) system continuously maintains a tree-like hierarchy of clusters that evolves with the data, using a top-down strategy. The splitting criterion is a correlation-based dissimilarity measure among time series, splitting each node by the farthest pair of streams. The system also uses a merge operator that reaggregates a previously split node in order to react to changes in the correlation structure between time series. The split and merge operators are triggered in response to changes in the diameters of existing clusters, under the assumption that, in stationary environments, expanding the structure leads to a decrease in the diameters of the clusters. The system is designed to process thousands of data streams that flow at a high rate. The main features of the system include update time and memory consumption that do not depend on the number of examples in the stream; moreover, the time and memory required to process an example decrease whenever the cluster structure expands. Experimental results on artificial and real data assess the processing qualities of the system, suggesting competitive performance on clustering streaming time series and exploring its ability to deal with concept drift.
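
A sketch of the constant-memory sufficient statistics that a correlation-based dissimilarity between two streams requires; the exact bookkeeping in ODAC may differ, and mapping the correlation through sqrt((1 - corr) / 2) is one common choice for turning it into a distance.

    import math

    class StreamCorrelation:
        """Incrementally maintained sufficient statistics for the Pearson
        correlation between two time series. Update cost and memory are
        constant per example; assumes non-constant streams with n >= 2."""

        def __init__(self):
            self.n = 0
            self.sum_a = self.sum_b = 0.0
            self.sum_a2 = self.sum_b2 = 0.0
            self.sum_ab = 0.0

        def update(self, a, b):
            self.n += 1
            self.sum_a += a;  self.sum_b += b
            self.sum_a2 += a * a;  self.sum_b2 += b * b
            self.sum_ab += a * b

        def correlation(self):
            num = self.sum_ab - self.sum_a * self.sum_b / self.n
            den_a = self.sum_a2 - self.sum_a ** 2 / self.n
            den_b = self.sum_b2 - self.sum_b ** 2 / self.n
            return num / math.sqrt(den_a * den_b)

        def dissimilarity(self):
            """Map correlation in [-1, 1] to a distance in [0, 1]:
            perfectly correlated streams are at distance 0."""
            return math.sqrt((1.0 - self.correlation()) / 2.0)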


Machine Learning | 2013

On evaluating stream learning algorithms

João Gama; Raquel Sebastião; Pedro Pereira Rodrigues

Most streaming decision models evolve continuously over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet convincingly addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of prequential error with forgetting mechanisms to provide reliable error estimators. We prove that, in stationary data and for consistent learning algorithms, the holdout estimator, the prequential error, and the prequential error estimated over a sliding window or using fading factors all converge to the Bayes error. The use of prequential error with forgetting mechanisms proves advantageous in assessing performance and in comparing stream learning algorithms. It is also worthwhile to use the proposed methods for hypothesis testing and for change detection. In a set of experiments in drift scenarios, we evaluate the ability of a standard change detection algorithm to detect change using three prequential error estimators. These experiments show that the use of forgetting mechanisms (sliding windows or fading factors) is required for fast and efficient change detection. In comparison to sliding windows, fading factors are faster and memoryless, both important requirements for streaming applications. Overall, this paper is a contribution to a discussion on best practice for performance assessment when learning is a continuous process and the decision models are dynamic and evolve over time.
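
The "standard change detection algorithm" is not named in this abstract; the Page-Hinkley test mentioned in the KDD 2009 paper above is one standard choice, and it can be fed a prequential error estimate directly. The sketch below uses illustrative parameter values.

    class PageHinkley:
        """Page-Hinkley test for detecting an increase in the mean of a
        monitored signal (e.g. a prequential error estimate). delta tolerates
        small fluctuations; lam trades detection delay for false alarms."""

        def __init__(self, delta=0.005, lam=50.0):
            self.delta = delta
            self.lam = lam
            self.n = 0
            self.mean = 0.0
            self.cum = 0.0       # cumulative deviation m_T
            self.cum_min = 0.0   # running minimum M_T

        def update(self, x):
            """Feed one observation; returns True when a change is signalled."""
            self.n += 1
            self.mean += (x - self.mean) / self.n      # running mean of the signal
            self.cum += x - self.mean - self.delta
            self.cum_min = min(self.cum_min, self.cum)
            return self.cum - self.cum_min > self.lam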

Collaboration


Dive into João Gama's collaboration.
