Sérgio D. Canuto
Universidade Federal de Minas Gerais
Publications
Featured research published by Sérgio D. Canuto.
Web Search and Data Mining | 2016
Sérgio D. Canuto; Marcos André Gonçalves; Fabrício Benevenuto
In this paper we address the problem of automatically learning to classify the sentiment of short messages/reviews by exploiting information derived from meta-level features, i.e., features derived primarily from the original bag-of-words representation. We propose new meta-level features especially designed for the sentiment analysis of short messages, such as: (i) information derived from the sentiment distribution among the k nearest neighbors of a given short test document x, (ii) the distribution of distances of x to its neighbors, and (iii) the document polarity of these neighbors given by unsupervised lexicon-based methods. Our approach is also capable of exploiting information from the neighborhood of document x in (highly noisy) data obtained from 1.6 million Twitter messages with emoticons. The set of proposed features is capable of transforming the original feature space into a new one, potentially smaller and more informed. Experiments performed with a substantial number of datasets (nineteen) demonstrate that the effectiveness of the proposed sentiment-based meta-level features is not only superior to the traditional bag-of-words representation (by up to 16%) but also superior, in most cases, to state-of-the-art meta-level features previously proposed in the literature for text classification tasks that do not take into account some idiosyncrasies of sentiment analysis. Our proposal is also largely superior to the best lexicon-based methods as well as to supervised combinations of them. In fact, the proposed approach is the only one to produce the best results in all tested datasets in all scenarios.
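As an illustration, features (i) and (ii) above can be sketched in a few lines. This is a minimal toy version, assuming cosine distance over dense bag-of-words vectors; the function name and the tiny corpus are hypothetical, and the paper's full feature set also includes the lexicon-based polarity of the neighbors:

```python
import numpy as np

def knn_meta_features(x, train_X, train_y, k=3):
    """Meta-level features for test document x: (i) the class (sentiment)
    distribution among its k nearest training neighbors and (ii) the sorted
    distances of x to those neighbors."""
    # Cosine distance of x to every training document.
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x)
    dists = 1.0 - (train_X @ x) / np.where(norms == 0, 1.0, norms)
    nn = np.argsort(dists)[:k]                       # indices of the k nearest
    classes = np.unique(train_y)
    # (i) fraction of neighbors belonging to each class
    class_dist = np.array([(train_y[nn] == c).mean() for c in classes])
    # (ii) distances of x to the neighbors, nearest first
    return np.concatenate([class_dist, np.sort(dists[nn])])

# Tiny toy corpus: two "positive" (1) and two "negative" (0) term vectors.
X = np.array([[2., 0., 1.], [3., 0., 0.], [0., 2., 1.], [0., 3., 0.]])
y = np.array([1, 1, 0, 0])
feats = knn_meta_features(np.array([2., 0., 0.]), X, y, k=2)
print(feats[:2])   # class distribution among the 2 nearest neighbors
```

The resulting vector has one entry per class plus k distance entries, so its dimensionality is tiny compared to the original vocabulary-sized space.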
IEEE Transactions on Knowledge and Data Engineering | 2015
Guilherme Dal Bianco; Renata de Matos Galante; Marcos André Gonçalves; Sérgio D. Canuto; Carlos A. Heuser
The data deduplication task has attracted a considerable amount of attention from the research community in order to provide effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets. T3S selects the most representative pairs by following two stages. In the first stage, we propose a strategy to produce balanced subsets of candidate pairs for labeling. In the second stage, an active selection is incrementally invoked to remove the redundant pairs in the subsets created in the first stage in order to produce an even smaller and more informative training set. This training set is effectively used both to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that T3S is able to reduce the labeling effort substantially while achieving a competitive or superior matching quality when compared with state-of-the-art deduplication methods in large datasets.
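The first stage described above can be pictured with a small sketch. This is a hypothetical simplification (function name, level scheme and toy data are illustrative, not the paper's exact procedure): candidate pairs are bucketed into equal-width similarity levels, and the same number of pairs is drawn from each level, so the set offered for labeling covers both clearly-matching and clearly-non-matching regions:

```python
import random

def balanced_subsets(pairs, sims, n_levels=4, per_level=2, seed=0):
    """Stage-1 sketch of a T3S-like selection: bucket candidate pairs by
    similarity score (assumed in [0, 1]) into n_levels equal-width levels,
    then sample per_level pairs from each level."""
    rng = random.Random(seed)
    levels = {i: [] for i in range(n_levels)}
    for pair, s in zip(pairs, sims):
        lvl = min(int(s * n_levels), n_levels - 1)   # which level s falls into
        levels[lvl].append(pair)
    sample = []
    for lvl in range(n_levels):
        pool = levels[lvl]
        sample.extend(rng.sample(pool, min(per_level, len(pool))))
    return sample

pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h"), ("i", "j"), ("k", "l")]
sims = [0.05, 0.10, 0.40, 0.60, 0.90, 0.95]
sel = balanced_subsets(pairs, sims, n_levels=2, per_level=2)
print(sel)
```

Stage 2 would then prune redundant pairs from these subsets via active selection, which is where the labeling savings reported above come from.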
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2017
Raphael R. Campos; Sérgio D. Canuto; Thiago Salles; Clebson C. A. de Sá; Marcos André Gonçalves
Random Forest (RF) is one of the most successful strategies for automated classification tasks. Motivated by this success, recently proposed RF-based classification approaches leverage the central RF idea of aggregating a large number of low-correlated trees, which are inherently parallelizable and provide exceptional generalization capabilities. In this context, this work brings several new contributions to this line of research. First, we propose a new RF-based strategy (BERT) that applies the boosting technique in bags of extremely randomized trees. Second, we empirically demonstrate that this new strategy, as well as the recently proposed BROOF and LazyNN_RF classifiers, complement each other, motivating us to stack them to produce an even more effective classifier. To the best of our knowledge, this is the first strategy to effectively combine the three main ensemble strategies: stacking, bagging (the cornerstone of RFs) and boosting. Finally, we exploit the efficient and unbiased stacking strategy based on out-of-bag (OOB) samples to considerably speed up the very costly training process of the stacking procedure. Our experiments on several datasets covering two high-dimensional and noisy domains, topic and sentiment classification, provide strong evidence in favor of the benefits of our RF-based solutions. We show that BERT is among the top performers in the vast majority of analyzed cases, while retaining the unique benefits of RF classifiers (explainability, parallelization, ease of parameterization).
We also show that stacking only the recently proposed RF-based classifiers and BERT using our OOB-based strategy is not only significantly faster (up to six times) than recently proposed stacking strategies but also much more effective, with gains of up to 21% and 17% in MacroF1 and MicroF1, respectively, over the best base method, and of 5% and 6% over a stacking of traditional methods, performing no worse than a complete stacking of methods at a much lower computational cost.
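The OOB-based speedup idea can be sketched as follows. This is an assumed simplification (a 1-nearest-neighbor "weak learner" over 1-D data stands in for the real tree ensembles): each instance's meta-feature is the averaged prediction of only those learners whose bootstrap bag did not contain it, which yields unbiased stacking inputs without extra cross-validation passes:

```python
import numpy as np

rng = np.random.default_rng(0)

def oob_meta_features(X, y, n_bags=25):
    """For each training instance, average the predictions of the learners
    trained on bags that left it out (its out-of-bag predictions). These
    values feed the stacker in place of cross-validated predictions."""
    n = len(y)
    meta = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_bags):
        bag = rng.integers(0, n, size=n)             # bootstrap sample
        oob = np.setdiff1d(np.arange(n), bag)        # instances left out
        for i in oob:
            # Toy "learner": 1-nearest-neighbor vote fit on the bag.
            j = bag[np.argmin(np.abs(X[bag] - X[i]))]
            meta[i] += y[j]
            counts[i] += 1
    return meta / np.maximum(counts, 1)              # OOB-averaged prediction

# Two well-separated clusters with labels 0 and 1.
X = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
y = np.array([0, 0, 0, 1, 1, 1])
meta = oob_meta_features(X, y)
print(meta.round(2))
```

Because every instance is out-of-bag for roughly a third of the bags, each meta-feature is estimated from learners that never saw that instance, avoiding the optimistic bias of in-sample predictions.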
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015
Sérgio D. Canuto; Marcos André Gonçalves; Wisllay M. V. dos Santos; Thierson Couto Rosa; Wellington Santos Martins
The unprecedented growth of available data has stimulated the development of new methods for organizing and extracting useful knowledge from this immense amount of data. Automatic Document Classification (ADC) is one such method, which uses machine learning techniques to build models capable of automatically associating documents with well-defined semantic classes. ADC is the basis of many important applications such as language identification, sentiment analysis, recommender systems and spam filtering, among others. Recently, the use of meta-features has been shown to substantially improve the effectiveness of ADC algorithms. In particular, the use of meta-features that combine local information (through kNN-based features) and global information (through category centroids) has produced promising results. However, the generation of these meta-features is very costly in terms of both memory consumption and runtime, since the kNN algorithm must be invoked constantly. We take advantage of the current manycore GPU architecture and present a massively parallel version of the kNN algorithm for highly dimensional and sparse datasets (which is the case for ADC). Our experimental results show that we can obtain speedups of up to 15x while reducing memory consumption by more than 5,000x when compared to a state-of-the-art parallel baseline. This opens up the possibility of applying meta-feature-based classification to large collections of documents, which would otherwise take too much time or require an expensive computational platform.
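The memory argument hinges on exploiting sparsity. A sequential sketch of the idea (hypothetical simplification of the parallel GPU design, with illustrative data): score only the documents that share at least one term with the query via an inverted index, so storage is proportional to the number of nonzeros rather than to |docs| x |vocabulary|:

```python
from collections import defaultdict

def sparse_knn(query, docs, k=2):
    """kNN over sparse bag-of-words vectors (dicts term -> weight) using an
    inverted index: accumulate partial dot products term by term, touching
    only documents that actually contain a query term."""
    index = defaultdict(list)                 # term -> [(doc_id, weight)]
    for d, vec in enumerate(docs):
        for term, w in vec.items():
            index[term].append((d, w))
    scores = defaultdict(float)
    for term, qw in query.items():            # accumulate partial dot products
        for d, w in index[term]:
            scores[d] += qw * w
    # doc ids of the k highest-scoring (most similar) documents
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [{"good": 2.0, "movie": 1.0}, {"bad": 3.0}, {"good": 1.0, "plot": 1.0}]
print(sparse_knn({"good": 1.0}, docs, k=2))   # → [0, 2]
```

On a GPU, the per-term accumulation loop is what gets massively parallelized; the data layout stays sparse, which is where the reported memory savings come from.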
Conference on Information and Knowledge Management | 2014
Sérgio D. Canuto; Thiago Salles; Marcos André Gonçalves; Leonardo C. da Rocha; Gabriel Ramos; Luiz Alberto Oliveira Gonçalves; Thierson Couto Rosa; Wellington Santos Martins
This paper addresses the problem of automatically learning to classify texts by exploiting information derived from meta-level features (i.e., features derived from the original bag-of-words representation). We propose new meta-level features derived from the class distribution, the entropy and the within-class cohesion observed in the k nearest neighbors of a given test document x, as well as from the distribution of distances of x to these neighbors. The set of proposed features is capable of transforming the original feature space into a new one, potentially smaller and more informed. Experiments performed with several standard datasets demonstrate that the effectiveness of the proposed meta-level features is not only much superior to the traditional bag-of-words representation but also superior to other state-of-the-art meta-level features previously proposed in the literature. Moreover, the proposed meta-features can be computed about three times faster than the existing meta-level ones, making our proposal much more scalable. We also demonstrate that the combination of our meta-features and the original set of features produces significant improvements when compared to each feature set used in isolation.
Information Sciences | 2017
Thiago N. C. Cardoso; Rodrigo M. Silva; Sérgio D. Canuto; Mirella M. Moro; Marcos André Gonçalves
We introduce a new paradigm for Ranked Batch-Mode Active Learning. It relaxes traditional Batch-Mode Active Learning (BMAL) methods by generating a query whose answer is an optimized ranked list of instances to be labeled, according to some quality criteria, allowing batches of arbitrarily large sizes. This new paradigm avoids the main problem of traditional BMAL, namely the frequent stops for manual labeling, reconciliation and model reconstruction. In this article, we formally define this problem and introduce a framework that iteratively and effectively builds the ranked list. Our experimental evaluation shows that our proposed Ranked Batch approach significantly reduces the number of algorithm executions (and, consequently, the manual labeling delays) while maintaining or even improving the quality of the selected instances. In fact, when using only unlabeled data, our results are much better than those produced by pool-based batch-mode active learning methods that rely on already labeled seeds or update their models with labeled instances, with gains of up to 25% in MacroF1. Finally, our solutions are also more effective than density-sensitive active learning methods in most of the envisioned scenarios, as demonstrated by our experiments.
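The iterative ranked-list construction can be sketched greedily. This is an assumed scoring scheme (the weighting, the toy data and the function name are illustrative, not the paper's exact criteria): at each step, pick the unlabeled instance that best balances model uncertainty against dissimilarity to instances already placed in the batch, so a single query yields a diverse, ordered labeling list:

```python
import numpy as np

def ranked_batch(X, uncertainty, batch_size=3, alpha=0.5):
    """Greedily build a ranked batch: score = alpha * uncertainty +
    (1 - alpha) * distance to the closest already-selected instance."""
    chosen = []
    remaining = list(range(len(X)))
    while remaining and len(chosen) < batch_size:
        best, best_score = None, -np.inf
        for i in remaining:
            if chosen:
                # smallest distance to anything already in the batch
                diversity = min(np.linalg.norm(X[i] - X[j]) for j in chosen)
            else:
                diversity = 0.0   # empty batch: rank by uncertainty alone
            score = alpha * uncertainty[i] + (1 - alpha) * diversity
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return chosen                 # ordered list of instances to label

# Two clusters; instance 0 is most uncertain, instance 2 is in the far cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
unc = np.array([0.9, 0.8, 0.7, 0.6])
batch = ranked_batch(X, unc, batch_size=2)
print(batch)   # → [0, 2]
```

Because the list is ordered, annotators can stop labeling at any prefix, which is exactly what removes the frequent stop-and-retrain cycles of traditional BMAL.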
Conference on Information and Knowledge Management | 2018
Felipe Viegas; Washington Luiz; Christian Gomes; Amir Khatibi; Sérgio D. Canuto; Fernando Mourão; Thiago Salles; Leonardo C. da Rocha; Marcos André Gonçalves
In this paper, we advance the state-of-the-art in topic modeling by means of the design and development of a novel (semi-formal) general topic modeling framework. The novel contributions of our solution include: (i) the introduction of new semantically-enhanced data representations for topic modeling based on pooling, and (ii) the proposal of a novel topic extraction strategy, ASToC, that solves the difficulty of representing topics in our semantically-enhanced information space. In our extensive experimental evaluation, covering 12 datasets and 12 state-of-the-art baselines and totaling 108 tests, we outperform the alternatives (with a few ties) in almost 100 cases, with gains of more than 50% over the best baselines (and of up to 80% over some runner-ups). We provide qualitative and quantitative statistical analyses of why our solutions work so well. Finally, we show that our method is able to improve document representation in automatic text classification.
International Conference on Theory and Practice of Digital Libraries | 2017
Gustavo Oliveira de Siqueira; Sérgio D. Canuto; Marcos André Gonçalves; Alberto H. F. Laender
Throughout the history of science, different knowledge areas have collaborated to overcome major research challenges. The task of associating a researcher with such areas makes a series of tasks feasible, such as the organization of digital repositories, expertise recommendation and the formation of research groups for complex problems. In this paper we propose a simple yet effective automatic classification model that is capable of categorizing research expertise according to a hierarchical knowledge area classification scheme. Our proposal relies on discriminative evidence provided by the titles of academic works, which are the minimum information capable of relating a researcher to their knowledge area. We also evaluate the use of learning-to-rank as an effective means to rank experts with minimum information. Our experiments show that, using supervised machine learning methods trained with manually labeled information, it is possible to produce effective classification and ranking models.
International Conference on Data Engineering | 2016
Guilherme Dal Bianco; Renata de Matos Galante; Carlos A. Heuser; Marcos André Gonçalves; Sérgio D. Canuto
The data deduplication task has attracted a considerable amount of attention from the research community in order to provide effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection strategy (T3S) that selects a reduced set of pairs to tune the deduplication process in large datasets. T3S selects the most representative pairs by following two stages. In the first stage, we propose a strategy to produce balanced subsets of candidate pairs for labeling. In the second stage, an active selection is incrementally invoked to remove the redundant pairs in the subsets created in the first stage in order to produce an even smaller and more informative training set. This training set is effectively used both to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that T3S is able to reduce the labeling effort substantially while achieving a competitive or superior matching quality when compared with state-of-the-art deduplication methods in large datasets.
Conference on Information and Knowledge Management | 2016
Daniel Xavier de Sousa; Sérgio D. Canuto; Thierson Couto Rosa; Wellington Santos Martins; Marcos André Gonçalves
Learning to Rank (L2R) is currently an essential task in basically all types of information systems, given the huge and ever-increasing amount of data made available. While many solutions have been proposed to improve L2R functions, relatively little attention has been paid to the task of improving the quality of the feature space. L2R strategies usually rely on dense feature representations, which contain noisy or redundant features, increasing the cost of the learning process without any benefits. Although feature selection (FS) strategies can be applied to reduce dimensionality and noise, side effects of such procedures have been neglected, such as the risk of getting very poor predictions in a few (but important) queries. In this paper we propose multi-objective FS strategies that optimize both aspects at the same time: ranking performance and risk-sensitive evaluation. For this, we approximate the Pareto-optimal set for multi-objective optimization in a new and original application to L2R. Our contributions include novel FS methods for L2R that optimize multiple, potentially conflicting, criteria. In particular, one of the objectives (risk-sensitive evaluation) has never before been optimized in the context of FS for L2R. Our experimental evaluation shows that our proposed methods select features that are both more effective (in ranking performance) and lower-risk than those selected by other state-of-the-art FS methods.
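The Pareto-optimality notion underlying these FS strategies is easy to make concrete. A minimal sketch (the scores and encoding are hypothetical): each candidate feature subset is summarized as a point (ranking effectiveness, risk safety), both to be maximized, and a subset stays on the front only if no other subset is at least as good on both objectives and strictly better on one:

```python
def pareto_front(points):
    """Return the Pareto-optimal points among (effectiveness, safety) pairs,
    where both coordinates are to be maximized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (effectiveness, safety) scores for five feature subsets.
subsets = [(0.80, 0.60), (0.75, 0.90), (0.85, 0.50), (0.70, 0.40), (0.72, 0.88)]
front = pareto_front(subsets)
print(front)
```

The front exposes the trade-off explicitly: no single subset wins on both axes, so the final choice among front members (e.g. favoring effectiveness versus low risk) is left to the practitioner.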