Felipe Viegas
Universidade Federal de Minas Gerais
Publications
Featured research published by Felipe Viegas.
Congress on Evolutionary Computation | 2012
Isac Sandin; Guilherme Andrade; Felipe Viegas; Daniel Madeira; Leonardo Cristian Rocha; Thiago Salles; Marcos André Gonçalves
One of the major challenges in automatic classification is dealing with high-dimensional data. Several dimensionality reduction strategies, including popular feature selection metrics such as Information Gain and χ2, have already been proposed to deal with this situation. However, these strategies are not well suited when the data is very skewed, a common situation in real-world data sets. This occurs when the number of samples in one class is much larger than in the others, causing common feature selection metrics to be biased towards the features observed in the largest class. In this paper, we propose the use of Genetic Programming (GP) to implement an aggressive, yet very effective, selection of attributes. Our GP-based strategy is able to largely reduce dimensionality while dealing effectively with skewed data. To this end, we exploit some of the most common feature selection metrics and, with GP, combine their results into new sets of features, obtaining a better, less biased estimate of the discriminative power of each feature. Our proposal was evaluated against each individual feature selection metric used in our GP-based solution (namely, Information Gain, χ2, Odds-Ratio, Correlation Coefficient) using the K8 cancer-rescue mutants data set, a very unbalanced collection of p53 protein examples. For this data set, our solution not only increases the efficiency of the learning algorithms, with an aggressive reduction of the input space, but also significantly increases their accuracy.
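A minimal sketch of the core idea, assuming scikit-learn is available: compute several feature selection scores (mutual information as a proxy for Information Gain, plus χ2) as the terminals a GP individual would combine, then keep the top-k features under the combined score. The averaged-rank combination below is a hypothetical stand-in for the evolved GP expression, not the paper's actual method.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def combined_feature_ranking(X, y, k=100):
    """Score every feature with two FS metrics and combine their ranks.

    In the paper, GP evolves the combining expression; here an averaged
    rank stands in for the evolved program (illustration only).
    """
    chi2_scores, _ = chi2(X, y)             # chi-squared statistic per feature
    ig_scores = mutual_info_classif(X, y)   # mutual information ~ Information Gain
    # Rank features under each metric (rank 0 = most discriminative).
    chi2_rank = np.argsort(np.argsort(-chi2_scores))
    ig_rank = np.argsort(np.argsort(-ig_scores))
    combined = (chi2_rank + ig_rank) / 2.0  # stand-in for the GP-evolved combination
    return np.argsort(combined)[:k]         # indices of the k best features

# Usage: keep only the selected columns before training a classifier.
# selected = combined_feature_ranking(X_train, y_train, k=100)
# X_train_reduced = X_train[:, selected]
```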
Symposium on Computer Architecture and High Performance Computing | 2013
Felipe Viegas; Guilherme Andrade; Jussara M. Almeida; Renato Ferreira; Marcos André Gonçalves; Gabriel Ramos; Leonardo C. da Rocha
The advent of Web 2.0 has given rise to an interesting phenomenon: there is currently much more data than can be effectively analyzed without relying on sophisticated automatic tools. Some of these tools, which target the organization and extraction of useful knowledge from this huge amount of data, rely on machine learning and data or text mining techniques, specifically automatic document classification (ADC) algorithms. However, these algorithms remain a computational challenge because of the volume of data that needs to be processed. Some of the strategies available to address this challenge are based on the parallelization of ADC algorithms. In this work, we present GPU-NB, a parallel version of one of the most widely used document classification algorithms, Naïve Bayes, that exploits graphics processing units (GPUs). In our evaluation using 6 different document collections, we show that GPU-NB maintains the same classification effectiveness (in most cases) while achieving speedups of up to 34x over its sequential CPU version. GPU-NB is also up to 11x faster than a CPU-based parallel implementation of Naïve Bayes running with 4 threads. Moreover, assuming an optimistic behavior of the CPU parallelization, GPU-NB should outperform the CPU-based implementation with up to 32 cores, at a small fraction of the cost. We also show that the efficiency of the GPU-NB parallelization is impacted by features of the document collections, particularly the number of classes, although the density of the collection (average number of occurrences of terms per document) also has a significant impact.
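The key observation that makes Naïve Bayes GPU-friendly is that, in log space, scoring all test documents against all classes reduces to a single matrix product. A minimal NumPy sketch of that formulation follows (variable names are illustrative; on a GPU the same expressions run unchanged with, e.g., CuPy arrays in place of NumPy ones).

```python
import numpy as np

def nb_train(X, y, n_classes, alpha=1.0):
    """Multinomial Naive Bayes training: per-class priors and term log-probabilities."""
    log_prior = np.zeros(n_classes)
    log_theta = np.zeros((X.shape[1], n_classes))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))
        counts = Xc.sum(axis=0) + alpha          # Laplace smoothing
        log_theta[:, c] = np.log(counts / counts.sum())
    return log_prior, log_theta

def nb_predict(X_test, log_prior, log_theta):
    # One matrix product scores every (document, class) pair at once;
    # this is the operation a GPU parallelizes massively.
    scores = X_test @ log_theta + log_prior
    return scores.argmax(axis=1)
```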
Association for Information Science and Technology | 2016
Thiago Salles; Leonardo Cristian Rocha; Marcos André Gonçalves; Jussara M. Almeida; Fernando Mourão; Wagner Meira; Felipe Viegas
Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.
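A sketch of how one of the reported temporal effects (variation of the class distribution over time) can be quantified, assuming each document carries a timestamp; all names below are illustrative, not the paper's code.

```python
import numpy as np

def class_distribution_drift(labels, years):
    """Total-variation distance between each period's class distribution
    and the overall one; larger values indicate stronger temporal effects."""
    labels, years = np.asarray(labels), np.asarray(years)
    classes = np.unique(labels)
    overall = np.array([(labels == c).mean() for c in classes])
    drift = {}
    for yr in np.unique(years):
        mask = years == yr
        per_year = np.array([(labels[mask] == c).mean() for c in classes])
        drift[int(yr)] = 0.5 * np.abs(per_year - overall).sum()
    return drift  # e.g., {2010: 0.05, 2011: 0.18, ...}
```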
Conference on Information and Knowledge Management | 2015
Felipe Viegas; Marcos André Gonçalves; Wellington Santos Martins; Leonardo Cristian Rocha
Automatic Document Classification (ADC) is the basis of many important applications such as spam filtering and content organization. Naive Bayes (NB) approaches are a widely used classification paradigm, due to their simplicity, efficiency, absence of parameters, and effectiveness. However, they do not present competitive effectiveness when compared to other modern statistical learning methods, such as SVMs. This is related to some characteristics of real document collections, such as class imbalance, feature sparseness, and strong relationships among attributes. In this paper, we investigate whether relaxing the NB feature independence assumption (a.k.a. Semi-NB approaches) can improve its effectiveness in large text collections. We propose four new Lazy Semi-NB strategies that exploit different ideas for alleviating the NB independence assumption. By being lazy, our solutions focus only on the most important features to classify a given test document, overcoming some Semi-NB issues when applied to ADC such as bias towards larger classes and overfitting and/or lack of generalization of the models. We demonstrate that our Lazy Semi-NB proposals can produce superior effectiveness when compared to state-of-the-art ADC classifiers such as SVM and KNN. Moreover, to overcome some efficiency issues of combining Semi-NB and lazy strategies, we take advantage of current manycore GPU architectures and present a massively parallelized version of the Semi-NB approaches. Our experimental results show that speedups of up to 63.36 times can be obtained when compared to serial solutions, making our proposals very practical in real situations.
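A minimal sketch of the "lazy" ingredient: at classification time, only the terms that actually occur in the test document contribute to the NB score, so the relevant sub-model is assembled per test instance. This is a hypothetical simplification of the paper's Semi-NB strategies, which additionally model dependencies between the selected features; it reuses a trained log-probability table like the one sketched earlier.

```python
import numpy as np

def lazy_nb_score(test_doc, log_prior, log_theta):
    """Score one test document using only its non-zero features.

    test_doc: 1-D term-count vector; log_theta: (n_terms, n_classes).
    Restricting the computation to present terms is what makes the
    method lazy: work is done per test instance, at query time.
    """
    present = np.nonzero(test_doc)[0]  # features of this document only
    scores = log_prior + test_doc[present] @ log_theta[present]
    return int(scores.argmax())
```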
Neurocomputing | 2018
Felipe Viegas; Leonardo Cristian Rocha; Marcos André Gonçalves; Fernando Mourão; Giovanni Sá; Thiago Salles; Guilherme Andrade; Isac Sandin
High dimensionality, also known as the curse of dimensionality, is still a major challenge for automatic classification solutions. Accordingly, several feature selection (FS) strategies have been proposed for dimensionality reduction over the years. However, they potentially perform poorly in the face of unbalanced data. In this work, we propose a novel feature selection strategy based on Genetic Programming that is resilient to data skewness issues; in other words, it works well with both balanced and unbalanced data. The proposed strategy aims at combining the most discriminative feature sets selected by distinct feature selection metrics in order to obtain a more effective and impartial set of the most discriminative features, starting from the hypothesis that distinct feature selection metrics produce different (and potentially complementary) feature space projections. We evaluated our proposal on biological and textual datasets. Our experimental results show that our proposed solution not only increases the efficiency of the learning process, reducing the size of the feature space by up to 83%, but also significantly increases its effectiveness in some scenarios.
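A sketch of the evolutionary loop, simplified from full GP (which evolves expression trees) to a small genetic search over linear combination weights of the per-metric scores; the fitness callable, population sizes, and mutation scale are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def evolve_weights(metric_scores, fitness, pop=20, gens=30, k=100):
    """Genetic search for weights combining several FS metric scores.

    metric_scores: (n_metrics, n_features) array of per-metric scores.
    fitness: callable taking selected feature indices, returning a score
             (e.g., validation accuracy of a classifier on those features).
    A simplified stand-in for the paper's GP over expression trees.
    """
    n_metrics = metric_scores.shape[0]
    population = rng.random((pop, n_metrics))
    for _ in range(gens):
        combos = population @ metric_scores                  # combined scores
        selected = np.argsort(-combos, axis=1)[:, :k]        # top-k per individual
        fit = np.array([fitness(s) for s in selected])
        parents = population[np.argsort(-fit)[: pop // 2]]   # truncation selection
        children = parents + rng.normal(0, 0.1, parents.shape)  # Gaussian mutation
        population = np.vstack([parents, children])
    final_sel = [np.argsort(-(w @ metric_scores))[:k] for w in population]
    best_idx = int(np.argmax([fitness(s) for s in final_sel]))
    return population[best_idx]
```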
Information Systems | 2017
Thiago Salles; Leonardo Cristian Rocha; Fernando Mourão; Marcos André Gonçalves; Felipe Viegas; Wagner Meira
One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated evidence of the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes [1]. In order to minimize the impact of temporal effects on ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, the derivation of the TWF requires running explicit and complex statistical tests, which are very cumbersome or cannot even be run in several cases. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time.
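A sketch of how a TWF can be incorporated into a classifier via instance weighting. The exponential-decay form below is an assumption for illustration (the actual functional form and parameters are what the paper's methodology learns); scikit-learn's sample_weight hook does the incorporation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def temporally_aware_fit(X_train, y_train, train_years, test_year, lam=0.3):
    """Weight each training document by its temporal distance to the test
    point using an assumed exponential TWF, then train a weighted NB."""
    weights = np.exp(-lam * np.abs(np.asarray(train_years) - test_year))
    clf = MultinomialNB()
    clf.fit(X_train, y_train, sample_weight=weights)  # TWF enters as instance weights
    return clf
```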
ACM Symposium on Applied Computing | 2015
Leonardo C. da Rocha; Gabriel Ramos; Rodrigo Chaves; Rafael Sachetto; Daniel Madeira; Felipe Viegas; Guilherme Andrade; Sérgio Daniel; Marcos André Gonçalves; Renato Ferreira
Nowadays, there is more data available than can be effectively analyzed. Organizing this data has become one of the biggest problems in Computer Science. Many algorithms have been proposed for this purpose, notably those from the Data Mining area, specifically automatic document classification (ADC) algorithms. However, these algorithms are still a computational challenge because of the volume of data that needs to be processed. We found in the literature some proposals related to parallelization on graphics processing units (GPUs) to make these algorithms feasible. Still, most of the available parallel solutions ignore specific ADC challenges, such as high dimensionality and heterogeneity in the representation of the documents. In this context, we present G-KNN, a GPU-based parallel version of the k-nearest neighbors (KNN) algorithm, one of the most widely used ADC algorithms. In our evaluation using five different document collections, we show that G-KNN can maintain the same classification effectiveness while achieving speedups of up to 12x over its sequential CPU version and up to 3x over a CPU-based parallel implementation running with 6 threads. Moreover, our algorithm has a much lower memory consumption, enabling its use with large datasets.
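A compact sketch of the computation G-KNN parallelizes: cosine similarities between test documents and all training documents as one sparse matrix product, followed by a similarity-weighted vote among the k nearest. The SciPy/scikit-learn calls are real; the function itself is an illustrative CPU baseline, not the paper's GPU kernel.

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def knn_classify(X_train, y_train, X_test, k=30):
    """KNN over L2-normalized sparse vectors: dot product == cosine similarity.
    The similarity matrix product is the step GPU parallelization targets."""
    Xtr = normalize(csr_matrix(X_train))
    Xte = normalize(csr_matrix(X_test))
    sims = (Xte @ Xtr.T).toarray()           # (n_test, n_train) similarities
    preds = []
    for row in sims:
        nn = np.argsort(-row)[:k]            # indices of the k nearest neighbors
        votes = Counter()
        for i in nn:
            votes[y_train[i]] += row[i]      # similarity-weighted vote
        preds.append(votes.most_common(1)[0][0])
    return preds
```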
Conference on Information and Knowledge Management | 2018
Felipe Viegas; Washington Luiz; Christian Gomes; Amir Khatibi; Sérgio D. Canuto; Fernando Mourão; Thiago Salles; Leonardo C. da Rocha; Marcos André Gonçalves
In this paper, we advance the state-of-the-art in topic modeling through the design and development of a novel (semi-formal) general topic modeling framework. The novel contributions of our solution include: (i) the introduction of new semantically-enhanced data representations for topic modeling based on pooling, and (ii) the proposal of a novel topic extraction strategy - ASToC - that solves the difficulty of representing topics in our semantically-enhanced information space. In our extensive experimental evaluation, covering 12 datasets and 12 state-of-the-art baselines, totaling 108 tests, our solution wins (with a few ties) in almost 100 cases, with gains of more than 50% against the best baselines (and of up to 80% against some runner-ups). We provide qualitative and quantitative statistical analyses of why our solutions work so well. Finally, we show that our method is able to improve document representation in automatic text classification.
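A sketch of the pooling idea: short documents are grouped into larger pseudo-documents before topic extraction, which densifies co-occurrence statistics. scikit-learn's LatentDirichletAllocation stands in for the topic model, and the externally supplied pool ids are purely illustrative (the paper pools by semantic enhancement, not an arbitrary key).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def pooled_topics(docs, pool_ids, n_topics=10, top_n=8):
    """Concatenate documents sharing a pool id into pseudo-documents,
    then fit LDA and return the top words of each topic."""
    pools = {}
    for doc, pid in zip(docs, pool_ids):
        pools.setdefault(pid, []).append(doc)
    pseudo_docs = [" ".join(group) for group in pools.values()]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(pseudo_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in comp.argsort()[-top_n:][::-1]]
            for comp in lda.components_]
```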
Neurocomputing | 2018
Felipe Viegas; Leonardo C. da Rocha; Elaine Resende; Thiago Salles; Wellington Martins; Mateus Ferreira e Freitas; Marcos André Gonçalves
Automatic Document Classification (ADC) has become the basis of many important applications, e.g., authorship identification, opinion mining, spam filtering, and content organizers. Due to their simplicity, efficiency, absence of parameters, and effectiveness in several scenarios, Naive Bayes (NB) approaches are widely used as a classification paradigm. However, due to some characteristics of real document collections, e.g., class imbalance and feature sparseness, NB solutions do not present competitive effectiveness in some ADC tasks when compared to other supervised learning strategies, e.g., SVMs. In this article, we investigate whether a proper combination of some alternative NB learning models with different feature weighting techniques is able to improve NB effectiveness in ADC tasks, and verify that comparable or even superior results relative to the state-of-the-art in ADC can be achieved. Moreover, we also present an investigation of the relaxation of the NB attribute independence assumption (a.k.a. Semi-Naive approaches) in large text collections, something missing in the literature. Given the high computational costs of these investigations, we take advantage of current manycore GPU and multi-GPU architectures to perform them, presenting a massively parallelized version of the NB approach. Finally, supported by the parallel implementations, we propose four novel Lazy Semi-NB approaches to overcome potential overfitting problems. In our experiments, the new lazy solutions are not only more efficient and effective than existing Semi-NB approaches, but also surpass, in terms of effectiveness, all other alternatives in the majority of cases.
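A minimal sketch of the feature weighting investigation: the same NB learner trained under two weighting schemes (raw term frequency vs. TF-IDF), assuming scikit-learn. The paper's study covers more learning models and weighting techniques; this only illustrates the experimental pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

def compare_weightings(train_docs, y_train, test_docs, y_test):
    """Fit MultinomialNB under two feature weighting schemes, report macro-F1."""
    results = {}
    for name, vec in [("tf", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        Xtr = vec.fit_transform(train_docs)   # fit vocabulary on training data only
        Xte = vec.transform(test_docs)
        pred = MultinomialNB().fit(Xtr, y_train).predict(Xte)
        results[name] = f1_score(y_test, pred, average="macro")
    return results  # e.g., {"tf": 0.81, "tfidf": 0.84}
```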
Information Sciences | 2018
Fernando Mourão; Leonardo Cristian Rocha; Felipe Viegas; Thiago Salles; Marcos André Gonçalves; Srinivasan Parthasarathy; Wagner Meira
Aiming to handle the complexity inherent to human textual communication, Automatic Document Classification (ADC) methods often adopt several simplifications. One such simplification is to treat the terms that compose documents as independent, which may hide important relationships between them. These relationships can encapsulate non-trivial and effective patterns to improve classification effectiveness. In this work, we propose NetClass, a new network-based model for documents that explicitly considers term relationships, and introduce a family of relational algorithms for ADC, such as the LRN-WRN classifier, a lazy relational ADC algorithm that exploits not only relationships between terms but also neighborhood information. As our extensive experimental evaluation shows, the proposed LRN-WRN achieves competitive performance when compared to the state-of-the-art in ADC, including SVM, across seven distinct domains. More specifically, LRN-WRN outperforms state-of-the-art classifiers in 5 out of 7 domains, ranking among the top-2 best-performing classifiers in all assessed domains. Our evaluation highlights the high effectiveness of our proposal, as well as its efficiency in terms of runtime. Indeed, besides effectiveness and efficiency, the simplicity and the absence of complex parameter tuning are key characteristics that make our algorithms interesting alternatives for ADC. In particular, as highlighted by our experimental evaluation, LRN-WRN was shown to be a promising alternative for dynamic domains with a huge volume of short texts (e.g., social media content) or with many classes.
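A sketch of the network-based idea: build a term co-occurrence graph from the training set, associate terms with classes, and classify a test document by aggregating the class evidence of its terms and their graph neighbors. This is a hypothetical simplification of the LRN-WRN algorithm; the hop_weight discount and the counting scheme are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_term_network(train_docs, train_labels):
    """train_docs: list of token lists. Returns per-term class counts and a
    co-occurrence graph (terms appearing in the same document are linked)."""
    term_class = defaultdict(lambda: defaultdict(float))
    neighbors = defaultdict(set)
    for tokens, label in zip(train_docs, train_labels):
        for t in set(tokens):
            term_class[t][label] += 1.0
        for a, b in combinations(set(tokens), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return term_class, neighbors

def classify(tokens, term_class, neighbors, hop_weight=0.25):
    """Lazy relational scoring: each test term votes with its class counts,
    and its graph neighbors add discounted votes (the relational evidence)."""
    scores = defaultdict(float)
    for t in set(tokens):
        for c, w in term_class.get(t, {}).items():
            scores[c] += w
        for nb in neighbors.get(t, ()):
            for c, w in term_class.get(nb, {}).items():
                scores[c] += hop_weight * w
    return max(scores, key=scores.get) if scores else None
```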