Leonardo Cristian Rocha

Universidade Federal de São João del-Rei

Publication

Featured research published by Leonardo Cristian Rocha.

congress on evolutionary computation | 2012

Aggressive and effective feature selection using genetic programming

Isac Sandin; Guilherme Andrade; Felipe Viegas; Daniel Madeira; Leonardo Cristian Rocha; Thiago Salles; Marcos André Gonçalves

One of the major challenges in automatic classification is to deal with highly dimensional data. Several dimensionality reduction strategies, including popular feature selection metrics such as Information Gain and χ², have already been proposed to deal with this situation. However, these strategies are not well suited when the data is very skewed, a common situation in real-world data sets. This occurs when the number of samples in one class is much larger than the others, causing common feature selection metrics to be biased towards the features observed in the largest class. In this paper, we propose the use of Genetic Programming (GP) to implement an aggressive, yet very effective, selection of attributes. Our GP-based strategy is able to largely reduce dimensionality, while dealing effectively with skewed data. To this end, we exploit some of the most common feature selection metrics and, with GP, combine their results into new sets of features, obtaining a better unbiased estimate for the discriminative power of each feature. Our proposal was evaluated against each individual feature selection metric used in our GP-based solution (namely, Information Gain, χ², Odds-Ratio, Correlation Coefficient) using a k8 cancer-rescue mutants data set, a very unbalanced collection referring to examples of p53 protein. For this data set, our solution not only increases the efficiency of the learning algorithms, with an aggressive reduction of the input space, but also significantly increases its accuracy.
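As a small illustration of the base metrics named in the abstract above, the following sketch computes Information Gain and χ² for a single binary feature from a 2×2 contingency table. It is not the GP-based strategy from the paper; the function names and the table layout (`table[f][c]` = number of samples with feature value `f` and class `c`) are assumptions made for this example.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """Information Gain of a binary feature.

    table[f][c] is the number of samples with feature value f (0 or 1)
    and class c (0 or 1). This layout is an assumption of the sketch.
    """
    class_totals = [table[0][c] + table[1][c] for c in (0, 1)]
    n = sum(class_totals)
    # Class entropy after splitting on the feature, weighted by split size.
    conditional = sum(
        (sum(row) / n) * entropy(row) for row in table if sum(row) > 0
    )
    return entropy(class_totals) - conditional


def chi_square(table):
    """Pearson chi-square statistic for the same 2x2 table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][c] + table[1][c] for c in (0, 1)]
    stat = 0.0
    for f in (0, 1):
        for c in (0, 1):
            expected = row_totals[f] * col_totals[c] / n
            if expected > 0:
                stat += (table[f][c] - expected) ** 2 / expected
    return stat
```

A feature that perfectly separates two balanced classes, `[[50, 0], [0, 50]]`, receives the maximum Information Gain of 1 bit, while a feature independent of the class, `[[25, 25], [25, 25]]`, scores zero on both metrics.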

international conference on conceptual structures | 2015

A Framework for Migrating Relational Datasets to NoSQL

Leonardo Cristian Rocha; Fernando Vale; Elder Cirilo; Dárlinton Barbosa; Fernando Mourão

In software development, migration from one Database Management System (DBMS) to another, especially one with distinct characteristics, is a challenge for programmers and database administrators. The changes required in the application code to comply with a new DBMS are usually vast, making migrations infeasible. To tackle this problem, we present NoSQLayer, a framework capable of conveniently supporting migration from relational (i.e., MySQL) to NoSQL DBMSs (i.e., MongoDB). The framework is presented in two parts: (1) a migration module and (2) a mapping module. The first is a set of methods enabling seamless migration between DBMSs (i.e., MySQL to MongoDB). The latter provides a persistence layer to process database requests, capable of translating and executing these requests in any DBMS and returning the data in a suitable format as well. Experiments show NoSQLayer to be a useful solution for handling large volumes of data (e.g., Web scale) for which traditional relational DBMSs might be ill-suited.
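To make the idea of moving relational rows into a document model more concrete, here is a minimal sketch under stated assumptions. It does not reproduce NoSQLayer: it uses Python's built-in `sqlite3` as a stand-in for MySQL, plain dictionaries as a stand-in for MongoDB documents, and a hypothetical two-table schema (`customer`, `purchase`) invented for this example.

```python
import sqlite3


def rows_as_dicts(conn, query, params=()):
    """Run a query and return each row as a {column: value} dictionary."""
    cur = conn.execute(query, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def customers_to_documents(conn):
    """Embed each customer's purchases into one document per customer.

    The one-to-many relation (a foreign key in the relational schema)
    becomes a nested list, a common document-model design choice.
    """
    docs = []
    for customer in rows_as_dicts(conn, "SELECT id, name FROM customer ORDER BY id"):
        orders = rows_as_dicts(
            conn,
            "SELECT item, quantity FROM purchase WHERE customer_id = ? ORDER BY id",
            (customer["id"],),
        )
        docs.append({"_id": customer["id"], "name": customer["name"], "orders": orders})
    return docs


def build_demo_db():
    """Create a small in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE purchase (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(id),
            item TEXT,
            quantity INTEGER
        );
        INSERT INTO customer VALUES (1, 'Ana'), (2, 'Bruno');
        INSERT INTO purchase VALUES (1, 1, 'book', 2), (2, 1, 'pen', 5), (3, 2, 'lamp', 1);
        """
    )
    return conn
```

Calling `customers_to_documents(build_demo_db())` yields one dictionary per customer with the related purchases nested under `orders`. Whether to embed or reference related rows is a design decision that depends on the access pattern; this sketch only shows the embedding option.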

Journal of Web Semantics | 2015

SACI: Sentiment Analysis by Collective Inspection on Social Media Content

Leonardo Cristian Rocha; Fernando Mourão; Thiago Silveira; Rodrigo Chaves; Giovanni Sá; Felipe Teixeira; Ramon Vieira; Renato Ferreira

Collective opinions observed in Social Media represent valuable information for a range of applications. In the pursuit of such information, current methods require prior knowledge of each individual opinion to determine the collective one in a post collection. In contrast, we assume that collective analysis could be better performed by exploiting overlaps among distinct posts of the collection. Thus, we propose SACI (Sentiment Analysis by Collective Inspection), a lexicon-based unsupervised method that extracts collective sentiments without concerning itself with individual classifications. SACI is based on a directed transition graph among terms of a post set and on a prior classification of these terms regarding their roles in consolidating opinions. Paths represent subsets of posts on this graph and the collective opinion is defined by traversing all paths. Besides demonstrating that collective analysis outperforms individual analysis w.r.t. approximating collection opinions, assessments of SACI show that good individual classifications do not guarantee good collective analysis and vice versa. Further, SACI simultaneously fulfills the requirements of efficacy, efficiency and handling of dynamicity posed by highly demanding scenarios. Indeed, the consolidation of a SACI-based Web tool for real-time analysis of tweets evinces the usefulness of this work.

association for information science and technology | 2016

A quantitative analysis of the temporal effects on automatic text classification

Thiago Salles; Leonardo Cristian Rocha; Marcos André Gonçalves; Jussara M. Almeida; Fernando Mourão; Wagner Meira; Felipe Viegas

Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well‐known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.

conference on information and knowledge management | 2015

Parallel Lazy Semi-Naive Bayes Strategies for Effective and Efficient Document Classification

Felipe Viegas; Marcos André Gonçalves; Wellington Santos Martins; Leonardo Cristian Rocha

Automatic Document Classification (ADC) is the basis of many important applications such as spam filtering and content organization. Naive Bayes (NB) approaches are a widely used classification paradigm, due to their simplicity, efficiency, absence of parameters and effectiveness. However, they do not present competitive effectiveness when compared to other modern statistical learning methods, such as SVMs. This is related to some characteristics of real document collections, such as class imbalance, feature sparseness and strong relationships among attributes. In this paper, we investigate whether the relaxation of the NB feature independence assumption (a.k.a. Semi-NB approaches) can improve its effectiveness in large text collections. We propose four new Lazy Semi-NB strategies that exploit different ideas for alleviating the NB independence assumption. By being lazy, our solutions focus only on the most important features to classify a given test document, overcoming some Semi-NB issues when applied to ADC such as bias towards larger classes and overfitting and/or lack of generalization of the models. We demonstrate that our Lazy Semi-NB proposals can produce superior effectiveness when compared to state-of-the-art ADC classifiers such as SVM and KNN. Moreover, to overcome some efficiency issues of combining Semi-NB and lazy strategies, we take advantage of current manycore GPU architectures and present a massively parallelized version of the Semi-NB approaches. Our experimental results show that speedups of up to 63.36 times can be obtained when compared to serial solutions, making our proposals very practical in real situations.

international conference on conceptual structures | 2015

Quantifying Complementarity among Strategies for Influencers’ Detection on Twitter

Alan Neves; Ramon Vieira; Fernando Mourão; Leonardo Cristian Rocha

The so-called influencer, a person with the ability to persuade people, plays an important role in information diffusion in social media environments. Indeed, influencers might dictate word-of-mouth and peer recommendation, impacting tasks such as recommendation, advertising, and brand evaluation, among others. Thus, a growing number of works aim to identify influencers by exploiting distinct information. Deciding on the best strategy for each domain, however, is a complex task due to the lack of consensus among these works. This paper presents a quantitative analysis of some of the main strategies for identifying influencers, aiming to help researchers with this decision. Besides determining semantic classes of strategies, based on the characteristics they exploit, we obtained through PCA an effective meta-learning process to linearly combine distinct strategies. As main implications, we highlight a better understanding of the selected strategies and a novel way to alleviate the difficulty of deciding which strategy researchers should adopt.

Neurocomputing | 2018

A Genetic Programming approach for feature selection in highly dimensional skewed data

Felipe Viegas; Leonardo Cristian Rocha; Marcos André Gonçalves; Fernando Mourão; Giovanni Sá; Thiago Salles; Guilherme Andrade; Isac Sandin

High dimensionality, also known as the curse of dimensionality, is still a major challenge for automatic classification solutions. Accordingly, several feature selection (FS) strategies have been proposed for dimensionality reduction over the years. However, they potentially perform poorly in the face of unbalanced data. In this work, we propose a novel feature selection strategy based on Genetic Programming, which is resilient to data skewness issues; in other words, it works well with both balanced and unbalanced data. The proposed strategy aims at combining the most discriminative feature sets selected by distinct feature selection metrics in order to obtain a more effective and impartial set of the most discriminative features, departing from the hypothesis that distinct feature selection metrics produce different (and potentially complementary) feature space projections. We evaluated our proposal on biological and textual datasets. Our experimental results show that our proposed solution not only increases the efficiency of the learning process, reducing the size of the data space by up to 83%, but also significantly increases its effectiveness in some scenarios.

acm symposium on applied computing | 2014

LEGi: context-aware lexicon consolidation by graph inspection

Giovanni Sá; Thiago Silveira; Rodrigo Chaves; Felipe Teixeira; Fernando Mourão; Leonardo Cristian Rocha

The value of subjective content available in Social Media has boosted the importance of Sentiment Analysis in this kind of scenario. However, performing Sentiment Analysis on Social Media is a challenging task, since the huge volume of short textual posts and the high dynamicity inherent to it pose strict requirements of efficiency and scalability. Despite all efforts, the literature still lacks proposals that address both requirements. In this sense, we propose LEGi, a corpus-based method for consolidating context-aware sentiment lexicons. It is based on a semi-supervised strategy for propagation of lexicon-semantic classes on a transition graph of terms. Empirical analyses on two distinct domains, derived from Twitter, demonstrate that LEGi outperformed four well-established methods for lexicon consolidation. Further, we found that LEGi's lexicons may improve the quality of the sentiment analysis performed by a traditional method in the literature. Thus, our results point to LEGi as a promising method for consolidating lexicons in highly demanding scenarios, such as Social Media.

symposium on applied computing | 2017

A framework for unexpectedness evaluation in recommendation

Thiago Silveira; Leonardo Cristian Rocha; Fernando Mourão; Marcos André Gonçalves

Assessment of usefulness in Recommender Systems (RSs) is a major research challenge nowadays. Due to its close relation to the notion of usefulness, unexpectedness has become the focus of several works. However, there is no consensus in the literature about how to measure it. In this context, this work implements the most referenced metrics, consolidating a framework of unexpectedness assessments in recommendation, allowing us to characterize, compare and combine all those metrics. Empirical evaluations on real data and different RSs demonstrate the applicability of our framework. Besides evincing that the existing metrics diverge about which RS provides more unexpected recommendations, the framework allowed us to combine all metrics into a single one able to capture different perspectives. We expect to help researchers and professionals working on RSs to understand the actual impact of distinct metrics w.r.t. unexpectedness as well as to select a proper metric to highlight gains or losses.

Information Systems | 2017

A Two-Stage Machine learning approach for temporally-robust text classification

Thiago Salles; Leonardo Cristian Rocha; Fernando Mourão; Marcos André Gonçalves; Felipe Viegas; Wagner Meira

One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated evidence of the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities and in the relationships between terms and classes [1]. In order to minimize the impact of temporal effects in ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, the derivation of the TWF requires the running of explicit and complex statistical tests, which are very cumbersome or cannot even be run in several cases. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time.
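The general idea of weighting training documents by their temporal distance to the test document can be sketched as follows. This is only an illustration: the abstract above does not give the TWF's expression, so the exponential decay used here, the `decay` parameter, the Jaccard similarity, and the similarity-weighted voting scheme are all assumptions of this example rather than the method proposed in the article.

```python
import math
from collections import defaultdict


def temporal_weight(delta_years, decay=0.5):
    """Hypothetical temporal weighting function: exponential decay
    in the temporal distance between training and test documents."""
    return math.exp(-decay * abs(delta_years))


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(test_terms, test_year, training, decay=0.5):
    """Predict a label by similarity-weighted voting.

    training is a list of (terms, year, label) tuples. Each training
    document votes for its label with weight
    similarity * temporal_weight(test_year - year).
    """
    scores = defaultdict(float)
    for terms, year, label in training:
        weight = temporal_weight(test_year - year, decay)
        scores[label] += jaccard(test_terms, terms) * weight
    return max(scores, key=scores.get)


# Toy data: the same terms are associated with different classes over time.
training = [
    (["web", "protocol"], 1995, "networks"),
    (["web", "protocol"], 1996, "networks"),
    (["web", "protocol"], 2010, "information retrieval"),
]
```

With `decay=0.0` every training document counts equally, so the two older documents win the vote for a 2011 test document; with a positive decay the most recent document dominates and the prediction changes.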