Publication

Featured research published by Leonardo C. da Rocha.


International Conference on Conceptual Structures | 2013

G-DBSCAN: A GPU Accelerated Algorithm for Density-based Clustering

Guilherme Andrade; Gabriel Ramos; Daniel Madeira; Rafael Sachetto; Renato Ferreira; Leonardo C. da Rocha

With the advent of Web 2.0, we see a new and differentiated scenario: there is more data than can be effectively analyzed. Organizing this data has become one of the biggest problems in Computer Science. Many algorithms have been proposed for this purpose, highlighting those related to the Data Mining area, specifically the clustering algorithms. However, these algorithms are still a computational challenge because of the volume of data that needs to be processed. We found in the literature some proposals to make these algorithms feasible and, recently, those related to parallelization on graphics processing units (GPUs) have presented good results. In this work we present G-DBSCAN, a GPU parallel version of one of the most widely used clustering algorithms, DBSCAN. Although there are other parallel versions of this algorithm, our technique distinguishes itself by the simplicity with which the data are indexed, using graphs, allowing various parallelization opportunities to be explored. In our evaluation, we show that G-DBSCAN, using a GPU, can be over 100x faster than its sequential CPU version.


Measurement and Modeling of Computer Systems | 2004

A characterization of broadband user behavior and their e-business activities

Humberto T. Marques Neto; Jussara M. Almeida; Leonardo C. da Rocha; Wagner Meira; Pedro Henrique Calais Guerra; Virgílio A. F. Almeida

This paper presents a characterization of broadband user behavior from an Internet Service Provider standpoint. Users are broken into two major categories: residential and Small-Office/Home-Office (SOHO). For each user category, the characterization is performed along four criteria: (i) session arrival process, (ii) session duration, (iii) number of bytes transferred within a session, and (iv) user request patterns. Our results show that both residential and SOHO session inter-arrival times are exponentially distributed. Whereas residential session arrival rates remain relatively high during the day, SOHO session arrival rates vary much more significantly during the day. On the other hand, a typical SOHO user session is longer and transfers a larger volume of data. Furthermore, our analysis uncovers two main groups of session request patterns within each user category. The first group consists of user sessions that use traditional Internet services, such as e-mail, instant messaging and, mostly, www services. Sessions from the second, smaller group typically use peer-to-peer file-sharing applications, remain active for longer periods and transfer a large amount of data. Looking further into the e-business services most commonly accessed, we found that subscription-based and advertising services account for the vast majority of user HTTP requests in both residential and SOHO workloads. Understanding these user behavior patterns is important to the development of more efficient applications for broadband users.
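The exponential inter-arrival model reported above can be fitted with a one-line maximum-likelihood estimate; a minimal sketch, not taken from the paper:

```python
def exp_rate_mle(arrival_times):
    """MLE of the rate lambda for exponentially distributed
    inter-arrival times: lambda_hat = 1 / mean(gaps).
    `arrival_times` are session start times, sorted ascending."""
    gaps = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    return len(gaps) / sum(gaps)
```

A goodness-of-fit check (e.g. a Q-Q plot of the observed gaps against the fitted exponential) would then confirm whether the exponential model holds for a given trace.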


Web Search and Data Mining | 2008

Understanding temporal aspects in document classification

Fernando Mourão; Leonardo C. da Rocha; Renata Braga Araújo; Thierson Couto; Marcos André Gonçalves; Wagner Meira

Due to the increasing amount of information present on the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually follows a standard supervised learning strategy, where we first build a model using pre-classified documents and then use it to classify new unseen documents. One major challenge for ADC in many scenarios is that the characteristics of the documents and the classes to which they belong may change over time. However, most of the current techniques for ADC are applied without taking into account the temporal evolution of the collection of documents. In this work, we perform a detailed study of the temporal evolution in ADC, introducing an analysis methodology. We discuss that temporal evolution may be explained by three factors: 1) class distribution; 2) term distribution; and 3) class similarity. We employ metrics and experimental strategies capable of isolating each of these factors in order to analyze them separately, using two very different document collections: the ACM Digital Library and the Medline medical collections. Moreover, we present some preliminary results of potential gains that could be obtained by varying the training set to find the ideal size that minimizes the time effects. We show that by using just 69% of the ACM database, we are able to achieve an accuracy of 89.76%, and with only 25% of Medline, an accuracy of 87.57%, which means gains of up to 20% in accuracy with much smaller training sets.
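The first of the three factors above, how the class distribution of a collection shifts over time, can be measured with a short sketch (illustrative only; the paper's metrics and methodology are richer):

```python
from collections import Counter

def class_distribution_by_period(docs):
    """Compute the per-period class distribution of a collection.
    `docs` is a list of (period, class_label) pairs, e.g. (year, label).
    Returns {period: {label: fraction}} so that distribution drift
    between periods can be compared directly."""
    per_period = {}
    for period, label in docs:
        per_period.setdefault(period, Counter())[label] += 1
    return {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for p, cnt in per_period.items()}
```

Comparing successive periods' distributions (for instance with a distance between the fraction vectors) then quantifies the drift that degrades classifiers trained on old data.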


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2010

Temporally-aware algorithms for document classification

Thiago Salles; Leonardo C. da Rocha; Gisele L. Pappa; Fernando Mourão; Wagner Meira; Marcos André Gonçalves

Automatic Document Classification (ADC) is still one of the major information retrieval problems. It usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use this model to classify unseen documents. The majority of supervised algorithms consider that all documents provide equally important information. However, in practice, a document may be considered more or less important to build the classification model according to several factors, such as its timeliness, the venue where it was published, or its authors, among others. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and how to minimize such impact. In order to deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both follows a lognormal distribution. We then extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
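A temporally-aware kNN along these lines might look as follows. The lognormal TWF parameters and the way the weight enters the vote are illustrative assumptions, not the values or exact formulation fitted in the paper:

```python
import math

def lognormal_twf(age, mu, sigma):
    """Hypothetical temporal weighting function: lognormal density of a
    document's age. mu and sigma are illustrative parameters, not the
    ones estimated in the paper."""
    age = max(age, 1e-9)  # guard against zero/negative ages
    return (math.exp(-((math.log(age) - mu) ** 2) / (2 * sigma ** 2))
            / (age * sigma * math.sqrt(2 * math.pi)))

def temporal_knn(query_vec, train, k, test_time, mu=0.0, sigma=1.0):
    """kNN where each neighbor's vote is scaled by the TWF of its age,
    mirroring the temporally-aware extension described above.
    `train` is a list of (vector, label, timestamp) tuples."""
    neighbors = sorted(train, key=lambda d: math.dist(query_vec, d[0]))[:k]
    votes = {}
    for vec, label, ts in neighbors:
        w = lognormal_twf(test_time - ts, mu, sigma)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```

With this weighting, an old neighbor contributes little to the vote even when it is geometrically close, which is the intended effect of the TWF.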


Conference on Information and Knowledge Management | 2008

Exploiting temporal contexts in text classification

Leonardo C. da Rocha; Fernando Mourão; Adriano C. M. Pereira; Marcos André Gonçalves; Wagner Meira

Due to the increasing amount of information being stored and accessible through the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use it to classify unseen documents. One major challenge in building classifiers is dealing with the temporal evolution of the characteristics of the documents and the classes to which they belong. However, most of the current techniques for ADC do not consider this evolution while building and using the models. Previous results show that the performance of classifiers may be affected by three different temporal effects (class distribution, term distribution and class similarity). Further, it has been shown that using just portions of the pre-classified documents, which we call contexts, to build the classifiers results in better performance, as a consequence of the minimization of the aforementioned effects. In this paper we define the concept of temporal contexts as the portions of documents that minimize those effects. We then propose a general algorithm for determining such contexts, discuss its implementation-related issues, and propose a heuristic that is able to determine temporal contexts efficiently. In order to demonstrate the effectiveness of our strategy, we evaluated it using two distinct collections: ACM-DL and MedLine. We initially evaluated the reduction in terms of both the effort to build a classifier and the entropy associated with each context. Further, we evaluated whether these observed reductions translate into better classification performance by employing a very simple classifier, majority voting. The results show that we achieved precision gains of up to 30% compared to a version that is not temporally contextualized, and the same accuracy as a state-of-the-art classifier (SVM), while presenting an execution time up to hundreds of times faster.
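An illustrative take on temporal-context selection, using class entropy as the quantity to minimize over candidate training windows (an assumption for the sketch; the paper's actual criterion combines several temporal effects):

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a label multiset."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_temporal_context(train, ref_time):
    """Among training windows [start, ref_time], pick the one whose
    class labels have the lowest entropy. `train` is a list of
    (timestamp, label) pairs. Returns the chosen (start, end) window."""
    best, best_h = None, float('inf')
    for start in sorted({ts for ts, _ in train}):
        window = [lab for ts, lab in train if start <= ts <= ref_time]
        if not window:
            continue
        h = entropy(window)
        if h < best_h:
            best, best_h = (start, ref_time), h
    return best
```

A classifier trained only on documents inside the chosen window then sees a more homogeneous label distribution, which is the intuition behind the entropy reduction reported above.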


Proceedings of the 2004 ACM Workshop on Next-Generation Residential Broadband Challenges | 2004

Characterizing broadband user behavior

Humberto T. Marques; Leonardo C. da Rocha; Pedro Henrique Calais Guerra; Jussara M. Almeida; Wagner Meira; Virgílio A. F. Almeida

This paper presents a characterization of broadband user behavior from an Internet service provider standpoint. Users are broken into two major categories: residential and Small-Office/Home-Office (SOHO). For each user category, the characterization is performed along four criteria: (i) session arrival process, (ii) session duration, (iii) number of bytes transferred within a session, and (iv) user request patterns. Our results show that both residential and SOHO session inter-arrival times are exponentially distributed. Whereas residential session arrival rates remain relatively high during the day, SOHO session arrival rates vary much more significantly during the day. On the other hand, a typical SOHO user session is longer and transfers a larger volume of data. Furthermore, our analysis uncovers two main groups of session request patterns within each user category. Sessions from the first group use traditional Internet services, such as www, e-mail and instant messaging, and sessions from the second, smaller group typically use peer-to-peer file-sharing applications. This second group remains active for longer periods and transfers a large amount of data. Understanding these user behavior patterns is important to the development of more efficient applications for broadband users.


Symposium on Computer Architecture and High Performance Computing | 2013

GPU-NB: A Fast CUDA-Based Implementation of Naïve Bayes

Felipe Viegas; Guilherme Andrade; Jussara M. Almeida; Renato Ferreira; Marcos André Gonçalves; Gabriel Ramos; Leonardo C. da Rocha

The advent of Web 2.0 has given rise to an interesting phenomenon: there is currently much more data than can be effectively analyzed without relying on sophisticated automatic tools. Some of these tools, which target the organization and extraction of useful knowledge from this huge amount of data, rely on machine learning and data or text mining techniques, specifically automatic document classification (ADC) algorithms. However, these algorithms are still a computational challenge because of the volume of data that needs to be processed. Some of the strategies available to address this challenge are based on the parallelization of ADC algorithms. In this work, we present GPU-NB, a parallel version of one of the most widely used document classification algorithms, Naïve Bayes, that uses graphics processing units (GPUs). In our evaluation using 6 different document collections, we show that GPU-NB can maintain the same classification effectiveness (in most cases) while being up to 34x faster than its sequential CPU version. GPU-NB is also up to 11x faster than a CPU-based parallel implementation of Naïve Bayes running with 4 threads. Moreover, assuming an optimistic behavior of the CPU parallelization, GPU-NB should outperform the CPU-based implementation with up to 32 cores, at a small fraction of the cost. We also show that the efficiency of the GPU-NB parallelization is impacted by features of the document collections, particularly the number of classes, although the density of the collection (average number of occurrences of terms per document) has a significant impact as well.
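For reference, a plain multinomial Naïve Bayes is sketched below. Its scoring step computes every (document, class) score independently, which is the data-parallel structure a GPU version can exploit; this is a generic sequential sketch, not the GPU-NB code:

```python
import math

def nb_train(docs, labels, vocab_size, alpha=1.0):
    """Multinomial Naive Bayes training with Laplace smoothing.
    `docs` are term-count vectors of length `vocab_size`.
    Returns per-class log priors and log term likelihoods."""
    classes = sorted(set(labels))
    log_prior, log_lik = {}, {}
    for c in classes:
        rows = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(rows) / len(docs))
        term_counts = [sum(r[t] for r in rows) for t in range(vocab_size)]
        total = sum(term_counts) + alpha * vocab_size
        log_lik[c] = [math.log((tc + alpha) / total) for tc in term_counts]
    return log_prior, log_lik

def nb_classify(doc, log_prior, log_lik):
    """Scoring: each (document, class) score is a sum of independent
    terms, so on a GPU one thread (or block) can own one such pair."""
    scores = {c: log_prior[c] + sum(n * log_lik[c][t] for t, n in enumerate(doc))
              for c in log_prior}
    return max(scores, key=scores.get)
```

Since classification is embarrassingly parallel across documents and classes, the remaining engineering challenge on a GPU is memory layout for the term-likelihood table, which is where collection density comes into play.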


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

BROOF: Exploiting Out-of-Bag Errors, Boosting and Random Forests for Effective Automated Classification

Thiago Salles; Marcos André Gonçalves; Victor Augusto Bianchetti Rodrigues; Leonardo C. da Rocha

Random Forests (RF) and Boosting are two of the most successful supervised learning paradigms for automatic classification. In this work we propose to combine both strategies in order to exploit their strengths while simultaneously addressing some of their drawbacks, especially when applied to high-dimensional and noisy classification tasks. More specifically, we propose a boosted version of the RF classifier (BROOF), which fits an additive model composed of several random forests (as weak learners). Unlike traditional boosting methods, which exploit the training error estimate, we use the stronger out-of-bag (OOB) error estimate, which is naturally produced by the bagging method used in RFs. The influence of each weak learner in the fitted additive model is inversely proportional to its OOB error. Moreover, the probability of selecting an out-of-bag training example is increased if it is misclassified by the simpler weak learners, in order to enable the boosted model to focus on complex regions of the input space. We also adopt a selective weight-updating procedure, in which only the out-of-bag examples are updated as the boosting iterations go by. This serves the purpose of slowing down the tendency to focus on just a few hard-to-classify examples. By mitigating this undesired bias, known to affect boosting algorithms in high-dimensional and noisy scenarios, through both the selective weighting scheme and a proper weak-learner effectiveness assessment, we greatly improve classification effectiveness.
Our experiments with several datasets in three representative high-dimensional and noisy domains (topic, sentiment and microarray data classification) and up to ten state-of-the-art classifiers (covering almost 500 results) show that BROOF is the only classifier to be among the top performers in all tested datasets from the topic classification domain, and in the vast majority of cases in the sentiment and microarray domains, a surprising result given that there is no single top-notch classifier for all datasets.
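The OOB-based influence idea can be sketched as follows; the exact influence formula shown here (AdaBoost-style log-odds of the OOB error) is an assumption for illustration, not necessarily the one BROOF uses:

```python
import math

def broof_learner_weight(oob_error, eps=1e-9):
    """Influence of a weak learner, decreasing in its out-of-bag error.
    Assumed log-odds form: log((1 - err) / err), clipped away from 0/1."""
    oob_error = min(max(oob_error, eps), 1 - eps)
    return math.log((1 - oob_error) / oob_error)

def boosted_predict(weak_preds, oob_errors):
    """Weighted vote over the weak learners' predictions for a single
    example. `weak_preds` is one predicted label per learner and
    `oob_errors` the corresponding OOB error estimates."""
    votes = {}
    for pred, err in zip(weak_preds, oob_errors):
        votes[pred] = votes.get(pred, 0.0) + broof_learner_weight(err)
    return max(votes, key=votes.get)
```

A learner with a low OOB error thus dominates the vote, while near-random learners (OOB error close to 0.5) contribute almost nothing, matching the "influence inversely proportional to OOB error" description above.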


Information Systems | 2013

Temporal contexts: Effective text classification in evolving document collections

Leonardo C. da Rocha; Fernando Mourão; Hilton de Oliveira Mota; Thiago Salles; Marcos André Gonçalves; Wagner Meira

The management of the huge and growing amount of information available nowadays makes Automatic Document Classification (ADC) both crucial and very challenging. Furthermore, the dynamics inherent to classification problems, mainly on the Web, make this task even more challenging. Despite this fact, the actual impact of such temporal evolution on ADC is still poorly understood in the literature. In this context, this work evaluates, characterizes and exploits temporal evolution to improve ADC techniques. As a first contribution, we propose a pragmatic methodology for evaluating temporal evolution in ADC domains. Through this methodology, we can identify measurable factors associated with the degradation of ADC models over time. Going a step further, based on such analyses, we propose effective and efficient strategies to make current techniques more robust to natural shifts over time. We present a strategy, named temporal context selection, for selecting portions of the training set that minimize those factors. Our second contribution is a general algorithm, called Chronos, for determining such contexts. By instantiating Chronos, we are able to reduce uncertainty and improve overall classification accuracy. Empirical evaluations of heuristic instantiations of the algorithm, named WindowsChronos and FilterChronos, on two real document collections demonstrate the usefulness of our proposal. Comparing them against state-of-the-art ADC algorithms shows that selecting temporal contexts allows improvements in classification accuracy of up to 10%. Finally, we highlight the applicability and generality of our proposal in practice, pointing out this study as a promising research direction.


Job Scheduling Strategies for Parallel Processing | 2005

AnthillSched: a scheduling strategy for irregular and iterative I/O-intensive parallel jobs

Luís Fabrício Wanderley Góes; Pedro Henrique Calais Guerra; Bruno Coutinho; Leonardo C. da Rocha; Wagner Meira; Renato Ferreira; Dorgival O. Guedes; Walfredo Cirne

Irregular and iterative I/O-intensive jobs need a different approach from parallel job schedulers: the focus is no longer only on processing requirements; memory, network and storage capacity must all be considered in making a scheduling decision. Job executions are irregular and data dependent, alternating between CPU-bound and I/O-bound phases. In this paper, we propose and implement a parallel job scheduling strategy for such jobs, called AnthillSched, based on a simple heuristic: we map the behavior of a parallel application with minimal resources as we vary its input parameters. From that mapping we infer the best scheduling for a given set of input parameters and the available resources. To test and verify AnthillSched we used logs obtained from a real system executing data mining jobs. Our main contributions are the implementation of a parallel job scheduling strategy in a real system and the performance analysis of AnthillSched, which allowed us to discard some scheduling alternatives considered previously.

Collaboration

Dive into Leonardo C. da Rocha's collaboration.

Top Co-Authors

- Fernando Mourão (Universidade Federal de Minas Gerais)
- Wagner Meira (Universidade Federal de Minas Gerais)
- Marcos André Gonçalves (Universidade Federal de Minas Gerais)
- Thiago Salles (Universidade Federal de Minas Gerais)
- Guilherme Andrade (Universidade Federal de São João del-Rei)
- Renato Ferreira (Universidade Federal de Minas Gerais)
- Adriano C. M. Pereira (Universidade Federal de Minas Gerais)
- Felipe Viegas (Universidade Federal de Minas Gerais)
- Gabriel Ramos (Universidade Federal de São João del-Rei)
- Rafael Sachetto (Universidade Federal de São João del-Rei)