Leonardo Andrade Ribeiro
Universidade Federal de Goiás
Publications
Featured research published by Leonardo Andrade Ribeiro.
Information Systems | 2011
Leonardo Andrade Ribeiro; Theo Härder
Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. Most methods proposed so far comprise two main phases at a high level of abstraction: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, concentrating most of the effort in the candidate generation phase to obtain better pruning. Here, we propose the opposite approach: we drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to known algorithms.
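To make the two-phase structure concrete, the following Python sketch implements the classic candidate-generation/verification pattern with prefix filtering under Jaccard similarity. It is a minimal illustration of the general framework described above, not the specific index-reduction technique proposed in the paper; the token ordering, prefix sizing, and self-join loop are standard textbook choices.

from collections import defaultdict

def jaccard(r, s):
    """Exact Jaccard similarity between two token sets."""
    inter = len(r & s)
    return inter / (len(r) + len(s) - inter)

def set_similarity_join(sets, threshold):
    """Return all pairs of set ids whose Jaccard similarity is at least threshold.

    Candidate generation: probe an inverted index with the prefix of each
    set (tokens ordered by global frequency, rarest first).
    Verification: apply the exact similarity measure to the candidates.
    """
    freq = defaultdict(int)
    for s in sets:
        for t in s:
            freq[t] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)            # token -> ids of indexed sets
    results = []
    for i, tokens in enumerate(ordered):
        # Prefix long enough that no true match can be missed.
        prefix_len = len(tokens) - int(len(tokens) * threshold) + 1
        candidates = set()
        for t in tokens[:prefix_len]:
            candidates.update(index[t])
            index[t].append(i)           # index the probing set as we go
        for j in candidates:             # verification phase
            if jaccard(sets[i], sets[j]) >= threshold:
                results.append((j, i))
    return results

print(set_similarity_join([{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}], 0.5))
# -> [(0, 1)]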
advances in databases and information systems | 2009
Leonardo Andrade Ribeiro; Theo Härder
Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. Most set similarity join methods proposed so far comprise two main phases at a high level of abstraction: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, concentrating most of the effort in the candidate generation phase to obtain better pruning. Here, we propose the opposite approach: we drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to previous algorithms.
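Since the proposed trade-off shifts work onto verification, a sketch of that phase is useful: the snippet below checks a Jaccard threshold by counting the overlap of two token lists sorted in the same global order, abandoning a pair as soon as the required overlap becomes unreachable. This is a generic merge-based verifier with early termination, assumed here for illustration rather than taken from the paper.

import math

def overlap_with_early_stop(r, s, min_overlap):
    """Merge two token lists sorted in the same global order, counting
    common tokens and giving up as soon as the required overlap can no
    longer be reached."""
    i = j = overlap = 0
    while i < len(r) and j < len(s):
        if overlap + min(len(r) - i, len(s) - j) < min_overlap:
            return -1                     # early termination
        if r[i] == s[j]:
            overlap += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return overlap

def verify_jaccard(r, s, threshold):
    """Jaccard(r, s) >= t  iff  overlap(r, s) >= t / (1 + t) * (|r| + |s|)."""
    min_overlap = math.ceil(threshold / (1 + threshold) * (len(r) + len(s)))
    return overlap_with_early_stop(r, s, min_overlap) >= min_overlap

print(verify_jaccard(["a", "b", "c"], ["a", "b", "d"], 0.5))   # True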
advances in databases and information systems | 2008
Leonardo Andrade Ribeiro; Theo Härder
A similarity join that correlates XML document fragments similar in structure and content can be used as the core algorithm to support data cleaning and data integration tasks. For this reason, built-in support for such an operator in an XML database management system (XDBMS) is very attractive. However, similarity assessment is especially difficult on XML datasets because, besides textual information, the structure itself may vary across XML documents representing the same real-world entity. Moreover, similarity computation is considerably more expensive for tree-structured objects and should therefore be a prime optimization candidate. In this paper, we explore and optimize tree-based similarity joins and analyze their performance and accuracy when embedded in native XDBMSs.
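A common way to make tree-structured data amenable to set similarity joins is to serialize each XML fragment into a set of tokens that reflect both structure and content, and then compare the token sets. The sketch below uses simple root-to-node path tokens plus path/word tokens; it only illustrates this general idea and is not the tree similarity measure evaluated in the paper.

import xml.etree.ElementTree as ET

def xml_to_token_set(xml_string):
    """Serialize an XML fragment into a set of tokens capturing both
    structure (root-to-node label paths) and content (path/word pairs),
    so that a set similarity measure approximates tree similarity."""
    tokens = set()

    def walk(node, path):
        path = path + "/" + node.tag
        tokens.add(path)                           # structural token
        text = (node.text or "").strip().lower()
        for word in text.split():
            tokens.add(path + "#" + word)          # content token
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_string), "")
    return tokens

a = xml_to_token_set("<book><title>XML Joins</title><year>2008</year></book>")
b = xml_to_token_set("<book><title>XML Join</title><year>2008</year></book>")
print(len(a & b) / len(a | b))   # Jaccard over structural and content tokens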
international database engineering and applications symposium | 2009
Leonardo Andrade Ribeiro; Theo Härder; Fernanda S. Pimenta
A natural consequence of the widespread adoption of XML as a standard for information representation and exchange is the redundant storage of large amounts of persistent XML documents. Compared to relational data tables, data represented in XML format can potentially be even more sensitive to data quality issues because, besides textual information, the structure itself may cause variations in XML documents representing the same information entity. Therefore, correlating XML documents that are similar in content and structure is a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations: our approach outperforms previous work in terms of both performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, when the similarity framework is configured for a specific domain, and the updatability of clustering information, when new documents enter the datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.
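One simple way to combine evidence from the textual and structural representations is a weighted linear mixture of the two similarity scores, as sketched below. The weight alpha and the component similarity functions are assumptions made for illustration; the paper explores several weighting and combination schemes.

def combined_similarity(textual_sim, structural_sim, alpha=0.6):
    """Linear combination of the two evidence sources. The weight alpha
    is a hypothetical tuning knob: 1.0 trusts only textual evidence,
    0.0 only structural evidence."""
    return alpha * textual_sim + (1.0 - alpha) * structural_sim

# Two documents with nearly identical text but rather different structure.
print(combined_similarity(textual_sim=0.9, structural_sim=0.4))   # 0.7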
acm symposium on applied computing | 2015
Christiane Faleiro Sidney; Diego Sarmento Mendes; Leonardo Andrade Ribeiro; Theo Härder
Query performance prediction is essential for many important tasks in cloud-based database management, including resource provisioning, admission control, and pricing. Recently, there has been some work on building prediction models to estimate the execution time of traditional SQL queries. While suitable for typical OLTP/OLAP workloads, these existing approaches are insufficient to model the performance of complex data processing activities for deep analytics, such as data cleaning and integration. These activities are largely based on similarity operations, which are radically different from regular relational operators. In this paper, we consider prediction models for set similarity joins. We exploit knowledge of optimization techniques and design details commonly found in set similarity join algorithms to identify relevant features, which are then used to construct prediction models based on statistical machine learning. An extensive experimental evaluation confirms the accuracy of our approach.
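The sketch below illustrates the general recipe of feature-based prediction with statistical machine learning: extract algorithm-aware features of a join configuration and fit a regression model on observed runtimes. The feature set, the synthetic training values, and the choice of a random forest are assumptions made for this example, not the features or model reported in the paper.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical algorithm-aware features per join configuration:
# [number of sets, average set size, similarity threshold, estimated candidates]
X_train = np.array([
    [10_000,   8, 0.9,    50_000],
    [10_000,   8, 0.7,   400_000],
    [100_000, 12, 0.9,   900_000],
    [100_000, 12, 0.7, 7_000_000],
])
y_train = np.array([0.4, 1.9, 6.5, 48.0])   # synthetic runtimes in seconds

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the runtime of an unseen join configuration.
print(model.predict(np.array([[50_000, 10, 0.8, 1_500_000]])))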
database and expert systems applications | 2016
Leonardo Andrade Ribeiro; Alfredo Cuzzocrea; Karen Aline Alves Bezerra; Ben Hur Bahia do Nascimento
Data cleaning and integration rely on duplicate record identification, which aims at detecting records that represent the same real-world entity. A similarity join is widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is strictly used as a post-processing step, which slows down the overall performance, and final results are only produced at the end of the whole process. Motivated by this observation, in this paper we propose and experimentally assess SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm; an optimization step is further applied on top of this framework. Experimental results, derived from an extensive experimental campaign, show that we outperform the original set similarity join algorithm by an order of magnitude in most settings.
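The following sketch conveys the underlying idea of merging clusters while the join is running: each incoming record is compared against cluster representatives and absorbed into a sufficiently similar cluster, instead of clustering the matching pairs afterwards. The representative used here (the union of member tokens) and the merging rule are deliberate simplifications, not SjClust's cluster representations or optimization techniques.

def jaccard(r, s):
    inter = len(r & s)
    return inter / (len(r) + len(s) - inter)

def cluster_while_joining(records, threshold):
    """Group records into clusters on the fly: each record is compared
    against existing cluster representatives (here, simply the union of
    member tokens) and absorbed into the first sufficiently similar one."""
    clusters = []                        # (representative token set, member ids)
    for rid, tokens in enumerate(records):
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                rep |= tokens            # merge the record into the cluster
                members.append(rid)
                break
        else:
            clusters.append((set(tokens), [rid]))
    return [members for _, members in clusters]

records = [{"ana", "silva", "goiania"},
           {"ana", "m", "silva", "goiania"},
           {"bruno", "costa", "rio"}]
print(cluster_while_joining(records, 0.6))   # [[0, 1], [2]]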
international conference on enterprise information systems | 2017
Rafael David Quirino; Sidney R. Junior; Leonardo Andrade Ribeiro; Wellington Santos Martins
Set similarity join is a core operation for text data integration, cleaning, and mining. Most state-of-the-art solutions rely on inherently sequential, CPU-based algorithms. In this paper, we propose a parallel algorithm for the set similarity join problem that harnesses the power of GPU systems through filtering techniques and divide-and-conquer strategies and scales well with data size. Experiments show substantial speedups over the fastest algorithms in the literature.
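As a rough illustration of data-parallel scoring on a GPU, the sketch below uses Numba's CUDA support to evaluate one candidate pair per thread against a dense 0/1 record-by-token matrix. This simplification requires an NVIDIA GPU with the CUDA toolkit and omits the paper's filtering and divide-and-conquer strategies; it only shows how pair verification maps naturally onto massively parallel hardware.

import numpy as np
from numba import cuda

@cuda.jit
def jaccard_kernel(bitmatrix, pairs, scores):
    """One thread per candidate pair: count shared and total tokens over
    a dense 0/1 record-by-token matrix."""
    k = cuda.grid(1)
    if k < pairs.shape[0]:
        a = pairs[k, 0]
        b = pairs[k, 1]
        inter = 0
        union = 0
        for t in range(bitmatrix.shape[1]):
            inter += bitmatrix[a, t] & bitmatrix[b, t]
            union += bitmatrix[a, t] | bitmatrix[b, t]
        scores[k] = inter / union if union > 0 else 0.0

# Toy input: 3 records over a vocabulary of 4 tokens.
bitmatrix = np.array([[1, 1, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 1]], dtype=np.int32)
pairs = np.array([[0, 1], [0, 2], [1, 2]], dtype=np.int32)

d_scores = cuda.to_device(np.zeros(len(pairs), dtype=np.float32))
threads = 64
blocks = (len(pairs) + threads - 1) // threads
jaccard_kernel[blocks, threads](cuda.to_device(bitmatrix),
                                cuda.to_device(pairs),
                                d_scores)
print(d_scores.copy_to_host())   # [0.5, 0.25, 0.25]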
international conference on enterprise information systems | 2016
Leonardo Andrade Ribeiro; Alfredo Cuzzocrea; Karen Aline Alves Bezerra; Ben Hur Bahia do Nascimento
A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is strictly used as a post-processing step, which slows down the overall performance and only produces results at the end of the whole process. In this paper, we propose SjClust, a framework to integrate similarity join and clustering into a single operation. Our approach allows a variety of cluster representations and merging strategies to be smoothly accommodated into set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
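For contrast, the conventional two-stage pipeline that SjClust aims to improve on can be sketched as follows: run the similarity join first, then group the matching pairs with a union-find structure in a post-processing step. The clustering logic shown is generic and only stands in for the clustering algorithms such a pipeline might actually use.

def group_matches(num_records, matching_pairs):
    """Post-processing step of the two-stage pipeline: group records
    connected by reported matches using a union-find structure."""
    parent = list(range(num_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in matching_pairs:             # pairs reported by the join
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = {}
    for r in range(num_records):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

# A similarity join over 5 records reported three matching pairs.
print(group_matches(5, [(0, 1), (1, 2), (3, 4)]))   # [[0, 1, 2], [3, 4]]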
advances in databases and information systems | 2012
Leonardo Andrade Ribeiro; Theo Härder
XML is widely applied to describe the semi-structured data commonly generated and used by modern information systems. XML database management systems (XDBMSs) are thus essential platforms in this context. Most XDBMS architectures proposed so far aim at reproducing functionalities found in relational systems. As such, these architectures inherit the same deficiency of traditional systems in dealing with less-structured data. What is badly needed is efficient support of common database operations under the similarity matching paradigm. In this paper, we present an engineering approach to incorporating similarity joins into XDBMSs, which exploits XDBMS components, in particular the storage layer, to design efficient algorithms. We experimentally confirm the accuracy, performance, and scalability of our approach.
advances in databases and information systems | 2018
Diego Junior do Carmo Oliveira; Felipe Ferreira Borges; Leonardo Andrade Ribeiro; Alfredo Cuzzocrea
A set similarity join finds all similar pairs from a collection of sets. This operation is essential for many important tasks in Big Data analytics including string data integration and cleaning. The vast majority of set similarity join algorithms proposed so far considers string data represented by a single set over which a simple similarity predicate is defined. However, real data is typically multi-attribute and, thus, better represented by multiple sets. Such a representation requires complex expressions to capture a given notion of similarity. Moreover, similarity join processing under this new formulation is clearly more expensive, which calls for distributed algorithms to deal with large datasets. In this paper, we present a distributed algorithm for set similarity joins with complex similarity expressions. Our approach supports complex Boolean expressions over multiple predicates. We propose a simple, but effective data partitioning strategy to reduce both communication and computation costs. We have implemented our algorithm in Spark, a popular distributed data processing engine. Experimental results show that the proposed approach is efficient and scalable.
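The PySpark sketch below illustrates one generic way to distribute a set similarity join: replicate each record under its prefix tokens, co-locate records that share a token, and verify pairs locally. It handles only a single set per record with a plain Jaccard predicate and does not reproduce the paper's partitioning strategy or its support for complex Boolean expressions over multiple attributes.

from pyspark.sql import SparkSession

THRESHOLD = 0.5

def jaccard(r, s):
    inter = len(r & s)
    return inter / (len(r) + len(s) - inter)

def prefix_keys(record):
    """Replicate a record under each token of its prefix (global order:
    plain lexicographic, for simplicity)."""
    rid, tokens = record
    ordered = sorted(tokens)
    prefix_len = len(ordered) - int(len(ordered) * THRESHOLD) + 1
    return [(t, (rid, frozenset(tokens))) for t in ordered[:prefix_len]]

def verify(group):
    """Locally verify all pairs of records that share a prefix token."""
    _, records = group
    records = list(records)
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            (id1, s1), (id2, s2) = records[i], records[j]
            if jaccard(s1, s2) >= THRESHOLD:
                yield (min(id1, id2), max(id1, id2))

spark = SparkSession.builder.appName("set-similarity-join").getOrCreate()
data = [(0, {"a", "b", "c"}), (1, {"a", "b", "d"}), (2, {"x", "y"})]

pairs = (spark.sparkContext.parallelize(data)
         .flatMap(prefix_keys)      # replicate records by prefix token
         .groupByKey()              # co-locate records sharing a token
         .flatMap(verify)           # local verification
         .distinct())               # the same pair may appear under several keys
print(pairs.collect())              # [(0, 1)]
spark.stop()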