
Publication


Featured research published by Demetrio Gomes Mestre.


International Symposium on Computers and Communications | 2013

Improving load balancing for MapReduce-based entity matching

Demetrio Gomes Mestre; Carlos Eduardo Santos Pires

The effectiveness and scalability of MapReduce-based implementations for data-intensive tasks depend on the data assignment made from map to reduce tasks. The robustness of this assignment strategy is crucial for handling skewed data and achieving a balanced workload distribution among all reduce tasks. For the entity matching problem in the Big Data context, we propose BlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach utilizes a preprocessing MapReduce job to analyze the data distribution and provides improved load balancing by applying an efficient block slicing strategy as well as a well-known optimization algorithm to assign the generated match tasks. We evaluate the approach against an existing one that addresses the same problem on a real cloud infrastructure. The results show that our approach significantly increases the performance of the distributed entity matching task by reducing the amount of data generated in the map phase and diminishing the overall execution time.
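
The general idea behind such block-slice-based load balancing can be illustrated with a short Python sketch (hypothetical code, not the paper's implementation): a preprocessing pass counts block sizes, oversized blocks are cut into slices, and the resulting match tasks are assigned to reducers with a greedy longest-processing-time heuristic.

    from collections import defaultdict
    import heapq

    def block_sizes(records, blocking_key):
        """Preprocessing pass: count how many records fall into each block."""
        sizes = defaultdict(int)
        for r in records:
            sizes[blocking_key(r)] += 1
        return sizes

    def slice_blocks(sizes, max_block):
        """Split oversized blocks into slices so no single match task dominates."""
        tasks = []
        for key, n in sizes.items():
            slices = -(-n // max_block)  # ceiling division
            for i in range(slices):
                for j in range(i, slices):
                    # each (i, j) pair of slices becomes one match task; its cost is
                    # roughly the number of record comparisons it implies
                    cost = max_block * max_block if i != j else max_block * (max_block - 1) // 2
                    tasks.append(((key, i, j), cost))
        return tasks

    def assign_tasks(tasks, num_reducers):
        """Greedy LPT assignment: give the next-largest task to the least-loaded reducer."""
        heap = [(0, r, []) for r in range(num_reducers)]
        heapq.heapify(heap)
        for task, cost in sorted(tasks, key=lambda t: -t[1]):
            load, r, assigned = heapq.heappop(heap)
            assigned.append(task)
            heapq.heappush(heap, (load + cost, r, assigned))
        return {r: assigned for _, r, assigned in heap}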


ACM Symposium on Applied Computing | 2015

Adaptive sorted neighborhood blocking for entity matching with MapReduce

Demetrio Gomes Mestre; Carlos Eduardo S. Pires; Dimas Cassimiro Nascimento

Cloud computing has proven to be a powerful ally for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies on the challenges and possible solutions of how EM can benefit from the cloud computing paradigm have become an important demand nowadays. In this paper, we investigate how the MapReduce programming model can be used to perform efficient parallel EM using a variation of the Sorted Neighborhood Method (SNM) with a varying-size (adaptive) window. We propose the MapReduce Duplicate Count Strategy (MR-DCS++), an efficient MapReduce-based approach for adaptive SNM, aiming to further increase the performance of SNM. The evaluation results, based on real-world datasets and a real cloud infrastructure, show that our approach increases the performance of MapReduce-based SNM by providing better EM execution times.
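
A single-machine sketch of the underlying adaptive window idea (hypothetical Python, assuming a Duplicate Count Strategy style rule; the paper's contribution is its MapReduce formulation): the window around each record grows while the ratio of duplicates found in it stays above a threshold.

    def adaptive_snm(records, sort_key, sim, threshold, w=4, phi=0.3):
        """Sketch of an adaptive-window Sorted Neighborhood pass.

        records   : list of record dicts
        sort_key  : function producing the sorting key
        sim       : similarity function returning a value in [0, 1]
        threshold : minimum similarity for declaring a match
        w         : initial window size
        phi       : duplicate ratio that triggers window enlargement
        """
        recs = sorted(records, key=sort_key)
        matches = []
        for i in range(len(recs)):
            window, duplicates, j = w, 0, i + 1
            while j < len(recs) and j < i + window:
                if sim(recs[i], recs[j]) >= threshold:
                    matches.append((recs[i], recs[j]))
                    duplicates += 1
                    # many duplicates in this neighbourhood: widen the window
                    if duplicates / (j - i) >= phi:
                        window += 1
                j += 1
        return matches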


Applied Intelligence | 2016

Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments

Dimas Cassimiro Nascimento; Carlos Eduardo S. Pires; Demetrio Gomes Mestre

Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been largely investigated and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets become larger, such algorithms may face critical bottlenecks regarding memory usage and execution time. In this context, cloud computing environments have been used for scaling out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters for the execution of deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation and Sliced Training Data) which, together with the machine learning techniques, are able to tune the virtual cluster estimations as demands fluctuate over time. The experiments we have carried out using data sets of multiple scales have provided many insights regarding the adequacy of the considered machine learning algorithms and the proposed heuristics for tackling cloud computing provisioning.
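
As a rough illustration of how a learned model can drive cluster sizing under a deadline (a toy sketch with made-up training data, not the paper's heuristics), one can regress runtime on dataset size and VM count and pick the smallest cluster whose predicted runtime meets the restriction:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Historical executions: (dataset size, number of VMs) -> observed runtime in seconds.
    # The figures below are invented for illustration only.
    X = np.array([[1e5, 2], [1e5, 4], [5e5, 4], [5e5, 8], [1e6, 8], [1e6, 16]])
    y = np.array([620.0, 340.0, 1500.0, 800.0, 1700.0, 950.0])

    model = LinearRegression().fit(X, y)

    def smallest_cluster(dataset_size, deadline, max_vms=32):
        """Return the smallest VM count whose predicted runtime meets the deadline."""
        for vms in range(1, max_vms + 1):
            predicted = model.predict(np.array([[dataset_size, vms]]))[0]
            if predicted <= deadline:
                return vms
        return max_vms  # deadline cannot be met; fall back to the largest cluster

    print(smallest_cluster(dataset_size=7e5, deadline=1000))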


ACM Symposium on Applied Computing | 2015

A data quality-aware cloud service based on metaheuristic and machine learning provisioning algorithms

Dimas C. Nascimento; Carlos Eduardo S. Pires; Demetrio Gomes Mestre

Cloud computing as a service has become a topic of increasing interest. The outsourcing of duties and infrastructure to external parties has become a crucial concept for many business models. In this paper we discuss the design and experimental evaluation of provisioning algorithms, in a Data Quality-aware Service (DQaS) context, that enable dynamic Data Quality Service Level Agreement (DQSLA) management and optimization of cloud resources. The DQaS has been designed to respond effectively to the DQSLA requirements of the service customers by minimizing SLA penalties and provisioning the cloud infrastructure for the execution of data quality algorithms. An experimental evaluation of the proposed provisioning algorithms, carried out through simulation, has provided very encouraging results that confirm the adequacy of these algorithms in the DQaS context.
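
A minimal sketch of the trade-off such provisioning algorithms navigate (hypothetical Python; the runtime model, cost figures, and penalty model are illustrative assumptions, not the DQaS algorithms themselves):

    def choose_allocation(predict_runtime, sla_deadline, penalty, vm_cost_per_hour, max_vms=32):
        """Pick the VM count that minimises infrastructure cost plus SLA penalty."""
        best_vms, best_cost = 1, float("inf")
        for vms in range(1, max_vms + 1):
            runtime_s = predict_runtime(vms)
            cost = vms * (runtime_s / 3600.0) * vm_cost_per_hour
            if runtime_s > sla_deadline:
                cost += penalty  # SLA violated: add the contractual penalty
            if cost < best_cost:
                best_vms, best_cost = vms, cost
        return best_vms, best_cost

    # Toy runtime model: fixed amount of work, imperfect parallel speed-up.
    vms, cost = choose_allocation(lambda v: 7200.0 / v + 120.0, sla_deadline=1800,
                                  penalty=50.0, vm_cost_per_hour=0.25)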


International Database Engineering and Applications Symposium | 2017

Towards Reliable Data Analyses for Smart Cities

Tiago Brasileiro Araújo; Cinzia Cappiello; Nadia Puchalski Kozievitch; Demetrio Gomes Mestre; Carlos Eduardo S. Pires; Monica Vitali

As cities become green and smart, public information systems are being revamped to adopt digital technologies. There are several sources (official or not) that can provide information related to a city. The availability of multiple sources enables the design of advanced analyses for offering valuable services to both citizens and municipalities. However, such analyses would fail if the considered data were affected by errors and uncertainties: Data Quality is one of the main requirements for the successful exploitation of the available information. This paper highlights the importance of Data Quality evaluation in the context of geographical data sources. Moreover, we describe how the Entity Matching task can provide additional information to refine the quality assessment and, consequently, obtain a better evaluation of the reliability of the data sources. Data gathered from the public transportation system and urban areas of Curitiba, Brazil, are used to show the strengths and effectiveness of the presented approach.
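
One simple way Entity Matching can feed a reliability assessment, sketched in hypothetical Python (not the paper's method): once records from two sources are matched, the rate at which their attribute values agree can serve as a rough proxy for mutual reliability.

    def source_agreement(matched_pairs, attributes):
        """Fraction of compared attribute values on which two matched sources agree.

        matched_pairs : list of (record_from_source_a, record_from_source_b) dicts
        attributes    : attribute names present in both sources
        """
        agree = total = 0
        for a, b in matched_pairs:
            for attr in attributes:
                if attr in a and attr in b:
                    total += 1
                    agree += (a[attr] == b[attr])
        return agree / total if total else 0.0

    pairs = [({"name": "Rua XV", "lines": 12}, {"name": "Rua XV", "lines": 11})]
    print(source_agreement(pairs, ["name", "lines"]))  # 0.5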


IEEE International Conference on Cloud Computing Technology and Science | 2015

Data Quality Monitoring of Cloud Databases Based on Data Quality SLAs

Dimas Cassimiro Nascimento; Carlos Eduardo S. Pires; Demetrio Gomes Mestre

This chapter provides an overview of the tasks related to the continuous process of monitoring the quality of cloud databases as their content is modified over time. In the Software as a Service context, this process must be guided by data quality service level agreements, which aim to specify customers’ requirements regarding the process of data quality monitoring. In practice, factors such as the Big Data scale, lack of data structure, strict service level agreement requirements, and the velocity of the changes over the data imply many challenges for an effective accomplishment of this process. In this context, we present a high-level architecture of a cloud service, which employs cloud computing capabilities in order to tackle these challenges, as well as the technical and research problems that may be further explored to allow an effective deployment of the presented service.
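
A minimal sketch of such a monitoring loop (hypothetical Python; the sampling function, the quality indicator, and the reaction to a violation are placeholders): a quality dimension is estimated periodically and compared against the DQSLA threshold.

    import time

    def monitor(sample_records, estimate_duplicate_rate, dqsla_max_dup_rate,
                interval_s=3600, rounds=24):
        """Periodically estimate a quality indicator and flag DQSLA violations."""
        for _ in range(rounds):
            rate = estimate_duplicate_rate(sample_records())
            if rate > dqsla_max_dup_rate:
                # a real service would trigger a cleaning job and/or a customer report here
                print(f"DQSLA violation: duplicate rate {rate:.2%} exceeds {dqsla_max_dup_rate:.2%}")
            time.sleep(interval_s)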


ACM Symposium on Applied Computing | 2018

Blind attribute pairing for privacy-preserving record linkage

Thiago Pereira da Nóbrega; Carlos Eduardo S. Pires; Tiago Brasileiro Araújo; Demetrio Gomes Mestre

In many scenarios, it is necessary to identify records referring to the same real-world object across different data sources (Record Linkage). Yet, such a need is often in contrast with privacy requirements (e.g., identifying patients with the same diseases, genome matching, and fraud detection). Thus, in cases where the parties interested in the Record Linkage process need to preserve the privacy of their data, Privacy-Preserving Record Linkage (PPRL) approaches are applied to address the privacy problem. The first step of PPRL is the agreement of the parties about the data (attributes) that will be used during the record linkage process. To reach such an agreement, the parties must share information about their data schemas, which in turn can be exploited to break data privacy. To overcome this vulnerability caused by schema information sharing, we propose a novel privacy-preserving approach for attribute pairing to aid PPRL applications. Empirical experiments demonstrate that our privacy-preserving approach considerably improves efficiency and effectiveness in comparison to a state-of-the-art baseline.
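
One way to pair attributes without exchanging schema details, sketched in hypothetical Python (not necessarily the paper's protocol): each party publishes only a privacy-preserving fingerprint per attribute, for example a Bloom filter over q-grams of a value sample, and attributes are paired by fingerprint similarity.

    import hashlib

    def bloom(values, q=2, bits=1024, hashes=3):
        """Bloom-filter fingerprint of the q-grams of a sample of attribute values."""
        bf = [0] * bits
        for v in values:
            v = str(v).lower()
            for i in range(max(1, len(v) - q + 1)):
                gram = v[i:i + q]
                for h in range(hashes):
                    digest = hashlib.sha1(f"{h}:{gram}".encode()).hexdigest()
                    bf[int(digest, 16) % bits] = 1
        return bf

    def dice(a, b):
        """Dice similarity of two bit vectors."""
        inter = sum(x & y for x, y in zip(a, b))
        return 2 * inter / (sum(a) + sum(b)) if (sum(a) + sum(b)) else 0.0

    def pair_attributes(fingerprints_a, fingerprints_b):
        """Greedily pair the two parties' attributes by fingerprint similarity."""
        pairs, used = [], set()
        for name_a, fa in fingerprints_a.items():
            best = max(((dice(fa, fb), name_b) for name_b, fb in fingerprints_b.items()
                        if name_b not in used), default=(0.0, None))
            if best[1] is not None:
                pairs.append((name_a, best[1], best[0]))
                used.add(best[1])
        return pairs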


Journal of Systems and Software | 2018

Heuristic-based approaches for speeding up incremental record linkage

Dimas Cassimiro Nascimento; Carlos Eduardo S. Pires; Demetrio Gomes Mestre

Record Linkage is the task of processing a dataset in order to identify which records refer to the same real-world entity. The intrinsic complexity of this task brings many challenges to traditional or naive approaches, especially in contexts such as Big Data, unstructured data, and frequent data increments over the dataset. To deal with these contexts, especially the latter, an incremental record linkage approach may be employed in order to avoid (re)processing the entire dataset to update the deduplication results. For this purpose, different classification techniques can be employed to identify duplicate entities. Recently, many algorithms have been proposed that combine collective classification, which employs clustering algorithms, with the incremental principle. In this article, we propose new metrics for incremental record linkage using collective classification and new heuristics (which combine clustering, coverage component filters, and a greedy approach) to further speed up incremental record linkage. These heuristics have been evaluated using three datasets of different scales, and the results were analyzed and discussed based on both classical and the newly proposed metrics. The experiments present different trade-offs, regarding efficacy and efficiency, generated by the considered heuristics. The results also indicate that, for large and frequent data increments, it is possible to slightly reduce efficacy by employing a coverage filter-based heuristic that is reasonably faster than the current state-of-the-art approach. In turn, it is also possible to employ single-pass clustering algorithms, which execute significantly faster than the state-of-the-art approach at the cost of sacrificing precision.
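
The incremental principle can be illustrated with a small sketch (hypothetical Python, not the proposed heuristics): when an increment arrives, only the clusters sharing a block with the new records are revisited, instead of re-running linkage over the whole dataset.

    def incremental_link(clusters, increment, blocking_key, match):
        """Re-link only the clusters touched by the new records.

        clusters     : dict mapping integer cluster ids to lists of records
        increment    : list of new records
        blocking_key : function mapping a record to its block
        match        : function deciding whether a record belongs to a cluster
        """
        # index existing clusters by blocking key so only candidate clusters are revisited
        by_block = {}
        for cid, recs in clusters.items():
            for r in recs:
                by_block.setdefault(blocking_key(r), set()).add(cid)

        next_id = max(clusters, default=0) + 1
        for rec in increment:
            candidates = by_block.get(blocking_key(rec), set())
            target = next((cid for cid in candidates if match(rec, clusters[cid])), None)
            if target is None:  # no matching cluster: the record starts a new one
                target, next_id = next_id, next_id + 1
                clusters[target] = []
            clusters[target].append(rec)
            by_block.setdefault(blocking_key(rec), set()).add(target)
        return clusters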


Journal of Systems and Software | 2017

An efficient Spark-based adaptive windowing for entity matching

Demetrio Gomes Mestre; Carlos Eduardo S. Pires; Dimas Cassimiro Nascimento; Andreza Raquel Queiroz; Veruska Borges Santos; Tiago Brasileiro Araújo


Journal of Information and Data Management | 2014

Efficient Entity Matching over Multiple Data Sources with MapReduce

Demetrio Gomes Mestre; Carlos Eduardo S. Pires

Collaboration


Dive into Demetrio Gomes Mestre's collaboration.

Top Co-Authors

Carlos Eduardo S. Pires (Federal University of Campina Grande)
Dimas Cassimiro Nascimento (Federal University of Campina Grande)
Tiago Brasileiro Araújo (Federal University of Campina Grande)
Andreza Raquel Queiroz (Federal University of Campina Grande)
Veruska Borges Santos (Federal University of Campina Grande)
Dimas C. Nascimento (Universidade Federal Rural de Pernambuco)
Thiago Pereira da Nóbrega (Federal University of Campina Grande)