Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Patricia C. Arocena is active.

Publication


Featured research published by Patricia C. Arocena.


Very Large Data Bases (VLDB) | 2015

The iBench integration metadata generator

Patricia C. Arocena; Boris Glavic; Radu Ciucanu; Renée J. Miller

Given the maturity of the data integration field, it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work: the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide range of integration tasks (data exchange, mapping creation, mapping composition, and schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.
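
For context, an iBench scenario bundles schemas, constraints, and mappings, with mappings typically expressed as source-to-target tgds. A minimal example of such a dependency, in standard data-exchange notation (an illustration, not an example taken from the paper):

    \forall n\, \forall a\, \bigl( \mathrm{Emp}(n, a) \rightarrow \exists d\, ( \mathrm{WorksIn}(n, d) \wedge \mathrm{Dept}(d, a) ) \bigr)

A generator must be able to emit thousands of such dependencies while letting the user control characteristics such as relation arity, the number of existential variables, and the degree to which mappings share schema elements.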


International Conference on Database Theory (ICDT) | 2010

Composing local-as-view mappings: closure and applications

Patricia C. Arocena; Ariel Fuxman; Renée J. Miller

Schema mapping composition is a fundamental operation in schema management and data exchange. The mapping composition problem has been extensively studied for a number of mapping languages, most notably source-to-target tuple-generating dependencies (s-t tgds). An important class of s-t tgds are local-as-view (LAV) tgds. This class of mappings is prevalent in practical data integration and exchange systems, and recent work by ten Cate and Kolaitis shows that such mappings possess desirable structural properties. It is known that s-t tgds are not closed under composition: given two mappings expressed with s-t tgds, their composition may not be definable by any set of s-t tgds (and, in general, may not be expressible in first-order logic). Despite the importance and extensive use of LAV mappings in data integration and exchange systems, their closure properties under composition had remained open. The most important contribution of this paper is to show that LAV tgds are closed under composition and to provide an algorithm that directly computes the composition. An important application of this result is in deciding whether, given a LAV mapping Mst from schema S to schema T and a LAV mapping Mts from T back to S, the composition of Mst and Mts recovers the information in any instance of S. Arenas et al. formalized this notion and showed that general s-t tgd mappings always have a recovery; hence, a LAV mapping always has a recovery. However, testing whether a given Mts is a recovery of Mst is known to be undecidable for general s-t tgds. In contrast, we show that the problem is tractable for LAV mappings and give a polynomial-time algorithm to solve it.
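
To make the closure result concrete, here is a toy composition in standard notation (an illustration, not an example from the paper). Consider the LAV mappings

    M_{12} = \{\, A(x) \rightarrow \exists y\, B(x, y) \,\} \qquad M_{23} = \{\, B(x, y) \rightarrow C(x) \,\}

Their composition relates exactly those instance pairs (I, K) for which some intermediate instance J satisfies both mappings, and it is again definable by a LAV tgd:

    M_{12} \circ M_{23} = \{\, A(x) \rightarrow C(x) \,\}

The paper shows that such a LAV rewriting always exists and gives an algorithm that computes it directly.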


Very Large Data Bases (VLDB) | 2015

Messing up with BART: error generation for evaluating data-cleaning algorithms

Patricia C. Arocena; Boris Glavic; Giansalvatore Mecca; Renée J. Miller; Paolo Papotti; Donatello Santoro

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show that the error-generation problem is surprisingly challenging and, in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
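
As a rough illustration of constraint-aware error generation, the sketch below injects violations of a functional dependency (FD) so that every introduced error is detectable by a constraint checker. This is hypothetical Python written for this summary, not the BART implementation or its API:

    import random

    def inject_fd_violations(rows, lhs, rhs, n_errors, seed=0):
        """Corrupt `rhs` values so each change violates the FD lhs -> rhs."""
        rng = random.Random(seed)
        rows = [dict(r) for r in rows]  # leave the clean input untouched
        # Group tuple indices by their left-hand-side value.
        groups = {}
        for i, row in enumerate(rows):
            groups.setdefault(row[lhs], []).append(i)
        # Only groups with two or more tuples can yield a detectable
        # violation: the corrupted tuple must disagree with a witness.
        candidates = [g for g in groups.values() if len(g) >= 2]
        for group in rng.sample(candidates, min(n_errors, len(candidates))):
            victim = rows[rng.choice(group)]
            victim[rhs] = str(victim[rhs]) + "_ERR"  # conflicts with its group
        return rows

    dirty = inject_fd_violations(
        [{"zip": "10001", "city": "New York"},
         {"zip": "10001", "city": "New York"},
         {"zip": "60616", "city": "Chicago"}],
        lhs="zip", rhs="city", n_errors=1)

The real system handles much richer constraints (denial constraints) and the scalability optimizations the abstract mentions; the sketch only conveys why generation is harder than random perturbation: every injected change must be provably detectable.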


International Conference on Management of Data (SIGMOD) | 2013

Value invention in data exchange

Patricia C. Arocena; Boris Glavic; Renée J. Miller

The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple-generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the unskolemization algorithm of Nash, Bernstein, and Melnik by detecting when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages, including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches; more importantly, we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
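
A toy illustration of the underlying phenomenon, in standard notation (the classical unskolemization case, not the paper's algorithms): an SO tgd whose Skolem term takes all universally quantified variables,

    \exists f\, \forall x\, \forall y\, \bigl( R(x, y) \rightarrow S(x, f(x, y)) \bigr)

has FO semantics, being equivalent to the plain s-t tgd \forall x\, \forall y\, ( R(x, y) \rightarrow \exists z\, S(x, z) ). When a Skolem term takes fewer arguments, as in

    \exists f\, \forall x\, \forall y\, \bigl( R(x, y) \rightarrow S(x, f(x)) \wedge T(f(x), y) \bigr)

the single invented value shared across all y-partners of a given x can still be captured in FO, this time by a nested tgd:

    \forall x\, \Bigl( \exists y\, R(x, y) \rightarrow \exists z\, \bigl( S(x, z) \wedge \forall y\, ( R(x, y) \rightarrow T(z, y) ) \bigr) \Bigr)

Here the Skolem argument set {x} is included in the full variable set {x, y}, the kind of linear relationship the paper's first technique exploits.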


Very Large Data Bases (VLDB) | 2015

Gain control over your integration evaluations

Patricia C. Arocena; Radu Ciucanu; Boris Glavic; Renée J. Miller

Integration systems are typically evaluated using a few real-world scenarios (e.g., bibliographical or biological datasets) or using synthetic scenarios (e.g., based on star schemas or other patterns for schemas and constraints). Reusing such evaluations is a cumbersome task because their focus is usually limited to showcasing a specific feature of an approach. This makes it difficult to compare integration solutions, understand their generality, and understand their performance for different application scenarios. Based on this observation, we demonstrate some of the requirements for developing integration benchmarks. We argue that the major abstractions used for integration problems have converged in the last decade, which enables the application of robust empirical methods to integration problems (from schema evolution, to data exchange, to answering queries using views, and many more). Specifically, we demonstrate that schema mappings are the main abstraction that now drives most integration solutions and show how a metadata generator can be used to create more credible evaluations of the performance and scalability of data integration systems. We will use the demonstration to evangelize for more robust, shared empirical evaluations of data integration systems.
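
A minimal sketch of the kind of controlled evaluation the demonstration argues for: sweep one generator parameter at a time and time each system on the resulting scenarios. The names generate_scenario and run_system are hypothetical stand-ins for a metadata generator (such as iBench) and a system under test; they are not real APIs:

    import time

    def evaluate(systems, sizes, generate_scenario, run_system):
        """Return (system name, scenario size, seconds) for every combination."""
        results = []
        for n in sizes:
            # Vary a single knob so that differences are attributable to it.
            scenario = generate_scenario(num_relations=n)
            for name, system in systems.items():
                start = time.perf_counter()
                run_system(system, scenario)
                results.append((name, n, time.perf_counter() - start))
        return results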


Business Intelligence for the Real-Time Enterprise (BIRTE) | 2012

The Vivification Problem in Real-Time Business Intelligence: A Vision

Patricia C. Arocena; Renée J. Miller; John Mylopoulos

In the new era of Business Intelligence (BI) technology, transforming massive amounts of data into high-quality business information is essential. To achieve this, two non-overlapping worlds need to be aligned: the Information Technology (IT) world, represented by an organization’s operational data sources and the technologies that manage them (data warehouses, schemas, queries, ...), and the business world, portrayed by the business plans, strategies, and goals that an organization aspires to fulfill. Alignment in this context means mapping business queries into BI queries, and interpreting the data retrieved from the BI query in business terms. We call the creation of this interpretation the vivification problem. The main thesis of this position paper is that solutions to the vivification problem should be based on a formal framework that explicates the assumptions and other ingredients (schemas, queries, etc.) that affect it, and that there should be a correctness condition specifying when a response to a business-schema query is correct. The paper defines the vivification problem in detail and sketches approaches towards a solution.


International Conference on Management of Data (SIGMOD) | 2016

BART in Action: Error Generation and Empirical Evaluations of Data-Cleaning Systems

Donatello Santoro; Patricia C. Arocena; Boris Glavic; Giansalvatore Mecca; Renée J. Miller; Paolo Papotti

Repairing erroneous or conflicting data that violate a set of constraints is an important problem in data management. Many automatic or semi-automatic data-repairing algorithms have been proposed in the last few years, each with its own strengths and weaknesses. Bart is an open-source error-generation system conceived to support thorough experimental evaluations of these data-repairing systems. The demo is centered around three main lessons. To start, we discuss how generating errors in data is a complex problem with several facets. We introduce the important notions of the detectability and repairability of an error, which stand at the core of Bart. Then, we show how, by changing the features of errors, it is possible to influence the performance of the tools quite significantly. Finally, we put five data-repairing algorithms to work on dirty data of various kinds generated using Bart, and discuss their performance.
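
To connect the two notions: an error is detectable when some constraint fails on the dirty instance. A minimal FD checker, in the same hypothetical Python style as the injection sketch earlier (not part of the Bart codebase):

    def fd_violations(rows, lhs, rhs):
        """Return index pairs of tuples that agree on `lhs` but differ on `rhs`."""
        first_seen = {}  # lhs value -> index of the first tuple carrying it
        violations = []
        for i, row in enumerate(rows):
            key = row[lhs]
            if key in first_seen and rows[first_seen[key]][rhs] != row[rhs]:
                violations.append((first_seen[key], i))
            first_seen.setdefault(key, i)
        return violations

Roughly speaking, detectability asks whether such a checker flags the change at all, while repairability asks how much a repair algorithm could, in principle, restore: when two tuples disagree, the checker alone cannot tell which side is wrong.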


IEEE Data Engineering Bulletin | 2016

Benchmarking Data Curation Systems

Patricia C. Arocena; Boris Glavic; Giansalvatore Mecca; Renée J. Miller; Paolo Papotti; Donatello Santoro


Synthesis Lectures on Data Management | 2013

Perspectives on Business Intelligence

Raymond T. Ng; Patricia C. Arocena; Denilson Barbosa; Giuseppe Carenini; Luiz Gomes; Stephan Jou; Rock Anthony Leung; Evangelos Milios; Renée J. Miller; John Mylopoulos; Rachel Pottinger; Frank Wm. Tompa; Eric S. K. Yu


Archive | 2015

Error Generation for Evaluating Data Cleaning Algorithms

Patricia C. Arocena; Boris Glavic; Giansalvatore Mecca; Renée J. Miller; Paolo Papotti; Donatello Santoro

Collaboration


Dive into Patricia C. Arocena's collaborations.

Top Co-Authors

Boris Glavic

Illinois Institute of Technology

Paolo Papotti

Arizona State University
