
Alexandra Meliou

University of Massachusetts Amherst

Publication

Featured research published by Alexandra Meliou.

very large data bases | 2010

The complexity of causality and responsibility for query answers and non-answers

Alexandra Meliou; Wolfgang Gatterbauer; Katherine F. Moore; Dan Suciu

An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Chockler's recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility. Responsibility captures the notion of degree of causality and serves to rank potentially many causes by their relative contributions to the effect. Then, we study the complexity of computing causes and responsibilities for conjunctive queries. It is known that computing causes is NP-complete in general. Our first main result shows that all causes to conjunctive queries can be computed by a relational query which may involve negation. Thus, causality can be computed in PTIME, and very efficiently so. Next, we study computing responsibility. Here, we prove that the complexity depends on the conjunctive query and demonstrate a dichotomy between PTIME and NP-complete cases. For the PTIME cases, we give a non-trivial algorithm, consisting of a reduction to the max-flow computation problem. Finally, we prove that, even when it is in PTIME, responsibility is complete for LOGSPACE, implying that, unlike causality, it cannot be computed by a relational query.
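To make the notion of responsibility concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm (which uses a relational query for causes and a max-flow reduction for the tractable responsibility cases). It assumes the lineage is given as a list of witnesses (sets of tuple identifiers, read as a DNF formula), that every tuple is endogenous, and that responsibility is 1 / (1 + size of the smallest contingency set). The names `holds` and `responsibility` are hypothetical.

```python
from itertools import combinations

def holds(witnesses, present):
    """The query is true if at least one witness is fully present."""
    return any(w <= present for w in witnesses)

def responsibility(witnesses, tuples, t):
    """Brute-force responsibility of tuple t (exponential; illustration only).

    Searches for the smallest contingency set gamma such that the query
    still holds after removing gamma, but fails once t is removed as well.
    Assumes every tuple is endogenous.
    """
    others = [x for x in tuples if x != t]
    for k in range(len(others) + 1):
        for gamma in combinations(others, k):
            remaining = set(tuples) - set(gamma)
            if holds(witnesses, remaining) and not holds(witnesses, remaining - {t}):
                return 1 / (1 + k)
    return 0.0  # t is not a cause

# Lineage (a AND b) OR (a AND c); tuple d takes no part in any witness.
witnesses = [{"a", "b"}, {"a", "c"}]
tuples = ["a", "b", "c", "d"]
scores = {t: responsibility(witnesses, tuples, t) for t in tuples}
```

In this example `a` is a counterfactual cause, while `b` becomes counterfactual only after `c` is removed, so it receives a lower score.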

international conference on management of data | 2014

Fusing data with correlations

Ravali Pochampally; Anish Das Sarma; Xin Luna Dong; Alexandra Meliou; Divesh Srivastava

Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data and erroneous data for creating a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority or at least a certain number of sources may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.

information processing in sensor networks | 2006

Data gathering tours in sensor networks

Alexandra Meliou; David Chu; Carlos Guestrin; Joseph M. Hellerstein; Wei Hong

A basic task in sensor networks is to interactively gather data from a subset of the sensor nodes. When data needs to be gathered from a selected set of nodes in the network, existing communication schemes often behave poorly. In this paper, we study the algorithmic challenges in efficiently routing a fixed-size packet through a small number of nodes in a sensor network, picking up data as the query is routed. We show that computing the optimal routing scheme to visit a specific set of nodes is NP-complete, but we develop approximation algorithms that produce plans with costs within a constant factor of the optimum. We enhance the robustness of our initial approach to accommodate the practical issues of limited-sized packets as well as network link and node failures, and examine how different approaches behave with dynamic changes in the network topology. Our theoretical results are validated via an implementation of our algorithms on the TinyOS platform and a controlled simulation study using Matlab and TOSSIM.

international conference on management of data | 2011

Tracing data errors with view-conditioned causality

Alexandra Meliou; Wolfgang Gatterbauer; Suman Nath; Dan Suciu

A surprising query result is often an indication of errors in the query or the underlying data. Recent work suggests using causal reasoning to find explanations for the surprising result. In practice, however, one often has multiple queries and/or multiple answers, some of which may be considered correct and others unexpected. In this paper, we focus on determining the causes of a set of unexpected results, possibly conditioned on some prior knowledge of the correctness of another set of results. We call this problem View-Conditioned Causality. We adapt the definitions of causality and responsibility for the case of multiple answers/views and provide a non-trivial algorithm that reduces the problem of finding causes and their responsibility to a satisfiability problem that can be solved with existing tools. We evaluate both the accuracy and effectiveness of our approach on a real dataset of user-generated mobile device tracking data, and demonstrate that it can identify causes of error more effectively than static Boolean influence and alternative notions of causality.

international conference on management of data | 2015

Data X-Ray: A Diagnostic Tool for Data Errors

Xiaolan Wang; Xin Luna Dong; Alexandra Meliou

A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process. We develop a large-scale diagnostic framework called DATA X-RAY. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.

very large data bases | 2014

Causality and explanations in databases

Alexandra Meliou; Sudeepa Roy; Dan Suciu

With the surge in the availability of information, there is a great demand for tools that assist users in understanding their data. While today's exploration tools rely mostly on data visualization, users often want to go deeper and understand the underlying causes of a particular observation. This tutorial surveys research on causality and explanation for data-oriented applications. We will review and summarize the research thus far into causality and explanation in the database and AI communities, giving researchers a snapshot of the current state of the art on this topic, and propose a unified framework as well as directions for future research. We will cover both the theory of causality/explanation and some applications; we also discuss the connections with other topics in database research like provenance, deletion propagation, why-not queries, and OLAP techniques.

foundations of software engineering | 2017

Fairness testing: testing software for discrimination

Sainyam Galhotra; Yuriy Brun; Alexandra Meliou

This paper defines software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Evidence of software discrimination has been found in modern software systems that recommend criminal sentences, grant access to financial products, and determine who is allowed to participate in promotions. Our approach, Themis, generates efficient test suites to measure discrimination. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.
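The causal flavor of discrimination testing described above can be sketched in a few lines of Python. This is an illustrative sketch, not Themis itself: it assumes a small input schema that can be enumerated exhaustively (a real tool would sample inputs), and the names `causal_discrimination`, `biased_loan`, and `fair_loan` are hypothetical. The score is the fraction of inputs whose output changes when only the sensitive attribute is altered, so no oracle for the "correct" output is needed.

```python
from itertools import product

def causal_discrimination(decide, schema, sensitive):
    """Fraction of inputs whose decision changes when only `sensitive` changes.

    `schema` maps each attribute name to its list of valid values.
    Exhaustive enumeration is an assumption made to keep the sketch small.
    """
    names = list(schema)
    total = 0
    flagged = 0
    for values in product(*(schema[n] for n in names)):
        row = dict(zip(names, values))
        total += 1
        baseline = decide(row)
        for alt in schema[sensitive]:
            if alt != row[sensitive] and decide({**row, sensitive: alt}) != baseline:
                flagged += 1
                break  # one differing counterpart is enough to flag this input
    return flagged / total

# Hypothetical systems under test.
def biased_loan(row):
    threshold = 40 if row["gender"] == "m" else 60
    return row["income"] >= threshold

def fair_loan(row):
    return row["income"] >= 50

schema = {"gender": ["f", "m"], "income": [10, 50, 90]}
biased_score = causal_discrimination(biased_loan, schema, "gender")
fair_score = causal_discrimination(fair_loan, schema, "gender")
```

Here only the two inputs with income 50 receive different decisions across genders, so the biased system scores one third and the fair one scores zero.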

foundations of software engineering | 2013

Data debugging with continuous testing

Kıvanç Muşlu; Yuriy Brun; Alexandra Meliou

Today, systems rely as heavily on data as on the software that manipulates those data. Errors in these systems are incredibly costly, annually resulting in multi-billion dollar losses, and, on multiple occasions, in death. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: discovering system errors caused by well-formed but incorrect data. In this paper, we propose continuous data testing: using otherwise-idle CPU cycles to run test queries, in the background, as a user or database administrator modifies a database. This technique notifies the user or administrator about a data bug as quickly as possible after that bug is introduced, leading to at least three benefits: (1) The bug is discovered quickly and can be fixed before it is likely to cause a problem. (2) The bug is discovered while the relevant change is fresh in the user's or administrator's mind, increasing the chance that the underlying cause of the bug, as opposed to only the discovered side-effect, is fixed. (3) When poor documentation or company policies contribute to bugs, discovering the bug quickly is likely to identify these contributing factors, facilitating updating documentation and policies to prevent similar bugs in the future. We describe the problem space and potential benefits of continuous data testing, our vision for the technique, challenges we encountered, and our prototype implementation for PostgreSQL. The prototype's low overhead shows promise that continuous data testing can address the important problem of data debugging.

very large data bases | 2016

Scalable package queries in relational database systems

Matteo Brucato; Juan Felipe Beltran; Azza Abouzied; Alexandra Meliou

Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually. In this paper, we present package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets. We develop a full-fledged package query system, implemented on top of a traditional database engine. Our work makes several contributions. First, we design PaQL, a SQL-based query language that supports the declarative specification of package queries. We prove that PaQL is at least as expressive as integer linear programming, and therefore, evaluation of package queries is in general NP-hard. Second, we present a fundamental evaluation strategy that combines the capabilities of databases and constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation rules that transform a package query to an integer linear program. Third, we introduce an offline data partitioning strategy allowing query evaluation to scale to large data sizes. Fourth, we introduce SketchRefine, a scalable algorithm for package evaluation, with strong approximation guarantees ((1 ± ε)^6-factor approximation). Finally, we present extensive experiments over real-world and benchmark data. The results demonstrate that SketchRefine is effective at deriving high-quality package results, and achieves runtime performance that is an order of magnitude faster than directly using ILP solvers over large datasets.
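The difference between per-tuple and collective constraints can be illustrated with a small Python sketch. It is not PaQL, the ILP translation, or SketchRefine: it enumerates every candidate package by brute force, which is exponential and only meant to show the semantics. The function `best_package`, the fixed package size, and the meal data are all hypothetical assumptions.

```python
from itertools import combinations

def best_package(rows, size, base_pred, global_pred, objective):
    """Return the feasible package of `size` rows that minimizes `objective`.

    base_pred   : per-tuple constraint, as in a traditional query
    global_pred : constraint the package must satisfy collectively
    Brute-force enumeration; returns None if no package is feasible.
    """
    candidates = [r for r in rows if base_pred(r)]
    feasible = (p for p in combinations(candidates, size) if global_pred(p))
    return min(feasible, key=objective, default=None)

# Hypothetical data: pick three gluten-free meals totalling 1500 to 2000 kcal
# while minimizing total fat.
meals = [
    {"name": "oats", "kcal": 400, "fat": 8, "gluten_free": True},
    {"name": "pasta", "kcal": 700, "fat": 12, "gluten_free": False},
    {"name": "rice bowl", "kcal": 600, "fat": 10, "gluten_free": True},
    {"name": "salad", "kcal": 300, "fat": 5, "gluten_free": True},
    {"name": "steak", "kcal": 800, "fat": 30, "gluten_free": True},
]

package = best_package(
    meals,
    size=3,
    base_pred=lambda r: r["gluten_free"],
    global_pred=lambda p: 1500 <= sum(r["kcal"] for r in p) <= 2000,
    objective=lambda p: sum(r["fat"] for r in p),
)
chosen = sorted(r["name"] for r in package)
```

The calorie bound cannot be checked on any single meal, which is why the abstract notes that such queries are NP-hard in general and motivates handing them to a constraint solver instead of enumerating packages.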

international symposium on software testing and analysis | 2015

Preventing data errors with continuous testing

Kıvanç Muşlu; Yuriy Brun; Alexandra Meliou

Today, software systems that rely on data are ubiquitous, and ensuring the data's quality is an increasingly important challenge as data errors result in annual multi-billion dollar losses. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: identifying system errors caused by well-formed but incorrect data. We present continuous data testing (CDT), a low-overhead, delay-free technique that quickly identifies likely data errors. CDT continuously executes domain-specific test queries; when a test fails, CDT unobtrusively warns the user or administrator. We implement CDT in the ConTest prototype for the PostgreSQL database management system. A feasibility user study with 96 humans shows that ConTest was extremely effective in a setting with a data entry application at guarding against data errors: With ConTest, users corrected 98.4% of their errors, as opposed to 40.2% without, even when we injected 40% false positives into ConTest's output. Further, when using ConTest, users corrected data entry errors 3.2 times faster than when using state-of-the-art methods.
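The idea of re-running test queries as data changes can be sketched as follows. This is not ConTest: it uses Python's built-in `sqlite3` with an in-memory database instead of PostgreSQL, and it runs the tests synchronously after each update rather than in the background. The table, the test query, and the convention that a test fails when its query returns any row are assumptions made for illustration.

```python
import sqlite3

def run_tests(conn, tests):
    """Return the names of tests whose query finds at least one violating row."""
    failed = []
    for name, query in tests.items():
        if conn.execute(query).fetchone() is not None:
            failed.append(name)
    return failed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")

# Hypothetical domain-specific test: salaries must fall in a plausible range.
tests = {
    "salary_in_range":
        "SELECT name FROM employees WHERE salary < 0 OR salary > 1000000",
}

conn.execute("INSERT INTO employees VALUES ('Ann', 85000)")
failed_after_first = run_tests(conn, tests)

# A well-formed but incorrect value: one extra zero typed during data entry.
conn.execute("INSERT INTO employees VALUES ('Bob', 8500000)")
failed_after_second = run_tests(conn, tests)
```

The second insert is accepted by the schema yet trips the test immediately, which is the kind of early warning the abstract describes.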
