Dilys Thomas | Researchain

Publication

Featured research published by Dilys Thomas.

international conference on database theory | 2005

Anonymizing tables

Gagan Aggarwal; Tomás Feder; Krishnaram Kenthapadi; Rajeev Motwani; Rina Panigrahy; Dilys Thomas; An Zhu

We consider the problem of releasing tables from a relational database containing personal records, while ensuring individual privacy and maintaining data integrity to the extent possible. One of the techniques proposed in the literature is k-anonymization. A release is considered k-anonymous if the information for each person contained in the release cannot be distinguished from that of at least k − 1 other persons whose information also appears in the release. In the k-Anonymity problem the objective is to minimally suppress cells in the table so as to ensure that the released version is k-anonymous. We show that the k-Anonymity problem is NP-hard even when the attribute values are ternary. On the positive side, we provide an O(k)-approximation algorithm for the problem. This improves upon the previous best-known O(k log k)-approximation. We also give improved positive results for the interesting cases with specific values of k. In particular, we give a 1.5-approximation algorithm for the special case of 2-Anonymity, and a 2-approximation algorithm for 3-Anonymity.
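The definition in this abstract can be illustrated with a small sketch. It only checks whether a released table is k-anonymous and counts suppressed cells; it is not the paper's approximation algorithm. The function names, the `*` suppression marker, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

SUPPRESSED = "*"  # assumed marker for a suppressed cell


def is_k_anonymous(rows, k):
    """Return True if every row is identical to at least k - 1 other rows."""
    counts = Counter(tuple(row) for row in rows)
    return all(count >= k for count in counts.values())


def suppression_cost(original, released):
    """Count the cells that were replaced by the suppression marker."""
    return sum(
        1
        for orig_row, rel_row in zip(original, released)
        for orig_cell, rel_cell in zip(orig_row, rel_row)
        if rel_cell == SUPPRESSED and orig_cell != SUPPRESSED
    )


# Hypothetical table: (age, sex, zip code).
original = [
    ("34", "M", "94305"),
    ("34", "F", "94305"),
    ("51", "M", "94040"),
    ("51", "M", "94041"),
]

# One possible 2-anonymous release obtained by suppressing four cells.
released = [
    ("34", "*", "94305"),
    ("34", "*", "94305"),
    ("51", "M", "*"),
    ("51", "M", "*"),
]

if __name__ == "__main__":
    print(is_k_anonymous(original, 2))
    print(is_k_anonymous(released, 2))
    print(suppression_cost(original, released))
```

The optimization problem studied in the paper is to find a release such as `released` that minimizes `suppression_cost`, which is the part shown to be NP-hard.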

international conference on management of data | 2005

Privacy preserving OLAP

Rakesh Agrawal; Ramakrishnan Srikant; Dilys Thomas

We present techniques for privacy-preserving computation of multidimensional aggregates on data partitioned across multiple clients. Data from different clients is perturbed (randomized) in order to preserve privacy before it is integrated at the server. We develop formal notions of privacy obtained from data perturbation and show that our perturbation provides guarantees against privacy breaches. We develop and analyze algorithms for reconstructing counts of subcubes over perturbed data. We also evaluate the tradeoff between privacy guarantees and reconstruction accuracy and show the practicality of our approach.
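As a rough illustration of perturbing data at the clients and reconstructing a count at the server, the sketch below uses a simple retention-replacement scheme on a single Boolean attribute. The scheme, the retention probability `p`, and the function names are assumptions chosen for illustration; the abstract does not specify the perturbation or the reconstruction algorithm in this detail.

```python
import random


def perturb(bits, p, rng):
    """Keep each bit with probability p; otherwise replace it with a fair coin flip."""
    return [b if rng.random() < p else rng.randint(0, 1) for b in bits]


def estimate_count(perturbed, p):
    """Estimate the true number of 1s from the perturbed bits.

    If c is the true count and n the number of records, the expected
    observed count is p * c + (1 - p) * n / 2, which is solved for c here.
    """
    n = len(perturbed)
    observed = sum(perturbed)
    return (observed - n * (1 - p) / 2) / p


if __name__ == "__main__":
    rng = random.Random(0)
    true_bits = [1] * 30_000 + [0] * 70_000  # hypothetical client data
    noisy_bits = perturb(true_bits, p=0.5, rng=rng)
    print(sum(true_bits), round(estimate_count(noisy_bits, p=0.5)))
```

A smaller `p` hides individual values better but increases the variance of the estimate, which is the kind of tradeoff between privacy and reconstruction accuracy that the abstract refers to.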

very large data bases | 2004

Operator scheduling in data stream systems

Brian Babcock; Shivnath Babu; Mayur Datar; Rajeev Motwani; Dilys Thomas

In many applications involving continuous data streams, data arrival is bursty and data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have significant impact on the runtime system memory usage as well as output latency. Our aim is to design a scheduling strategy that minimizes the maximum runtime system memory while maintaining the output latency within prespecified bounds. We first present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing runtime memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams and multiple queries of the above types. However, during bursts in input streams, when there is a buildup of unprocessed tuples, Chain scheduling may lead to high output latency. We study the online problem of minimizing maximum runtime memory, subject to a constraint on maximum latency. We present preliminary observations, negative results, and heuristics for this problem. A thorough experimental evaluation is provided where we demonstrate the potential benefits of Chain scheduling and its different variants, compare it with competing scheduling strategies, and validate our analytical conclusions.
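The abstract does not describe how Chain assigns priorities, so the following is only a simplified sketch of one common reading of the idea: each operator path is summarized as a progress chart of (cumulative processing time, remaining tuple size) points, and operators are prioritized by the slope of the lower-envelope segment they fall on, so that memory is freed as quickly as possible. The function names and the sample charts are hypothetical.

```python
def lower_envelope(points):
    """Return indices of the progress-chart points on the lower envelope.

    points is a list of (time, size) pairs with strictly increasing time;
    point 0 is the tuple before any operator has run.
    """
    envelope = [0]
    i = 0
    last = len(points) - 1
    while i < last:
        t_i, s_i = points[i]
        # Pick the later point reached by the steepest descent from point i;
        # ties are broken in favor of the farthest point.
        best_j = max(
            range(i + 1, last + 1),
            key=lambda j: ((s_i - points[j][1]) / (points[j][0] - t_i), j),
        )
        envelope.append(best_j)
        i = best_j
    return envelope


def operator_priorities(points):
    """Give each operator the descent rate of the envelope segment covering it."""
    env = lower_envelope(points)
    priorities = [0.0] * (len(points) - 1)
    for a, b in zip(env, env[1:]):
        slope = (points[a][1] - points[b][1]) / (points[b][0] - points[a][0])
        for op in range(a, b):
            priorities[op] = slope
    return priorities


if __name__ == "__main__":
    # Hypothetical three-operator path: the first operator barely shrinks
    # the tuple, but the second one shrinks it sharply.
    chart = [(0, 1.0), (1, 0.9), (2, 0.1), (3, 0.0)]
    print(lower_envelope(chart))
    print(operator_priorities(chart))
```

In this sketch a scheduler would always run the ready operator with the highest priority. Note how the first operator in `chart` inherits the high priority of the segment it shares with the second operator, even though it frees little memory on its own.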

IEEE Transactions on Knowledge and Data Engineering | 2006

Generating Queries with Cardinality Constraints for DBMS Testing

Nicolas Bruno; Surajit Chaudhuri; Dilys Thomas

Good testing coverage of novel database techniques, such as multidimensional histograms or changes in the execution engine, is a complex problem. In this work, we argue that this task requires generating query instances, not randomly, but based on a given set of constraints. Specifically, obtaining query instances that satisfy cardinality constraints on their subexpressions is an important challenge. We show that this problem is inherently hard, and develop heuristics that effectively find approximate solutions.

very large data bases | 2004

Vision paper: enabling privacy for the paranoids

Gagan Aggarwal; Mayank Bawa; Prasanna Ganesan; Hector Garcia-Molina; Krishnaram Kenthapadi; Nina Mishra; Rajeev Motwani; Utkarsh Srivastava; Dilys Thomas; Jennifer Widom; Ying Xu

P3P [23, 24] is a set of standards that allow corporations to declare their privacy policies. Hippocratic Databases [6] have been proposed to implement such policies within a corporation's datastore. From an end-user individual's point of view, both of these rest on an uncomfortable philosophy of trusting corporations to protect his/her privacy. Recent history chronicles several episodes when such trust has been willingly or accidentally violated by corporations facing bankruptcy courts, civil subpoenas or lucrative mergers. We contend that data management solutions for information privacy must restore controls in the individual's hands. We suggest that enabling such control will require a radical re-think on modeling, release, and management of personal data.

international conference on data engineering | 2008

Auditing SQL Queries

Rajeev Motwani; Shubha U. Nabar; Dilys Thomas

We study the problem of auditing a batch of SQL queries: given a forbidden view of a database that should have been kept confidential, a batch of queries that were posed over this database and answered, and a definition of suspiciousness, determine if the query batch is suspicious with respect to the forbidden view. We consider several notions of suspiciousness that span a spectrum both in terms of their disclosure detection guarantees and the tractability of auditing under them for different classes of queries. We identify a particular notion of suspiciousness, weak syntactic suspiciousness, that allows for an efficient auditor for a large class of conjunctive queries. The auditor can be used together with a specific set of forbidden views to detect disclosures of the association between individuals and their private attributes. Further, it can also be used to prevent disclosures by auditing queries on the fly in an online setting. Finally, we tie in our work with recent research on query auditing and access control and relate the above definitions of suspiciousness to the notion of unconditional validity of a query introduced in the database access control literature.

Proceedings of the 4th International Workshop on Privacy and Anonymity in the Information Society | 2011

Distributing data for secure database services

Vignesh Ganapathy; Dilys Thomas; Tomás Feder; Hector Garcia-Molina; Rajeev Motwani

The advent of database services has resulted in privacy concerns on the part of the client storing data with third-party database service providers. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. A distributed architecture for secure database services is proposed as a solution to this problem, where data is stored at multiple servers. The distributed architecture provides both privacy and fault tolerance to the client. In this paper we provide algorithms for (1) distributing data, where our results include hardness-of-approximation results and hence a heuristic greedy algorithm for the distribution problem, and (2) partitioning the query at the client into queries for the servers, using a bottom-up state-based algorithm. Finally, the results at the servers are integrated to obtain the answer at the client. We provide an experimental validation and performance study of our algorithms.

international conference on data engineering | 2007

Auditing a Batch of SQL Queries

Rajeev Motwani; Shubha U. Nabar; Dilys Thomas

In this paper, we study the problem of auditing a batch of SQL queries: given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or group of individuals. In (Agrawal et al., 2004), the authors studied the problem of determining whether any single SQL query in isolation revealed information forbidden by the database system's data disclosure policies. In this paper, we extend this work to the problem of auditing a batch of SQL queries. We define two different notions of auditing, semantic auditing and syntactic auditing, and show that while syntactic auditing seems more desirable, it is in fact NP-hard to achieve. The problem of semantic auditing of a batch of SQL queries is, however, tractable and we give a polynomial time algorithm for this purpose.

ACM Transactions on Algorithms | 2007

Querying priced information in databases: The conjunctive case

Renato Carmo; Tomás Feder; Yoshiharu Kohayakawa; Eduardo Sany Laber; Rajeev Motwani; Liadan O'Callaghan; Rina Panigrahy; Dilys Thomas

Query optimization that involves expensive predicates has received considerable attention in the database community. Typically, the output to a database query is a set of tuples that satisfy certain conditions, and, with expensive predicates, these conditions may be computationally costly to verify. In the simplest case, when the query looks for the set of tuples that simultaneously satisfy k expensive predicates, the problem reduces to ordering the evaluation of the predicates so as to minimize the time to output the set of tuples comprising the answer to the query. We study different cases of the problem: the sequential case, in which a single processor is available to evaluate the predicates, and the distributed case, in which there are k processors available, each dedicated to a different attribute (column) of the database, and there is no communication cost between the processors. For the sequential case, we give a simple and fast deterministic k-approximation algorithm, and prove that k is the best possible approximation ratio for a deterministic algorithm, even if exponential time algorithms are allowed. We also propose a randomized, polynomial time algorithm with expected approximation ratio 1 + √2/2 ≈ 1.707 for k = 2, and prove that 3/2 is the best possible expected approximation ratio for randomized algorithms. We also show that given 0 ≤ ϵ ≤ 1, no randomized algorithm achieves approximation ratio smaller than 1 + ϵ with probability larger than (1 + ϵ)/2. For the distributed case, we consider two different models: the preemptive model, in which a processor is allowed to interrupt the evaluation of a predicate, and the nonpreemptive model, in which the evaluation of a predicate must be completed once started. We show that k is the best possible approximation ratio for a deterministic algorithm, even if exponential time algorithms are allowed. For the preemptive model, we introduce a polynomial time k-approximation algorithm. For the nonpreemptive model, we introduce a polynomial time O(k log² k)-approximation algorithm.

international conference on data engineering | 2012

Architecting the Database Access for a IT Infrastructure and Data Center Monitoring Tool

Pradeep Unde; Harrick M. Vin; Maitreya Natu; Vaishali Kulkarni; Dilys Thomas; Sreeram Vasudevan; Amruta Dhondage; Chinmay Jog; Shivam Sahai; Rekha Pathak

We present our experience in architecting the database access for a tool that analyzes IT infrastructure and data center utilization. We study solutions built on top of TSDB [2] and SQL Server. We provide a configuration-based, multithreaded approach in SQL Server based on Spring Batch [1]. Our architecture includes Spring Batch partitioning, a JDBC-R bridge rewrite, and optimized Transact-SQL procedures.
