Publication


Featured research published by Sudeepa Roy.


Formal Methods | 2006

Tool for translating Simulink models into input language of a model checker

B. Meenakshi; Abhishek Bhatnagar; Sudeepa Roy

Model-Based Development (MBD) using MathWorks tools such as Simulink and Stateflow is being pursued at Honeywell for the development of safety-critical avionics software. Formal verification techniques are well known to identify design errors in safety-critical systems, reducing development cost and time. Until now, formal verification of Simulink design models has been carried out manually, resulting in excessive time consumption during the design phase. We present a tool that automatically translates certain Simulink models into the input language of a suitable model checker. Formal verification of safety-critical avionics components becomes faster and less error-prone with this tool. Support is also provided for reverse translation of traces violating requirements (as reported by the model checker) into Simulink notation for playback.


International Conference on Management of Data | 2014

A formal approach to finding explanations for database queries

Sudeepa Roy; Dan Suciu

As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high-level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools that assist users in explaining observed query outputs. In this paper we introduce a principled approach to providing explanations for answers to SQL queries, based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations that can interact with each other through foreign keys. First, we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then we give simple and efficient algorithms, based on SQL queries, that can compute the top-k explanations using standard database management systems under certain conditions. We evaluate the quality and performance of our approach through experiments on real datasets.
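The core idea of intervention can be illustrated with a brute-force sketch: score each candidate explanation by how much the query answer changes when the matching tuples are removed. This is only an illustration of the concept, not the paper's recursive-rule or SQL-based algorithms, and every table, attribute, and predicate name below is invented for the example.

```python
# Brute-force sketch of intervention-based explanation scoring.
# Hypothetical data and predicates; NOT the paper's algorithm.

def query(db):
    """Toy aggregate query: count of tuples with amount > 100."""
    return sum(1 for t in db if t["amount"] > 100)

def score_explanation(db, predicate):
    """Effect of the intervention: remove tuples matching `predicate`
    and measure how much the query answer changes."""
    reduced = [t for t in db if not predicate(t)]
    return query(db) - query(reduced)

db = [
    {"customer": "a", "region": "east", "amount": 250},
    {"customer": "b", "region": "east", "amount": 300},
    {"customer": "c", "region": "west", "amount": 80},
    {"customer": "d", "region": "west", "amount": 150},
]

# Candidate explanations are predicates over tuples.
candidates = {
    "region = east": lambda t: t["region"] == "east",
    "region = west": lambda t: t["region"] == "west",
}

# Top-k explanations: those whose removal changes the answer most.
ranked = sorted(candidates, key=lambda name: -score_explanation(db, candidates[name]))
print(ranked[0])  # region = east (removing it drops the count from 3 to 1)
```

The paper's contribution is doing this efficiently, including across multiple relations linked by foreign keys, rather than by the exhaustive re-evaluation shown here.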


Very Large Data Bases | 2014

Causality and explanations in databases

Alexandra Meliou; Sudeepa Roy; Dan Suciu

With the surge in the availability of information, there is a great demand for tools that assist users in understanding their data. While today's exploration tools rely mostly on data visualization, users often want to go deeper and understand the underlying causes of a particular observation. This tutorial surveys research on causality and explanation for data-oriented applications. We will review and summarize the research thus far on causality and explanation in the database and AI communities, giving researchers a snapshot of the current state of the art on this topic, and propose a unified framework as well as directions for future research. We will cover both the theory of causality/explanation and some applications; we also discuss the connections with other topics in database research such as provenance, deletion propagation, why-not queries, and OLAP techniques.


International Conference on Database Theory | 2011

Faster query answering in probabilistic databases using read-once functions

Sudeepa Roy; Vittorio Perduca; Val Tannen

A Boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking the read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of Boolean event expressions: those that are generated by conjunctive queries without self-joins on tuple-independent probabilistic databases. We first show that, given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can efficiently compute the co-occurrence graph of a result event expression without using its DNF. From this, the read-once form, if it exists, can already be computed efficiently using existing techniques. Our second and key contribution is a complete, efficient, and simple-to-implement algorithm for computing read-once forms (whenever they exist) directly, using a new concept, the co-table graph, which can be significantly smaller than the co-occurrence graph.
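The reason read-once form makes probability computation easy can be seen in a few lines: since no variable repeats, the two subtrees of any AND/OR node denote independent events, so their probabilities combine directly by the product rule. The expression encoding below (nested tuples) is an assumption made for this sketch; it is not notation from the paper.

```python
# Minimal sketch: polynomial-time probability of a read-once Boolean
# expression over independent variables. Hypothetical tuple encoding.

def prob(expr, p):
    """expr: a variable name, or ('and', l, r), or ('or', l, r),
    in read-once form (each variable appears exactly once).
    p: dict mapping each variable to its marginal probability."""
    if isinstance(expr, str):
        return p[expr]
    op, left, right = expr
    pl, pr = prob(left, p), prob(right, p)
    if op == "and":   # independent subtrees: P(A and B) = P(A) * P(B)
        return pl * pr
    else:             # op == "or": P(A or B) = 1 - (1 - P(A)) * (1 - P(B))
        return 1 - (1 - pl) * (1 - pr)

# (x1 and y1) or (x2 and y2): each variable appears exactly once.
e = ("or", ("and", "x1", "y1"), ("and", "x2", "y2"))
print(prob(e, {"x1": 0.5, "y1": 0.5, "x2": 0.5, "y2": 0.5}))  # 0.4375
```

The hard part, which the paper addresses, is deciding whether a given query lineage *has* a read-once form and finding it efficiently; once the form is in hand, evaluation is just this linear-time recursion.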


International Conference on Management of Data | 2010

An optimal labeling scheme for workflow provenance using skeleton labels

Zhuowei Bao; Susan B. Davidson; Sanjeev Khanna; Sudeepa Roy

We develop a compact and efficient reachability labeling scheme for answering provenance queries on workflow runs that conform to a given specification. Even though a workflow run can be structurally more complex and can be arbitrarily larger than the specification due to fork (parallel) and loop executions, we show that a compact reachability labeling for a run can be efficiently computed using the fact that it originates from a fixed specification. Our labeling scheme is optimal in the sense that it uses labels of logarithmic length, runs in linear time, and answers any reachability query in constant time. Our approach is based on using the reachability labeling for the specification as an effective skeleton for designing the reachability labeling for workflow runs. We also demonstrate empirically the effectiveness of our skeleton-based labeling approach.
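The kind of label the paper builds on can be sketched with the classic DFS-interval labeling for trees: each node gets an O(log n)-bit interval, and reachability reduces to constant-time interval containment. This is only the standard building block, not the paper's skeleton-based construction for workflow runs with forks and loops.

```python
# Standard interval-labeling sketch for tree reachability (a building
# block for such labeling schemes; not the paper's skeleton algorithm).

def label_tree(children, root):
    """Assign each node a half-open DFS interval [start, end).
    u reaches v iff u's interval contains v's."""
    labels, counter = {}, [0]
    def dfs(u):
        start = counter[0]
        counter[0] += 1
        for c in children.get(u, []):
            dfs(c)
        labels[u] = (start, counter[0])
    dfs(root)
    return labels

def reaches(labels, u, v):
    """Constant-time reachability test from the labels alone."""
    su, eu = labels[u]
    sv, ev = labels[v]
    return su <= sv and ev <= eu

tree = {"r": ["a", "b"], "a": ["c"]}   # r -> a -> c, r -> b
L = label_tree(tree, "r")
print(reaches(L, "r", "c"), reaches(L, "b", "c"))  # True False
```

The challenge the paper solves is keeping labels logarithmic and queries constant-time even when the run graph is arbitrarily larger than the specification; the trick is to derive run labels from a labeling of the fixed specification ("skeleton labels") rather than from the run alone.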


International Conference on Management of Data | 2014

Top-k and Clustering with Noisy Comparisons

Susan B. Davidson; Sanjeev Khanna; Tova Milo; Sudeepa Roy

We study the problems of max/top-k and clustering when the comparison operations may be performed by oracles whose answer may be erroneous. Comparisons may either be of type or of value: given two data elements, the answer to a type comparison is “yes” if the elements have the same type and therefore belong to the same group (cluster); the answer to a value comparison orders the two data elements. We give efficient algorithms that are guaranteed to achieve correct results with high probability, analyze the cost of these algorithms in terms of the total number of comparisons (i.e., using a fixed-cost model), and show that they are essentially the best possible. We also show that fewer comparisons are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases. Finally, we examine another important class of cost functions, concave functions, which balance the number of rounds of interaction with the oracle against the number of questions asked of the oracle. Results of this article form an important first step in providing a formal basis for max/top-k and clustering queries in crowdsourcing applications, that is, when the oracle is implemented using the crowd. We explain what simplifying assumptions are made in the analysis, what results carry over to a generalized crowdsourcing setting, and what extensions are required to support a full-fledged model.
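The basic device behind such guarantees is repetition: ask the noisy oracle the same value comparison several times and take a majority vote, driving the per-comparison error probability down exponentially. The sketch below shows that device in a simple tournament max; the paper's algorithms and their cost/round analysis are considerably more refined, and the error rate and repetition count here are arbitrary choices for illustration.

```python
# Majority-vote sketch of max-finding with a noisy comparison oracle.
# Illustrative parameters only; not the paper's algorithm or bounds.
import random

def noisy_less(a, b, err=0.1):
    """Value-comparison oracle that errs with probability `err`."""
    truth = a < b
    return truth if random.random() > err else not truth

def robust_less(a, b, reps=25, err=0.1):
    """Repeat the noisy comparison and take a majority vote; with 25
    repetitions at a 10% error rate, a wrong majority is vanishingly rare."""
    votes = sum(noisy_less(a, b, err) for _ in range(reps))
    return votes * 2 > reps

def noisy_max(items):
    """Linear-scan tournament max using the majority-vote comparator."""
    best = items[0]
    for x in items[1:]:
        if robust_less(best, x):
            best = x
    return best

random.seed(0)
print(noisy_max([3, 17, 8, 42, 29, 11]))
```

Each of the n-1 comparisons costs `reps` oracle calls, which is exactly the kind of total-comparison cost the fixed-cost model in the article accounts for; the article also studies when correlation or distance-dependent error lets one spend fewer repetitions.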


International Conference on Management of Data | 2010

Privacy issues in scientific workflow provenance

Susan B. Davidson; Sanjeev Khanna; Sudeepa Roy; Sarah Cohen Boulakia

A scientific workflow often deals with proprietary modules as well as private or confidential data, such as health or medical information. Hence, providing exact answers to provenance queries over all executions of the workflow may reveal private information. In this paper we first study the potential privacy issues in a scientific workflow (module privacy, data privacy, and provenance privacy) and frame several natural questions: (i) can we formally analyze module, data, or provenance privacy, giving provable privacy guarantees for an unlimited/bounded number of provenance queries? (ii) how can we answer provenance queries, providing as much information as possible to the user while still guaranteeing the required privacy? We then look at module privacy in detail and propose a formal model from our recent work in [11]. Finally, we point to several directions for future work.


International Conference on Database Theory | 2013

A propagation model for provenance views of public/private workflows

Susan B. Davidson; Tova Milo; Sudeepa Roy

We study the problem of concealing the functionality of a proprietary or private module when provenance information is shown over repeated executions of a workflow which contains both public and private modules. Our approach is to use provenance views to hide carefully chosen subsets of data over all executions of the workflow to ensure Γ-privacy: for each private module and each input x, the module's output f(x) is indistinguishable from Γ-1 other possible values given the visible data in the workflow executions. We show that Γ-privacy cannot be achieved simply by combining solutions for individual private modules; data hiding must also be propagated through public modules. We then examine how much additional data must be hidden and when it is safe to stop propagating data hiding. The answer depends strongly on the workflow topology as well as the behavior of public modules on the visible data. In particular, for a class of workflows (which includes the common tree and chain workflows), taking private solutions for each private module, augmented with a public closure that is upstream-downstream safe, ensures Γ-privacy. We define these notions formally and show that the restrictions are necessary. We also study the related optimization problem of minimizing the amount of hidden data.


ACM Transactions on Database Systems | 2017

Exact Model Counting of Query Expressions: Limitations of Propositional Methods

Paul Beame; Jerry Li; Sudeepa Roy; Dan Suciu

We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms—algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be seen, either directly or indirectly, as building Decision-Decomposable Negation Normal Form (decision-DNNF) representations of the input Boolean formulas. Decision-DNNFs are a special case of d-DNNFs where d stands for deterministic. We show that any knowledge compilation representations from a class (called DLDDs in this article) that contain decision-DNNFs can be converted into equivalent Free Binary Decision Diagrams (FBDDs), also known as Read-Once Branching Programs, with only a quasi-polynomial increase in representation size. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs, which imply exponential lower bounds for model-counting algorithms. We also separate the power of decision-DNNFs from d-DNNFs and a generalization of decision-DNNFs known as AND-FBDDs. We then prove new lower bounds for FBDDs that yield exponential lower bounds on the running time of these exact model counters when applied to the problem of query evaluation in tuple-independent probabilistic databases—computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. This approach to the query evaluation problem, in which one first obtains the lineage for the query and database instance as a Boolean formula and then performs weighted model counting on the lineage, is known as grounded inference. A second approach, known as lifted inference or extensional query evaluation, exploits the high-level structure of the query as a first-order formula. 
Although it has been widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this article, we show such a formal separation for the first time. In particular, we exhibit a family of database queries for which polynomial-time extensional query evaluation techniques were previously known but for which query evaluation via grounded inference using the state-of-the-art exact model counters requires exponential time.


International Conference on Management of Data | 2013

Provenance-based dictionary refinement in information extraction

Sudeepa Roy; Laura Chiticariu; Vitaly Feldman; Frederick R. Reiss; Huaiyu Zhu

Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.

Collaboration

Top co-authors of Sudeepa Roy:

Sanjeev Khanna (University of Pennsylvania)
Susan B. Davidson (University of Pennsylvania)
Dan Suciu (University of Washington)
Jerry Li (Massachusetts Institute of Technology)
Paul Beame (University of Washington)
Val Tannen (University of Pennsylvania)