Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Sumit Ganguly is active.

Publication


Featured research published by Sumit Ganguly.


international conference on management of data | 1996

Bifocal sampling for skew-resistant join size estimation

Sumit Ganguly; Phillip B. Gibbons; Yossi Matias; Abraham Silberschatz

This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√n lg n). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
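The sparse/dense split above can be illustrated with a small sketch. This is a simplified, hypothetical rendition of the idea (not the paper's exact procedure): tuples whose join value occurs frequently in the sample are treated as dense and their frequencies are estimated directly, while the remaining sparse contribution is scaled up from a uniform sample.

```python
import random
from collections import Counter

def bifocal_estimate(r, s, sample_frac=0.1, dense_threshold=10):
    """Illustrative bifocal-style join-size estimate (simplified sketch).

    r, s: lists of join-attribute values. Values occurring at least
    dense_threshold times in a sample are classified as dense; the
    dense-dense subjoin uses scaled-up sampled frequencies, and the
    remaining (sparse) pairs are scaled up from the sample directly.
    """
    sr = [v for v in r if random.random() < sample_frac]
    ss = [v for v in s if random.random() < sample_frac]
    cr, cs = Counter(sr), Counter(ss)
    dense_r = {v for v, c in cr.items() if c >= dense_threshold}
    dense_s = {v for v, c in cs.items() if c >= dense_threshold}

    est = 0.0
    # Dense-dense: scale each sampled frequency back up, then multiply.
    for v in dense_r & dense_s:
        est += (cr[v] / sample_frac) * (cs[v] / sample_frac)
    # All remaining joining pairs (sparse in at least one relation),
    # folded together here for brevity:
    sparse_pairs = sum(cr[v] * cs[v] for v in cr.keys() & cs.keys()
                       if v not in dense_r or v not in dense_s)
    est += sparse_pairs / (sample_frac * sample_frac)
    return est
```

With sample_frac=1.0 the "estimate" degenerates to the exact join size, which makes the decomposition into dense and sparse subjoins easy to check.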


international conference on management of data | 1990

A framework for the parallel processing of Datalog queries

Sumit Ganguly; Abraham Silberschatz; Shalom Tsur

This paper presents several complementary methods for the parallel, bottom-up evaluation of Datalog queries. We introduce the notion of a discriminating predicate, based on hash functions, that partitions the computation between the processors in order to achieve parallelism. A parallelization scheme with the property of non-redundant computation (no duplication of computation by processors) is then studied in detail. The mapping of Datalog programs onto a network of processors, such that the result is a non-redundant computation, is also studied. The methods reported in this paper clearly demonstrate the trade-offs between redundancy and interprocessor communication for this class of problems.
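A minimal sketch of the hash-based partitioning idea, under assumed names (hash_partition, tc_on_share are illustrative, not from the paper): each fact is assigned to exactly one processor by hashing a discriminating argument, so the shares are disjoint and no processor repeats another's derivations.

```python
def hash_partition(facts, num_procs, key=lambda t: t[0]):
    """Assign each fact to exactly one processor by hashing a
    discriminating argument; disjoint shares give non-redundant work."""
    shares = [[] for _ in range(num_procs)]
    for fact in facts:
        shares[hash(key(fact)) % num_procs].append(fact)
    return shares

def tc_on_share(edges_share, all_edges):
    """Semi-naive bottom-up evaluation of transitive closure on one share:
    tc(X,Y) :- share_edge(X,Y).  tc(X,Y) :- tc(X,Z), edge(Z,Y).
    Each processor extends only the paths that start in its own share."""
    tc = set(edges_share)
    delta = set(edges_share)
    while delta:
        new = {(x, w) for (x, z) in delta
                      for (z2, w) in all_edges if z == z2} - tc
        tc |= new
        delta = new
    return tc
```

Since every path from a node x begins with an edge owned by x's processor, the union of the per-share results is the full transitive closure, with no derivation duplicated across processors.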


symposium on principles of database systems | 1991

Minimum and maximum predicates in logic programming

Sumit Ganguly; Sergio Greco; Carlo Zaniolo

A novel approach is proposed for expressing and computing efficiently a large class of problems, including finding the shortest path in a graph, that were previously considered impervious to an efficient treatment in the declarative framework of logic-based languages. Our approach is based on the use of min and max predicates having a first-order semantics defined using rules with negation in their bodies. We show that when certain monotonicity conditions hold, then (1) there exists a total well-founded model for these programs containing negation, (2) this model can be computed efficiently using a procedure called greedy fixpoint, and (3) the original program can be rewritten into a more efficient one by pushing min and max predicates into recursion. The greedy fixpoint evaluation of the program expressing the shortest path problem coincides with Dijkstra's algorithm.
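The claimed coincidence with Dijkstra's algorithm can be seen in a small sketch: the greedy fixpoint repeatedly commits the smallest tentative value of the min predicate, which is exactly Dijkstra's settling step. Function and variable names here are illustrative, not from the paper.

```python
import heapq

def greedy_fixpoint_shortest_paths(edges, source):
    """Greedy fixpoint for a min-predicate shortest-path program.

    edges: dict mapping node -> list of (neighbor, weight) pairs,
    with non-negative weights. Committing the minimum tentative
    distance at each step is sound because the min aggregate is
    monotone here -- this is precisely Dijkstra's algorithm.
    """
    dist = {source: 0}
    frontier = [(0, source)]
    settled = set()
    while frontier:
        d, u = heapq.heappop(frontier)
        if u in settled:
            continue
        settled.add(u)  # greedily finalize the current minimum
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(frontier, (nd, v))
    return dist
```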


extending database technology | 2004

Processing Data-Stream Join Aggregates Using Skimmed Sketches

Sumit Ganguly; Minos N. Garofalakis; Rajeev Rastogi

There is a growing interest in on-line algorithms for analyzing and querying data streams that examine each stream element only once and have only a limited amount of memory at their disposal. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques.
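The skimming step can be sketched in simplified form. This is not the paper's algorithm, only an illustration of the decomposition it describes: join size is the inner product of the two frequency vectors; dense frequencies are skimmed off and their subjoin computed exactly, while the sparse residual is estimated with AMS-style random ±1 sketches.

```python
import random
from collections import Counter

def skimmed_join_estimate(stream_r, stream_s, threshold=5,
                          num_sketches=50, seed=0):
    """Simplified illustration of the skimming idea (not the paper's
    exact algorithm). Join size = sum over v of f_r(v) * f_s(v)."""
    fr, fs = Counter(stream_r), Counter(stream_s)
    # Skim: values that are dense in either stream are handled exactly.
    dense = {v for v in set(fr) | set(fs)
             if fr[v] >= threshold or fs[v] >= threshold}
    exact = sum(fr[v] * fs[v] for v in dense)            # dense subjoin
    rr = {v: c for v, c in fr.items() if v not in dense}  # skimmed residual
    rs = {v: c for v, c in fs.items() if v not in dense}

    # Estimate the residual inner product with random +/-1 projections.
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_sketches):
        signs = {}
        sgn = lambda v: signs.setdefault(v, rng.choice((-1, 1)))
        xr = sum(c * sgn(v) for v, c in rr.items())
        xs = sum(c * sgn(v) for v, c in rs.items())
        estimates.append(xr * xs)
    estimates.sort()
    residual = estimates[len(estimates) // 2]  # median for robustness
    return exact + residual
```

Removing the heavy (dense) frequencies before sketching is what tames the variance of the ±1 projection estimator, which is the intuition the abstract appeals to.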


very large data bases | 2002

Optimizing view queries in ROLEX to support navigable result trees

Philip Bohannon; Sumit Ganguly; Henry F. Korth; P. P. S. Narayan; Pradeep Shenoy

An increasing number of applications use XML data published from relational databases. For speed and convenience, such applications routinely cache this XML data locally and access it through standard navigational interfaces such as DOM, sacrificing the consistency and integrity guarantees provided by a DBMS. The ROLEX system is being built to extend the capabilities of relational database systems to deliver fast, consistent and navigable XML views of relational data to an application via a virtual DOM interface. This interface translates navigation operations on a DOM tree into execution-plan actions, allowing a spectrum of possibilities for lazy materialization. The ROLEX query optimizer uses a characterization of the navigation behavior of an application, and optimizes view queries to minimize the expected cost of that navigation. This paper presents the architecture of ROLEX, including its model of query execution and the query optimizer. We demonstrate with a performance study the advantages of the ROLEX approach and the importance of optimizing query execution for navigation.


very large data bases | 2004

Tracking set-expression cardinalities over continuous update streams

Sumit Ganguly; Minos N. Garofalakis; Rajeev Rastogi

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both routers R1 and R2 but not router R3?". Earlier work only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only space that is significantly sublinear in the sizes of the streaming input (multi-)sets. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal.
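To make the query class concrete, here is an exact, linear-space baseline for the example query above (the paper's contribution is answering such queries in sublinear space with sketches; this sketch only fixes the semantics, and all names are illustrative):

```python
from collections import Counter

def net_multiset(updates):
    """Fold a stream of (op, item) updates, with op = +1 for insert and
    -1 for delete, into the set of items with positive net count."""
    c = Counter()
    for op, item in updates:
        c[item] += op
    return {v for v, n in c.items() if n > 0}

def set_expression_cardinality(r1, r2, r3):
    """|(R1 intersect R2) \ R3| -- e.g., distinct IP sources seen at both
    routers R1 and R2 but not at R3, over three update streams."""
    return len((net_multiset(r1) & net_multiset(r2)) - net_multiset(r3))
```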
Finally, we propose an optimized, time-efficient stream synopsis (based on 2-level hash sketches) that provides similar, strong accuracy-space guarantees while requiring only guaranteed logarithmic maintenance time per update, thus making our methods applicable for truly rapid-rate data streams. Our results from an empirical study of our synopsis and estimation techniques verify the effectiveness of our approach.


symposium on principles of database systems | 1992

Greedy by choice

Sergio Greco; Carlo Zaniolo; Sumit Ganguly

The greedy paradigm of algorithm design is a well known tool used for efficiently solving many classical computational problems within the framework of procedural languages. However, it is very difficult to express these algorithms within the declarative framework of logic-based languages. In this paper, we extend the framework of Datalog-like languages to provide simple and declarative formulations of such problems, with computational complexities comparable to those of procedural formulations. This is achieved through the use of constructs, such as least and choice, that have semantics reducible to that of negative programs under stable model semantics. We then show that the formulation of greedy algorithms using these constructs leads to a syntactic class of programs, called stage-stratified programs, that are easily recognized at compile time. The fixpoint-based implementation of these recursive programs is very efficient and, combined with suitable storage structures, yields asymptotic complexities comparable to those obtained using procedural languages.


Journal of Logic Programming | 1992

Parallel bottom-up processing of Datalog queries

Sumit Ganguly; Abraham Silberschatz; Shalom Tsur

This paper presents several complementary methods for the parallel, bottom-up evaluation of Datalog queries. We introduce the notion of a discriminating predicate, based on hash functions, that partitions the computation between the processors in order to achieve parallelism. A parallelization scheme with the property of nonredundant computation (no duplication of computation by processors) is then studied in detail. The mapping of Datalog programs onto a network of processors, such that the result is a nonredundant computation, is also studied.


symposium on principles of database systems | 1996

Efficient and accurate cost models for parallel query optimization (extended abstract)

Sumit Ganguly; Akshay Goel; Avi Silberschatz



symposium on principles of database systems | 2002

On the complexity of approximate query optimization

Sourav Chatterji; Sai Surya Kiran Evani; Sumit Ganguly; Mahesh Datt Yemmanuru

In this work, we study the complexity of the problem of approximate query optimization. We show that, for any Δ > 0, the problem of finding a join order sequence whose cost is within a factor 2^Θ(log^(1-Δ)(K)) of K, where K is the cost of the optimal join order sequence, is NP-hard. The complexity gap remains if the number of edges in the query graph is constrained to be a given function e(n) of the number of vertices n of the query graph, where n(n-1)/2 - Θ(n^τ) ≥ e(n) ≥ n + Θ(n^τ) and τ is any constant between 0 and 1. These results show that, unless P=NP, the query optimization problem cannot be approximately solved by an algorithm that runs in polynomial time and has a competitive ratio that is within some polylogarithmic factor of the optimal cost.

Collaboration


Dive into Sumit Ganguly's collaboration.

Top Co-Authors

Carlo Zaniolo

University of California


Pradeep Shenoy

University of Washington


Minos N. Garofalakis

Technical University of Crete


Avi Silberschatz

University of Texas at Austin
