Vivek R. Narasayya
Microsoft
Publications
Featured research published by Vivek R. Narasayya.
Communications of The ACM | 2011
Surajit Chaudhuri; Umeshwar Dayal; Vivek R. Narasayya
BI technologies are essential to running today's businesses, and the technology is undergoing sea change.
international conference on management of data | 2004
Sanjay Agrawal; Vivek R. Narasayya; Beverly Yang
In addition to indexes and materialized views, horizontal and vertical partitioning are important aspects of physical design in a relational database system that significantly impact performance. Horizontal partitioning also provides manageability; database administrators often require indexes and their underlying tables partitioned identically so as to make common operations such as backup/restore easier. While partitioning is important, incorporating partitioning makes the problem of automating physical design much harder since: (a) The choices of partitioning can strongly interact with choices of indexes and materialized views. (b) A large new space of physical design alternatives must be considered. (c) Manageability requirements impose a new constraint on the problem. In this paper, we present novel techniques for designing a scalable solution to this integrated physical design problem that takes both performance and manageability into account. We have implemented our techniques and evaluated them on Microsoft SQL Server. Our experiments highlight: (a) the importance of taking an integrated approach to automated physical design and (b) the scalability of our techniques.
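The manageability constraint the abstract describes can be made concrete with a small sketch. This is a minimal illustration, not the paper's algorithm: the names (Candidate, aligned, best_design) and the brute-force search are assumptions standing in for the paper's scalable techniques; only the alignment requirement, that indexes be partitioned identically to their underlying tables, follows the text.

```python
# Hypothetical sketch: pruning an integrated physical-design search so that
# every index candidate is "aligned", i.e. partitioned identically to its
# underlying table, as the manageability requirement demands.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Candidate:
    kind: str          # "index", "matview", or "partitioning"
    table: str
    partitioning: str  # e.g. the partitioning column, or "" for none

def aligned(design):
    """A design is manageable if each index on a table uses the same
    partitioning as the table's chosen horizontal partitioning."""
    table_part = {c.table: c.partitioning for c in design if c.kind == "partitioning"}
    return all(c.partitioning == table_part.get(c.table, "")
               for c in design if c.kind == "index")

def best_design(candidates, workload_cost, k):
    """Exhaustive search over designs of size <= k, keeping only aligned
    ones. A real tool replaces this brute force with greedy search and
    aggressive cost-based pruning."""
    best, best_cost = frozenset(), workload_cost(frozenset())
    for size in range(1, k + 1):
        for design in map(frozenset, combinations(candidates, size)):
            if aligned(design):
                cost = workload_cost(design)
                if cost < best_cost:
                    best, best_cost = design, cost
    return best, best_cost
```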
international conference on management of data | 1999
Surajit Chaudhuri; Rajeev Motwani; Vivek R. Narasayya
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree completely. We undertake a detailed study of this problem and attempt to analyze it in a variety of settings. We present theoretical results explaining the difficulty of this problem and setting limits on the efficiency that can be achieved. Based on new insights into the interaction between join and sampling, we develop join sampling techniques for the settings where our negative results do not apply. Our new sampling algorithms are significantly more efficient than those known earlier. We present an experimental evaluation of our techniques on Microsoft's SQL Server 7.0.
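One idea behind weighted join sampling can be sketched under assumptions: given frequency statistics on the join attribute of one relation, a uniform sample of the join can be drawn by weighting tuples of the other relation by their join fan-out, avoiding full join evaluation. The function names and in-memory representation below are illustrative; the paper's algorithms operate inside a database engine.

```python
# Sketch: draw a uniform sample of r1 JOIN r2 on `key` without computing the
# full join. Each r1 tuple is picked with probability proportional to its
# fan-out (number of matching r2 tuples), then paired with a uniformly random
# partner, so every join tuple is equally likely.
import random
from collections import defaultdict

def sample_join(r1, r2, key, k):
    """Return k join tuples sampled uniformly, with replacement.
    r1, r2: lists of dicts; key: the join attribute name."""
    partners = defaultdict(list)
    for t in r2:
        partners[t[key]].append(t)
    weights = [len(partners[t[key]]) for t in r1]   # join fan-out per r1 tuple
    if sum(weights) == 0:
        return []                                   # empty join result
    picks = random.choices(r1, weights=weights, k=k)
    return [(t, random.choice(partners[t[key]])) for t in picks]
```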
international conference on management of data | 1998
Surajit Chaudhuri; Rajeev Motwani; Vivek R. Narasayya
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.
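For readers unfamiliar with equi-height histograms, the sketch below shows the basic construction from a uniform sample. It is illustrative only: the paper's contributions, the optimal sampling bound and the adaptive page-level sampling, are not reproduced here, and num_buckets and the input sample are assumed given.

```python
# Sketch: equi-height (equi-depth) histogram from a uniform sample. Bucket
# boundaries are chosen so each bucket holds ~len(sample)/num_buckets sampled
# values; with enough samples these approximate the true column quantiles.
def equi_height_histogram(sample, num_buckets):
    values = sorted(sample)
    n = len(values)
    boundaries = [values[min(n - 1, (i * n) // num_buckets)]
                  for i in range(1, num_buckets)]
    return boundaries  # bucket i covers (boundaries[i-1], boundaries[i]]
```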
international conference on management of data | 1998
Surajit Chaudhuri; Vivek R. Narasayya
As databases get widely deployed, it becomes increasingly important to reduce the overhead of database administration. An important aspect of database administration that critically influences performance is the ability to select indexes for a database. In order to decide the right indexes for a database, it is crucial for the database administrator (DBA) to be able to perform a quantitative analysis of the existing indexes. Furthermore, the DBA should have the ability to propose hypothetical (“what-if”) indexes and quantitatively analyze their impact on performance of the system. Such impact analysis may consist of analyzing workloads over the database, estimating changes in the cost of a workload, and studying index usage while taking into account projected changes in the sizes of the database tables. In this paper we describe a novel index analysis utility that we have prototyped for Microsoft SQL Server 7.0. We describe the interfaces exposed by this utility that can be leveraged by a variety of front-end tools and sketch important aspects of the user interfaces enabled by the utility. We also discuss the implementation techniques for efficiently supporting “what-if” indexes. Our framework can be extended to incorporate analysis of other aspects of physical database design.
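The "what-if" pattern can be illustrated with a hypothetical API sketch. None of these names correspond to the actual SQL Server interfaces; the point is only that a hypothetical index is metadata plus statistics that the optimizer can cost plans against without the index ever being built.

```python
# Hypothetical sketch of a what-if session: indexes exist as metadata visible
# to costing only, never materialized as data pages.
from dataclasses import dataclass, field

@dataclass
class HypotheticalIndex:
    table: str
    columns: tuple                              # indexed column(s)
    stats: dict = field(default_factory=dict)   # sampled statistics only

class WhatIfSession:
    def __init__(self, optimizer_cost):
        self.optimizer_cost = optimizer_cost    # cost model: (query, indexes) -> cost
        self.indexes = []

    def create_hypothetical_index(self, table, columns):
        idx = HypotheticalIndex(table, tuple(columns))
        self.indexes.append(idx)                # no data is ever built
        return idx

    def estimate_workload_cost(self, workload):
        """Estimated total cost of the workload if the proposed indexes existed."""
        return sum(self.optimizer_cost(q, self.indexes) for q in workload)
```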
symposium on principles of database systems | 2000
Moses Charikar; Surajit Chaudhuri; Rajeev Motwani; Vivek R. Narasayya
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicates their effectiveness both on real data sets and on synthetic data sets.
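The flavor of the provably optimal estimator (called GEE, the Guaranteed-Error Estimator, in the paper) can be sketched as follows. The exact analysis and the heuristic variants are in the paper; the helper below is an illustration rather than a reference implementation.

```python
# Sketch of a GEE-style distinct-value estimate: scale up the count of values
# seen exactly once in the sample by sqrt(N/n), and count every value seen
# more than once exactly once, since such values are likely frequent.
import math
from collections import Counter

def gee_distinct_estimate(sample, table_size):
    """Estimate distinct values in a column of `table_size` rows from a
    nonempty uniform sample of that column."""
    freq = Counter(Counter(sample).values())  # freq[i] = #values seen exactly i times
    f1 = freq.get(1, 0)
    seen_multiple = sum(c for i, c in freq.items() if i >= 2)
    return math.sqrt(table_size / len(sample)) * f1 + seen_multiple
```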
international conference on data engineering | 2001
Surajit Chaudhuri; Gautam Das; Mayur Datar; Rajeev Motwani; Vivek R. Narasayya
We study the problem of approximately answering aggregation queries using sampling. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is skewed. To address this issue, we introduce a technique called outlier indexing. Uniform sampling is also ineffective for queries with low selectivity. We rely on weighted sampling based on workload information to overcome this shortcoming. We demonstrate that a combination of outlier indexing with weighted sampling can be used to answer aggregation queries with a significantly reduced approximation error compared to either uniform sampling or weighted sampling alone. We discuss the implementation of these techniques on Microsoft's SQL Server and present experimental results that demonstrate the merits of our techniques.
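A rough sketch of the outlier-indexing idea for a SUM aggregate follows. The outlier-selection rule used here (largest magnitudes) is a simplification of the paper's variance-based choice, the combination with workload-driven weighted sampling is omitted, and all names are illustrative.

```python
# Sketch: answer SUM exactly over a small set of outliers (kept in an
# "outlier index") and approximately, via scaled-up uniform sampling, over
# the well-behaved remainder. Skewed values no longer blow up the estimate.
import random

def approx_sum(values, num_outliers, sample_size):
    ranked = sorted(values, key=abs, reverse=True)
    outliers = ranked[:num_outliers]        # handled exactly
    rest = ranked[num_outliers:]            # handled by sampling
    exact_part = sum(outliers)
    sample = random.sample(rest, min(sample_size, len(rest)))
    scale = len(rest) / max(len(sample), 1)
    return exact_part + scale * sum(sample)
```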
ACM Transactions on Database Systems | 2007
Surajit Chaudhuri; Gautam Das; Vivek R. Narasayya
The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem where, given a workload of queries, we select a stratified random sample of the original data such that the error in answering the workload queries using the sample is minimized. A key novelty of our approach is that we can tailor the choice of samples to be robust, even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server that demonstrate the superior quality of our method compared to previous work.
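A hedged sketch of the workload-driven stratification idea: rows are grouped by which workload predicates select them, and the sample budget is split across the groups. The proportional allocation below is a simplification, since the paper instead solves an optimization problem to minimize workload error, and the names here are assumptions.

```python
# Sketch: stratified sampling keyed to a workload. Rows selected by the same
# subset of workload predicates form one stratum; each sampled row carries a
# per-stratum scale factor used when answering aggregates.
import random
from collections import defaultdict

def stratify(rows, workload_predicates):
    """Group rows by which workload predicates (callables row -> bool)
    select them."""
    strata = defaultdict(list)
    for row in rows:
        signature = tuple(p(row) for p in workload_predicates)
        strata[signature].append(row)
    return strata

def stratified_sample(strata, budget):
    """Proportional allocation of the sample budget across strata."""
    total = sum(len(rows) for rows in strata.values())
    sample = []
    for rows in strata.values():
        k = max(1, round(budget * len(rows) / total))
        picked = random.sample(rows, min(k, len(rows)))
        scale = len(rows) / len(picked)
        sample.extend((row, scale) for row in picked)
    return sample
```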
very large data bases | 2004
Sanjay Agrawal; Surajit Chaudhuri; Lubor J. Kollar; Arunprasad P. Marathe; Vivek R. Narasayya; Manoj Syamala
This chapter provides an overview of the Database Tuning Advisor's (DTA's) novel functionality and the rationale for its architecture, and demonstrates DTA's quality and scalability on large customer workloads. DTA is part of Microsoft SQL Server 2005. It is an automated physical database design tool that significantly advances the state of the art in several ways. First, DTA is capable of providing an integrated physical design recommendation for horizontal partitioning, indexes, and materialized views. Second, unlike today's physical design tools that focus solely on performance, DTA also supports the capability for a database administrator (DBA) to specify manageability requirements while optimizing for performance. Third, DTA is able to scale to large databases and workloads using several novel techniques, including workload compression, reduced statistics creation, and exploiting a test server to reduce load on the production server. Finally, DTA greatly enhances scriptability and customization through the use of a public XML schema for input and output.
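Of the scalability techniques listed, workload compression lends itself to a short sketch. This is an assumption-laden illustration rather than DTA's implementation: the signature function is a placeholder for DTA's actual, distance-based notion of query similarity.

```python
# Sketch of workload compression: tune against a few weighted representative
# queries instead of the full workload.
from collections import defaultdict

def compress_workload(queries, signature):
    """signature: maps a query to a key such that queries sharing a key would
    drive tuning toward the same physical design choices."""
    groups = defaultdict(list)
    for q in queries:
        groups[signature(q)].append(q)
    # one representative per group, weighted by how many queries it stands for
    return [(qs[0], len(qs)) for qs in groups.values()]
```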
international conference on management of data | 2001
Surajit Chaudhuri; Gautam Das; Vivek R. Narasayya
The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem whose goal is to minimize the error in answering queries in the given workload. A key novelty of our approach is that we can tailor the choice of samples to be robust even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server 2000 that demonstrate the superior quality of our method compared to previous work.