David J. DeWitt
Microsoft
Publications
Featured research published by David J. DeWitt.
international conference on management of data | 2000
Jianjun Chen; David J. DeWitt; Feng Tian; Yuan Wang
Continuous queries are persistent queries that allow users to receive new results when they become available. While continuous query systems can transform a passive web into an active environment, they need to be able to support millions of queries due to the scale of the Internet. No existing systems have achieved this level of scalability. NiagaraCQ addresses this problem by grouping continuous queries based on the observation that many web queries share similar structures. Grouped queries can share common computation, tend to fit in memory, and can reduce I/O cost significantly. Furthermore, grouping on selection predicates can eliminate a large number of unnecessary query invocations. Our grouping technique is distinguished from previous group optimization approaches in the following ways. First, we use an incremental group optimization strategy with dynamic re-grouping. New queries are added to existing query groups without having to regroup already installed queries. Second, we use a query-split scheme that requires minimal changes to a general-purpose query engine. Third, NiagaraCQ groups both change-based and timer-based queries in a uniform way. To ensure that NiagaraCQ is scalable, we have also employed other techniques, including incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and memory caching. This paper presents the design of the NiagaraCQ system and gives some experimental results on the system's performance and scalability.
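The predicate-grouping idea is easy to picture with a small sketch. The Python fragment below is illustrative only: the class and method names are invented, and NiagaraCQ itself was built on the Niagara query engine, not shown here. Queries that share the same selection signature are folded into one group whose constants live in a table, so installing a new query never requires regrouping the queries already present, and one pass over newly arrived data serves every query in a group.

```python
# Minimal sketch (not NiagaraCQ's code) of grouping continuous queries by
# selection-predicate signature. All names here are illustrative.
from collections import defaultdict

class ContinuousQuery:
    def __init__(self, qid, attribute, op, constant, callback):
        self.qid = qid
        self.attribute = attribute      # e.g. "symbol"
        self.op = op                    # e.g. "=="
        self.constant = constant        # e.g. "MSFT"
        self.callback = callback        # invoked when new results arrive

class QueryGroupTable:
    """Queries with the same (attribute, op) signature share one group; their
    constants live in a table, so adding a query never regroups the others."""
    def __init__(self):
        self.groups = defaultdict(dict)  # (attribute, op) -> {constant: [queries]}

    def add_query(self, q):
        self.groups[(q.attribute, q.op)].setdefault(q.constant, []).append(q)

    def on_new_tuples(self, tuples):
        # One pass over the change data serves every installed query.
        for (attribute, op), const_table in self.groups.items():
            if op != "==":
                continue  # only equality predicates in this sketch
            for t in tuples:
                for q in const_table.get(t.get(attribute), []):
                    q.callback(t)

table = QueryGroupTable()
table.add_query(ContinuousQuery(1, "symbol", "==", "MSFT",
                                lambda t: print("query 1 matched", t)))
table.add_query(ContinuousQuery(2, "symbol", "==", "ORCL",
                                lambda t: print("query 2 matched", t)))
table.on_new_tuples([{"symbol": "MSFT", "price": 410.0}])
```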
international conference on management of data | 2009
Andrew Pavlo; Erik Paulson; Alexander Rasin; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Michael Stonebraker
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
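To give a feel for the two styles being compared, the toy fragment below restates one of the simplest benchmark-style tasks (a pattern-matching scan) once as a map function and once as a declarative query. It is purely illustrative and single-process; the paper's actual benchmark ran Hadoop and two parallel DBMSs on a 100-node cluster, and the names here are invented.

```python
# Illustrative only: a toy, single-process rendition of a "grep"-style scan,
# written once in MapReduce style and once as the SQL a DBMS would run.
import re
from itertools import chain

records = ["abc XYZ def", "no match here", "XYZ again"]

# MapReduce style: the map function emits matching records; reduce is identity.
def map_fn(record):
    return [record] if re.search(r"XYZ", record) else []

mr_result = list(chain.from_iterable(map_fn(r) for r in records))

# Declarative style: the same task expressed as a query.
sql = "SELECT * FROM records WHERE field LIKE '%XYZ%';"

print(mr_result)   # ['abc XYZ def', 'XYZ again']
print(sql)
```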
international conference on management of data | 2001
Chun Zhang; Jeffrey F. Naughton; David J. DeWitt; Qiong Luo; Guy M. Lohman
Virtually all proposals for querying XML include a class of query we term “containment queries”. It is also clear that in the foreseeable future, a substantial amount of XML data will be stored in relational database systems. This raises the question of how to support these containment queries. The inverted list technology that underlies much of Information Retrieval is well-suited to these queries, but should we implement this technology (a) in a separate loosely-coupled IR engine, or (b) using the native tables and query execution machinery of the RDBMS? With option (b), more than twenty years of work on RDBMS query optimization, query execution, scalability, and concurrency control and recovery immediately extend to the queries and structures that implement these new operations. But all this will be irrelevant if the performance of option (b) lags that of (a) by too much. In this paper, we explore some performance implications of both options using native implementations in two commercial relational database systems and in a special purpose inverted list engine. Our performance study shows that while RDBMSs are generally poorly suited for such queries, under certain conditions they can outperform an inverted list engine. Our analysis further identifies two significant causes that differentiate the performance of the IR and RDBMS implementations: the join algorithms employed and the hardware cache utilization. Our results suggest that, contrary to most expectations, with some modifications a native implementation in an RDBMS can support this class of query much more efficiently.
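Option (b) is easy to sketch: the inverted lists become ordinary tables and a containment query becomes a join on position ranges. The fragment below uses sqlite3 only as a stand-in RDBMS, and the table and column names approximate the paper's scheme rather than reproduce it.

```python
# Minimal, self-contained sketch of storing element and word positions as
# inverted lists in relational tables and expressing a containment query as a
# join. Column names are illustrative, not taken from the paper.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE elements (term TEXT, docno INT, startpos INT, endpos INT)")
cur.execute("CREATE TABLE words (term TEXT, docno INT, wordno INT)")

# <book><title>database systems</title></book>, with positions as word offsets
cur.executemany("INSERT INTO elements VALUES (?,?,?,?)",
                [("book", 1, 0, 5), ("title", 1, 1, 4)])
cur.executemany("INSERT INTO words VALUES (?,?,?)",
                [("database", 1, 2), ("systems", 1, 3)])

# Containment query: title elements that contain the word "database".
cur.execute("""
    SELECT e.docno, e.startpos, e.endpos
    FROM elements e JOIN words w
      ON e.docno = w.docno AND w.wordno > e.startpos AND w.wordno < e.endpos
    WHERE e.term = 'title' AND w.term = 'database'
""")
print(cur.fetchall())   # [(1, 1, 4)]
```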
international conference on management of data | 1984
David J. DeWitt; Randy H. Katz; Frank Olken; Leonard D. Shapiro; Michael Stonebraker; David A. Wood
With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in main memory. In this paper we consider the changes necessary to permit a relational database system to take advantage of large amounts of main memory. We evaluate AVL vs. B+-tree access methods for main memory databases, hash-based query processing strategies vs. sort-merge, and study recovery issues when most or all of the database fits in main memory. As expected, B+-trees are the preferred storage mechanism unless more than 80-90% of the database fits in main memory. A somewhat surprising result is that hash-based query processing strategies are advantageous for large memory situations.
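The two query-processing strategies being compared can be sketched in a few lines. The code below is written for clarity rather than performance; the relation contents and function names are invented for illustration and are not from the paper.

```python
# Illustrative sketch: a classic in-memory hash join versus a sort-merge join.
from collections import defaultdict

R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]   # (key, payload)
S = [(2, "x"), (3, "y"), (4, "z")]

def hash_join(r, s):
    # Build a hash table on one relation, then probe it with the other.
    table = defaultdict(list)
    for k, v in r:
        table[k].append(v)
    return [(k, v, w) for k, w in s for v in table.get(k, [])]

def sort_merge_join(r, s):
    # Sort both inputs on the join key, then merge them in lockstep.
    r, s = sorted(r), sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            k, i2 = r[i][0], i
            while i2 < len(r) and r[i2][0] == k:
                j2 = j
                while j2 < len(s) and s[j2][0] == k:
                    out.append((k, r[i2][1], s[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

print(hash_join(R, S))        # [(2, 'b', 'x'), (2, 'c', 'x'), (4, 'd', 'z')]
print(sort_merge_join(R, S))  # same pairs, produced in key order
```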
IEEE Transactions on Knowledge and Data Engineering | 1990
David J. DeWitt; Shahram Ghandeharizadeh; Donovan A. Schneider; Allan Bricker; Hui-I Hsiao; Rick Rasmussen
The design of the Gamma database machine and the techniques employed in its implementation are described. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, parallel algorithms based on hashing are used to implement the complex relational operators, such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques, it is possible to control the execution of very complex queries with minimal coordination. The design of the Gamma software is described and a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is presented.
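The first two ideas can be sketched compactly: hash-partition each relation across nodes on the join attribute, then let each node join only its own bucket. The toy code below is sequential and purely illustrative; it is not Gamma's implementation, and the names are invented.

```python
# Toy sketch of horizontal partitioning plus a hash-based "parallel" join,
# where both inputs are repartitioned on the join attribute so each node
# joins only its own bucket.
NUM_NODES = 4

def partition(relation, key_index=0, nodes=NUM_NODES):
    """Hash-partition tuples across nodes on the given attribute."""
    buckets = [[] for _ in range(nodes)]
    for t in relation:
        buckets[hash(t[key_index]) % nodes].append(t)
    return buckets

def local_hash_join(r_part, s_part):
    # Standard build/probe hash join on one node's bucket.
    table = {}
    for t in r_part:
        table.setdefault(t[0], []).append(t)
    return [(rt, st) for st in s_part for rt in table.get(st[0], [])]

R = [(i, f"r{i}") for i in range(10)]
S = [(i, f"s{i}") for i in range(0, 10, 2)]

# Repartition both relations on the join key, then join each bucket per "node".
r_parts, s_parts = partition(R), partition(S)
result = [pair for n in range(NUM_NODES)
          for pair in local_hash_join(r_parts[n], s_parts[n])]
print(sorted(result))
```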
international conference on management of data | 1994
Michael J. Carey; David J. DeWitt; Michael J. Franklin; Nancy Hall; Mark L. McAuliffe; Jeffrey F. Naughton; Daniel T. Schuh; Marvin H. Solomon; C. K. Tan; Odysseas G. Tsatalos; Seth J. White; Michael J. Zwilling
SHORE (Scalable Heterogeneous Object REpository) is a persistent object system under development at the University of Wisconsin. SHORE represents a merger of object-oriented database and file system technologies. In this paper we give the goals and motivation for SHORE, and describe how SHORE provides features of both technologies. We also describe some novel aspects of the SHORE architecture, including a symmetric peer-to-peer server architecture, server customization through an extensible value-added server facility, and support for scalability on multiprocessor systems. An initial version of SHORE is already operational, and we expect a release of Version 1 in mid-1994.
international conference on management of data | 2005
Jim Gray; David T. Liu; Maria A. Nieto-Santisteban; Alexander S. Szalay; David J. DeWitt; Gerd Heber
Scientific instruments and computer simulations are creating vast data stores that require new scientific methods to analyze and organize the data. Data volumes are approximately doubling each year. Since these new instruments have extraordinary precision, the data quality is also rapidly improving. Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Communications of The ACM | 2010
Michael Stonebraker; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Erik Paulson; Andrew Pavlo; Alexander Rasin
MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.
international conference on management of data | 1989
Donovan A. Schneider; David J. DeWitt
In this paper we analyze and compare four parallel join algorithms. Grace and Hybrid hash represent the class of hash-based join methods, Simple hash represents a looping algorithm with hashing, and our last algorithm is the more traditional sort-merge. The performance of each of the algorithms with different tuple distribution policies, the addition of bit vector filters, varying amounts of main-memory for joining, and non-uniformly distributed join attribute values is studied. The Hybrid hash-join algorithm is found to be superior except when the join attribute values of the inner relation are non-uniformly distributed and memory is limited. In this case, a more conservative algorithm such as the sort-merge algorithm should be used. The Gamma database machine serves as the host for the performance comparison.
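The bit-vector-filter technique studied in the paper is essentially a Bloom-filter-style pre-join test: hash the join-attribute values of one relation into a bit vector and use it to discard non-matching tuples of the other relation early. A minimal sketch follows, with invented relation contents; false positives are possible, but no true match is lost.

```python
# Illustrative bit-vector filter: build a bit vector from the inner relation's
# join-attribute values, then use it to prefilter the outer relation.
BITS = 64

def build_bit_vector(relation, key_index=0):
    bv = 0
    for t in relation:
        bv |= 1 << (hash(t[key_index]) % BITS)
    return bv

def probe(bv, value):
    return bool(bv & (1 << (hash(value) % BITS)))

inner = [(2, "b"), (4, "d"), (7, "g")]
outer = [(i, f"o{i}") for i in range(10)]

bv = build_bit_vector(inner)
survivors = [t for t in outer if probe(bv, t[0])]   # a superset of true matches
print(survivors)
```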
international conference on data engineering | 2003
Yuan Wang; David J. DeWitt; Jin-yi Cai
XML has become the de facto standard format for Web publishing and data transportation. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work in change detection on XML, or other hierarchically structured documents, used an ordered tree model, in which left-to-right order among siblings is significant and can affect the change result. We argue that an unordered model (only ancestor relationships are significant) is more suitable for most database applications. Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate. We propose X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. The algorithm is analyzed and compared with XyDiff [CAM02], a published XML diff algorithm. An experimental evaluation of both algorithms is provided.
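The core difficulty of the unordered model, matching subtrees regardless of sibling order, can be illustrated with an order-insensitive subtree hash, in the spirit of (though much simpler than) the hashing X-Diff uses. The function names below are invented for illustration.

```python
# Minimal sketch of order-insensitive subtree hashing: a subtree's hash
# combines its children's hashes in sorted (hence order-independent) order,
# so documents differing only in sibling order hash identically.
import hashlib
import xml.etree.ElementTree as ET

def subtree_hash(elem):
    child_hashes = sorted(subtree_hash(c) for c in elem)
    text = (elem.text or "").strip()
    payload = elem.tag + "|" + text + "|" + ",".join(child_hashes)
    return hashlib.sha1(payload.encode()).hexdigest()

a = ET.fromstring("<order><item>pen</item><item>ink</item></order>")
b = ET.fromstring("<order><item>ink</item><item>pen</item></order>")

print(subtree_hash(a) == subtree_hash(b))   # True: sibling order is ignored
```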