David J. DeWitt
Microsoft
Publications
Featured research published by David J. DeWitt.
international conference on management of data | 2000
Jianjun Chen; David J. DeWitt; Feng Tian; Yuan Wang
Continuous queries are persistent queries that allow users to receive new results when they become available. While continuous query systems can transform a passive web into an active environment, they need to be able to support millions of queries due to the scale of the Internet. No existing systems have achieved this level of scalability. NiagaraCQ addresses this problem by grouping continuous queries based on the observation that many web queries share similar structures. Grouped queries can share common computation, tend to fit in memory, and can reduce I/O cost significantly. Furthermore, grouping on selection predicates can eliminate a large number of unnecessary query invocations. Our grouping technique is distinguished from previous group optimization approaches in the following ways. First, we use an incremental group optimization strategy with dynamic re-grouping. New queries are added to existing query groups without having to regroup already installed queries. Second, we use a query-split scheme that requires minimal changes to a general-purpose query engine. Third, NiagaraCQ groups both change-based and timer-based queries in a uniform way. To ensure that NiagaraCQ is scalable, we have also employed other techniques, including incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and memory caching. This paper presents the design of the NiagaraCQ system and gives some experimental results on the system's performance and scalability.
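The predicate-grouping idea is easy to picture with a small sketch. The Python fragment below is illustrative only: the class and method names are invented, and NiagaraCQ itself was built on the Niagara query engine, not shown here. Queries that share the same selection signature are folded into one group whose constants live in a table, so installing a new query never requires regrouping the queries already present, and one pass over newly arrived data serves every query in a group.

```python
# Minimal sketch (not NiagaraCQ's code) of grouping continuous queries by
# selection-predicate signature. All names here are illustrative.
from collections import defaultdict

class ContinuousQuery:
    def __init__(self, qid, attribute, op, constant, callback):
        self.qid = qid
        self.attribute = attribute      # e.g. "symbol"
        self.op = op                    # e.g. "=="
        self.constant = constant        # e.g. "MSFT"
        self.callback = callback        # invoked when new results arrive

class QueryGroupTable:
    """Queries with the same (attribute, op) signature share one group; their
    constants live in a table, so adding a query never regroups the others."""
    def __init__(self):
        self.groups = defaultdict(dict)  # (attribute, op) -> {constant: [queries]}

    def add_query(self, q):
        self.groups[(q.attribute, q.op)].setdefault(q.constant, []).append(q)

    def on_new_tuples(self, tuples):
        # One pass over the change data serves every installed query.
        for (attribute, op), const_table in self.groups.items():
            if op != "==":
                continue  # only equality predicates in this sketch
            for t in tuples:
                for q in const_table.get(t.get(attribute), []):
                    q.callback(t)

table = QueryGroupTable()
table.add_query(ContinuousQuery(1, "symbol", "==", "MSFT",
                                lambda t: print("query 1 matched", t)))
table.add_query(ContinuousQuery(2, "symbol", "==", "ORCL",
                                lambda t: print("query 2 matched", t)))
table.on_new_tuples([{"symbol": "MSFT", "price": 410.0}])
```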
international conference on management of data | 2009
Andrew Pavlo; Erik Paulson; Alexander Rasin; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Michael Stonebraker
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
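To give a feel for the two styles being compared, the toy fragment below restates one of the simplest benchmark-style tasks (a pattern-matching scan) once as a map function and once as a declarative query. It is purely illustrative and single-process; the paper's actual benchmark ran Hadoop and two parallel DBMSs on a 100-node cluster, and the names here are invented.

```python
# Illustrative only: a toy, single-process rendition of a "grep"-style scan,
# written once in MapReduce style and once as the SQL a DBMS would run.
import re
from itertools import chain

records = ["abc XYZ def", "no match here", "XYZ again"]

# MapReduce style: the map function emits matching records; reduce is identity.
def map_fn(record):
    return [record] if re.search(r"XYZ", record) else []

mr_result = list(chain.from_iterable(map_fn(r) for r in records))

# Declarative style: the same task expressed as a query.
sql = "SELECT * FROM records WHERE field LIKE '%XYZ%';"

print(mr_result)   # ['abc XYZ def', 'XYZ again']
print(sql)
```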
international conference on management of data | 2001
Chun Zhang; Jeffrey F. Naughton; David J. DeWitt; Qiong Luo; Guy M. Lohman
Virtually all proposals for querying XML include a class of query we term “containment queries”. It is also clear that in the foreseeable future, a substantial amount of XML data will be stored in relational database systems. This raises the question of how to support these containment queries. The inverted list technology that underlies much of Information Retrieval is well-suited to these queries, but should we implement this technology (a) in a separate loosely-coupled IR engine, or (b) using the native tables and query execution machinery of the RDBMS? With option (b), more than twenty years of work on RDBMS query optimization, query execution, scalability, and concurrency control and recovery immediately extend to the queries and structures that implement these new operations. But all this will be irrelevant if the performance of option (b) lags that of (a) by too much. In this paper, we explore some performance implications of both options using native implementations in two commercial relational database systems and in a special purpose inverted list engine. Our performance study shows that while RDBMSs are generally poorly suited for such queries, under certain conditions they can outperform an inverted list engine. Our analysis further identifies two significant causes that differentiate the performance of the IR and RDBMS implementations: the join algorithms employed and the hardware cache utilization. Our results suggest that, contrary to most expectations, with some modifications a native implementation in an RDBMS can support this class of query much more efficiently.
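Option (b) is easy to sketch: the inverted lists become ordinary tables and a containment query becomes a join on position ranges. The fragment below uses sqlite3 only as a stand-in RDBMS, and the table and column names approximate the paper's scheme rather than reproduce it.

```python
# Minimal, self-contained sketch of storing element and word positions as
# inverted lists in relational tables and expressing a containment query as a
# join. Column names are illustrative, not taken from the paper.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE elements (term TEXT, docno INT, startpos INT, endpos INT)")
cur.execute("CREATE TABLE words (term TEXT, docno INT, wordno INT)")

# <book><title>database systems</title></book>, with positions as word offsets
cur.executemany("INSERT INTO elements VALUES (?,?,?,?)",
                [("book", 1, 0, 5), ("title", 1, 1, 4)])
cur.executemany("INSERT INTO words VALUES (?,?,?)",
                [("database", 1, 2), ("systems", 1, 3)])

# Containment query: title elements that contain the word "database".
cur.execute("""
    SELECT e.docno, e.startpos, e.endpos
    FROM elements e JOIN words w
      ON e.docno = w.docno AND w.wordno > e.startpos AND w.wordno < e.endpos
    WHERE e.term = 'title' AND w.term = 'database'
""")
print(cur.fetchall())   # [(1, 1, 4)]
```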
international conference on management of data | 1984
David J. DeWitt; Randy H. Katz; Frank Olken; Leonard D. Shapiro; Michael Stonebraker; David A. Wood
With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in main memory. In this paper we consider the changes necessary to permit a relational database system to take advantage of large amounts of main memory. We evaluate AVL vs. B+-tree access methods for main memory databases, hash-based query processing strategies vs. sort-merge, and study recovery issues when most or all of the database fits in main memory. As expected, B+-trees are the preferred storage mechanism unless more than 80-90% of the database fits in main memory. A somewhat surprising result is that hash-based query processing strategies are advantageous for large memory situations.
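The two query-processing strategies being compared can be sketched in a few lines. The code below is written for clarity rather than performance; the relation contents and function names are invented for illustration and are not from the paper.

```python
# Illustrative sketch: a classic in-memory hash join versus a sort-merge join.
from collections import defaultdict

R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]   # (key, payload)
S = [(2, "x"), (3, "y"), (4, "z")]

def hash_join(r, s):
    # Build a hash table on one relation, then probe it with the other.
    table = defaultdict(list)
    for k, v in r:
        table[k].append(v)
    return [(k, v, w) for k, w in s for v in table.get(k, [])]

def sort_merge_join(r, s):
    # Sort both inputs on the join key, then merge them in lockstep.
    r, s = sorted(r), sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            k, i2 = r[i][0], i
            while i2 < len(r) and r[i2][0] == k:
                j2 = j
                while j2 < len(s) and s[j2][0] == k:
                    out.append((k, r[i2][1], s[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

print(hash_join(R, S))        # [(2, 'b', 'x'), (2, 'c', 'x'), (4, 'd', 'z')]
print(sort_merge_join(R, S))  # same pairs, produced in key order
```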
IEEE Transactions on Knowledge and Data Engineering | 1990
David J. DeWitt; Shahram Ghandeharizadeh; Donovan A. Schneider; Allan Bricker; Hui-I Hsiao; Rick Rasmussen
The design of the Gamma database machine and the techniques employed in its implementation are described. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, parallel algorithms based on hashing are used to implement the complex relational operators, such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques, it is possible to control the execution of very complex queries with minimal coordination. The design of the Gamma software is described and a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is presented.
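The first two ideas can be sketched compactly: hash-partition each relation across nodes on the join attribute, then let each node join only its own bucket. The toy code below is sequential and purely illustrative; it is not Gamma's implementation, and the names are invented.

```python
# Toy sketch of horizontal partitioning plus a hash-based "parallel" join,
# where both inputs are repartitioned on the join attribute so each node
# joins only its own bucket.
NUM_NODES = 4

def partition(relation, key_index=0, nodes=NUM_NODES):
    """Hash-partition tuples across nodes on the given attribute."""
    buckets = [[] for _ in range(nodes)]
    for t in relation:
        buckets[hash(t[key_index]) % nodes].append(t)
    return buckets

def local_hash_join(r_part, s_part):
    # Standard build/probe hash join on one node's bucket.
    table = {}
    for t in r_part:
        table.setdefault(t[0], []).append(t)
    return [(rt, st) for st in s_part for rt in table.get(st[0], [])]

R = [(i, f"r{i}") for i in range(10)]
S = [(i, f"s{i}") for i in range(0, 10, 2)]

# Repartition both relations on the join key, then join each bucket per "node".
r_parts, s_parts = partition(R), partition(S)
result = [pair for n in range(NUM_NODES)
          for pair in local_hash_join(r_parts[n], s_parts[n])]
print(sorted(result))
```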
international conference on management of data | 1994
Michael J. Carey; David J. DeWitt; Michael J. Franklin; Nancy Hall; Mark L. McAuliffe; Jeffrey F. Naughton; Daniel T. Schuh; Marvin H. Solomon; C. K. Tan; Odysseas G. Tsatalos; Seth J. White; Michael J. Zwilling
SHORE (Scalable Heterogeneous Object REpository) is a persistent object system under development at the University of Wisconsin. SHORE represents a merger of object-oriented database and file system technologies. In this paper we give the goals and motivation for SHORE, and describe how SHORE provides features of both technologies. We also describe some novel aspects of the SHORE architecture, including a symmetric peer-to-peer server architecture, server customization through an extensible value-added server facility, and support for scalability on multiprocessor systems. An initial version of SHORE is already operational, and we expect a release of Version 1 in mid-1994.
international conference on management of data | 2005
Jim Gray; David T. Liu; Maria A. Nieto-Santisteban; Alexander S. Szalay; David J. DeWitt; Gerd Heber
Scientific instruments and computer simulations are creating vast data stores that require new scientific methods to analyze and organize the data. Data volumes are approximately doubling each year. Since these new instruments have extraordinary precision, the data quality is also rapidly improving. Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Communications of The ACM | 2010
Michael Stonebraker; Daniel J. Abadi; David J. DeWitt; Samuel Madden; Erik Paulson; Andrew Pavlo; Alexander Rasin
MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.
international conference on management of data | 1989
Donovan A. Schneider; David J. DeWitt
In this paper we analyze and compare four parallel join algorithms. Grace and Hybrid hash represent the class of hash-based join methods, Simple hash represents a looping algorithm with hashing, and our last algorithm is the more traditional sort-merge. The performance of each of the algorithms with different tuple distribution policies, the addition of bit vector filters, varying amounts of main-memory for joining, and non-uniformly distributed join attribute values is studied. The Hybrid hash-join algorithm is found to be superior except when the join attribute values of the inner relation are non-uniformly distributed and memory is limited. In this case, a more conservative algorithm such as the sort-merge algorithm should be used. The Gamma database machine serves as the host for the performance comparison.
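The bit-vector-filter technique studied in the paper is essentially a Bloom-filter-style pre-join test: hash the join-attribute values of one relation into a bit vector and use it to discard non-matching tuples of the other relation early. A minimal sketch follows, with invented relation contents; false positives are possible, but no true match is lost.

```python
# Illustrative bit-vector filter: build a bit vector from the inner relation's
# join-attribute values, then use it to prefilter the outer relation.
BITS = 64

def build_bit_vector(relation, key_index=0):
    bv = 0
    for t in relation:
        bv |= 1 << (hash(t[key_index]) % BITS)
    return bv

def probe(bv, value):
    return bool(bv & (1 << (hash(value) % BITS)))

inner = [(2, "b"), (4, "d"), (7, "g")]
outer = [(i, f"o{i}") for i in range(10)]

bv = build_bit_vector(inner)
survivors = [t for t in outer if probe(bv, t[0])]   # a superset of true matches
print(survivors)
```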
international conference on data engineering | 2003
Yuan Wang; David J. DeWitt; Jin-yi Cai
XML has become the de facto standard format for Web publishing and data transportation. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work in change detection on XML, or other hierarchically structured documents, used an ordered tree model, in which left-to-right order among siblings is significant and can affect the change result. We argue that an unordered model (only ancestor relationships are significant) is more suitable for most database applications. Using an unordered model, change detection is substantially harder than using the ordered model, but the change result that it generates is more accurate. We propose X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. The algorithm is analyzed and compared with XyDiff [CAM02], a published XML diff algorithm. An experimental evaluation of both algorithms is provided.
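The core difficulty of the unordered model, matching subtrees regardless of sibling order, can be illustrated with an order-insensitive subtree hash, in the spirit of (though much simpler than) the hashing X-Diff uses. The function names below are invented for illustration.

```python
# Minimal sketch of order-insensitive subtree hashing: a subtree's hash
# combines its children's hashes in sorted (hence order-independent) order,
# so documents differing only in sibling order hash identically.
import hashlib
import xml.etree.ElementTree as ET

def subtree_hash(elem):
    child_hashes = sorted(subtree_hash(c) for c in elem)
    text = (elem.text or "").strip()
    payload = elem.tag + "|" + text + "|" + ",".join(child_hashes)
    return hashlib.sha1(payload.encode()).hexdigest()

a = ET.fromstring("<order><item>pen</item><item>ink</item></order>")
b = ET.fromstring("<order><item>ink</item><item>pen</item></order>")

print(subtree_hash(a) == subtree_hash(b))   # True: sibling order is ignored
```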