Elizabeth J. O'Neil
University of Massachusetts Boston
Publications
Featured research published by Elizabeth J. O'Neil.
international conference on management of data | 1993
Elizabeth J. O'Neil; Patrick E. O'Neil; Gerhard Weikum
This paper introduces a new approach to database disk buffering, called the LRU-K method. The basic idea of LRU-K is to keep track of the times of the last K references to popular database pages, using this information to statistically estimate the interarrival times of references on a page by page basis. Although the LRU-K approach performs optimal statistical inference under relatively standard assumptions, it is fairly simple and incurs little bookkeeping overhead. As we demonstrate with simulation experiments, the LRU-K algorithm surpasses conventional buffering algorithms in discriminating between frequently and infrequently referenced pages. In fact, LRU-K can approach the behavior of buffering algorithms in which page sets with known access frequencies are manually assigned to different buffer pools of specifically tuned sizes. Unlike such customized buffering algorithms however, the LRU-K method is self-tuning, and does not rely on external hints about workload characteristics. Furthermore, the LRU-K algorithm adapts in real time to changing patterns of access.
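To make the replacement rule concrete, here is a minimal sketch of an LRU-K style buffer, not the paper's implementation. It assumes a logical clock instead of wall time, keeps reference history for evicted pages without ever purging it, and omits refinements such as handling correlated references. The class name `LRUKBuffer` and its methods are hypothetical.

```python
from collections import deque


class LRUKBuffer:
    """Toy LRU-K buffer: evict the page whose K-th most recent reference is oldest."""

    def __init__(self, capacity, k=2):
        self.capacity = capacity
        self.k = k
        self.clock = 0        # logical time, incremented on every reference
        self.resident = set() # pages currently held in the buffer
        self.history = {}     # page -> up to k most recent reference times

    def _kth_last_time(self, page):
        times = self.history[page]
        # Pages with fewer than k known references are treated as
        # infinitely old, so they are the first eviction candidates.
        return times[0] if len(times) == self.k else float("-inf")

    def access(self, page):
        """Reference a page; return the evicted page, or None."""
        self.clock += 1
        evicted = None
        if page not in self.resident:
            if len(self.resident) >= self.capacity:
                # Oldest K-th reference loses; ties fall back to plain LRU.
                evicted = min(
                    self.resident,
                    key=lambda p: (self._kth_last_time(p), self.history[p][-1]),
                )
                self.resident.remove(evicted)
            self.resident.add(page)
        self.history.setdefault(page, deque(maxlen=self.k)).append(self.clock)
        return evicted
```

With a capacity of two and K = 2, the reference string A, A, B, C evicts B rather than A: A has two recorded references while B has only one, whereas plain LRU would evict A.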
international conference on management of data | 1995
Hal Berenson; Philip A. Bernstein; Jim Gray; Jim Melton; Elizabeth J. O'Neil; Patrick E. O'Neil
ANSI SQL-92 [MS, ANSI] defines Isolation Levels in terms of phenomena: Dirty Reads, Non-Repeatable Reads, and Phantoms. This paper shows that these phenomena and the ANSI SQL definitions fail to properly characterize several popular isolation levels, including the standard locking implementations of the levels covered. Ambiguity in the statement of the phenomena is investigated and a more formal statement is arrived at; in addition new phenomena that better characterize isolation types are introduced. Finally, an important multiversion isolation type, called Snapshot Isolation, is defined.
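As an illustration of how Snapshot Isolation behaves, the following toy sketch models a multiversion store with a first-committer-wins check and runs a write-skew scenario. It is a simplification under stated assumptions, not any product's implementation: the names `Store`, `Transaction`, `WriteConflict`, and `withdraw` are hypothetical, timestamps are simple counters, and the invariant x + y >= 0 is an example chosen for this sketch.

```python
class WriteConflict(Exception):
    """Raised when first-committer-wins rejects a commit."""


class Store:
    """Toy multiversion store: key -> list of (commit_ts, value), oldest first."""

    def __init__(self, initial):
        self.last_commit_ts = 0
        self.versions = {key: [(0, value)] for key, value in initial.items()}

    def begin(self):
        return Transaction(self, self.last_commit_ts)


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts  # snapshot timestamp
        self.writes = {}          # private write set, invisible to others

    def read(self, key):
        if key in self.writes:    # a transaction sees its own writes
            return self.writes[key]
        # Otherwise return the latest version committed by snapshot time.
        for commit_ts, value in reversed(self.store.versions[key]):
            if commit_ts <= self.start_ts:
                return value
        raise KeyError(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First-committer-wins: abort if any key we wrote was committed
        # by another transaction after our snapshot was taken.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise WriteConflict(key)
        self.store.last_commit_ts += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append(
                (self.store.last_commit_ts, value))


def withdraw(txn, account, amount):
    """Withdraw only if the combined balance x + y covers the amount."""
    if txn.read("x") + txn.read("y") >= amount:
        txn.write(account, txn.read(account) - amount)


store = Store({"x": 50, "y": 50})
t1, t2 = store.begin(), store.begin()  # concurrent: both see the same snapshot
withdraw(t1, "x", 70)
withdraw(t2, "y", 70)
t1.commit()
t2.commit()                            # disjoint write sets, so both commit
final = store.begin()
print(final.read("x") + final.read("y"))  # -40: the invariant x + y >= 0 is broken
```

Each transaction alone preserves the invariant, and no serial order of the two would let both withdrawals succeed. Because their write sets are disjoint, the first-committer-wins check does not catch the problem.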
Acta Informatica | 1996
Patrick E. O'Neil; Edward Y. C. Cheng; Dieter Gawlick; Elizabeth J. O'Neil
High-performance transaction system applications typically insert rows in a History table to provide an activity trace; at the same time the transaction system generates log records for purposes of system recovery. Both types of generated information can benefit from efficient indexing. An example in a well-known setting is the TPC-A benchmark application, modified to support efficient queries on the history for account activity for specific accounts. This requires an index by account-id on the fast-growing History table. Unfortunately, standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to traditional access methods such as B-trees, and will improve cost-performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for history tables and log files, for example. The conclusions of Sect. 6 compare the hybrid use of memory and disk components in the LSM-tree access method with the commonly understood advantage of the hybrid method to buffer disk pages in memory.
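The deferred, batched merge can be sketched with two components. This is a rough illustration under simplifying assumptions, not the paper's design: the "disk" component is an in-memory sorted list, the merge rewrites that whole component at once instead of rolling through it incrementally, and there is only one disk level. The names `TinyLSM`, `memory_limit`, and `TOMBSTONE` are hypothetical.

```python
import bisect

TOMBSTONE = object()  # marker recording a deferred delete


class TinyLSM:
    """Two-component sketch: c0 absorbs writes, c1 is one sorted run."""

    def __init__(self, memory_limit=4):
        self.memory_limit = memory_limit
        self.c0 = {}        # memory component: most recent changes
        self.c1_keys = []   # stand-in for the disk component: sorted keys...
        self.c1_vals = []   # ...and their values, kept in parallel

    def put(self, key, value):
        self.c0[key] = value  # index change is deferred, no "disk" I/O yet
        if len(self.c0) >= self.memory_limit:
            self._merge()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def get(self, key, default=None):
        if key in self.c0:    # newest data wins
            value = self.c0[key]
            return default if value is TOMBSTONE else value
        i = bisect.bisect_left(self.c1_keys, key)
        if i < len(self.c1_keys) and self.c1_keys[i] == key:
            return self.c1_vals[i]
        return default

    def _merge(self):
        """Merge the sorted batch from c0 into c1 in one sequential pass."""
        batch = sorted(self.c0.items())
        old = list(zip(self.c1_keys, self.c1_vals))
        merged, i, j = [], 0, 0
        while i < len(batch) and j < len(old):
            if batch[i][0] < old[j][0]:
                merged.append(batch[i]); i += 1
            elif batch[i][0] > old[j][0]:
                merged.append(old[j]); j += 1
            else:             # same key: the newer entry from c0 replaces it
                merged.append(batch[i]); i += 1; j += 1
        merged.extend(batch[i:])
        merged.extend(old[j:])
        # c1 is the last component here, so tombstones can be dropped.
        live = [(k, v) for k, v in merged if v is not TOMBSTONE]
        self.c1_keys = [k for k, _ in live]
        self.c1_vals = [v for _, v in live]
        self.c0 = {}
```

Lookups consult the memory component first and fall back to the sorted run, so entries stay retrievable while changes wait to be merged.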
ACM Transactions on Database Systems | 2005
Alan Fekete; Dimitrios Liarokapis; Elizabeth J. O'Neil; Patrick E. O'Neil; Dennis E. Shasha
Snapshot Isolation (SI) is a multiversion concurrency control algorithm, first described in Berenson et al. [1995]. SI is attractive because it provides an isolation level that avoids many of the common concurrency anomalies, and has been implemented by Oracle and Microsoft SQL Server (with certain minor variations). SI does not guarantee serializability in all cases, but the TPC-C benchmark application [TPC-C], for example, executes under SI without serialization anomalies. All major database system products are delivered with default nonserializable isolation levels, often ones that encounter serialization anomalies more commonly than SI, and we suspect that numerous isolation errors occur each day at many large sites because of this, leading to corrupt data sometimes noted in data warehouse applications. The classical justification for lower isolation levels is that applications can be run under such levels to improve efficiency when they can be shown not to result in serious errors, but little or no guidance has been offered to application programmers and DBAs by vendors as to how to avoid such errors. This article develops a theory that characterizes when nonserializable executions of applications can occur under SI. Near the end of the article, we apply this theory to demonstrate that the TPC-C benchmark application has no serialization anomalies under SI, and then discuss how this demonstration can be generalized to other applications. We also present a discussion on how to modify the program logic of applications that are nonserializable under SI so that serializability will be guaranteed.
international conference on management of data | 2008
Elizabeth J. O'Neil
Object/Relational Mapping (ORM) provides a methodology and mechanism for object-oriented systems to hold their long-term data safely in a database, with transactional control over it, yet have it expressed when needed in program objects. Instead of bundles of special code for this, ORM encourages models and use of constraints for the application, which then runs in a context set up by the ORM. Today's web applications are particularly well-suited to this approach, as they are necessarily multithreaded and thus are prone to race conditions unless the interaction with the database is very carefully implemented. The ORM approach was first realized in Hibernate, an open source project for Java systems started in 2002, and this year is joined by Microsoft's Entity Data Model for .NET systems. Both are described here.
international conference on management of data | 2004
Alan Fekete; Elizabeth J. O'Neil; Patrick E. O'Neil
Snapshot Isolation (SI) is a multi-version concurrency control algorithm introduced in [BBGMOO95] and later implemented by Oracle. SI avoids many concurrency errors, and it never delays read-only transactions. However, it does not guarantee serializability. It has been widely assumed that, under SI, read-only transactions always execute serializably provided the concurrent update transactions are serializable. The reason for this is that all SI reads return values from a single instant of time when all committed transactions have completed their writes and no writes of non-committed transactions are visible. This seems to imply that read-only transactions will not read anomalous results so long as the update transactions with which they execute do not write such results. In the current note, however, we exhibit an example contradicting these assumptions: it is possible for an SI history to be non-serializable while the sub-history containing all update transactions is serializable.
international database engineering and applications symposium | 2007
Elizabeth J. O'Neil; Patrick E. O'Neil; Kesheng Wu
Historically, bitmap indexing has provided an important database capability to accelerate queries. However, only a few database systems have implemented these indexes because of the difficulties of modifying fundamental assumptions in the low-level design of a database system and in the expectations of customers, both of which have developed in an environment that does not support bitmap indexes. Another problem that arises, and one that may more easily be addressed by a research article, is that there is no definitive design for bitmap indexes; bitmap index designs in Oracle, Sybase IQ, Vertica and MODEL 204 are idiosyncratic, and some of them were designed for older machine architectures. To investigate an efficient design on modern processors, this paper provides details of the Set Query benchmark and a comparison of two research implementations of bitmap indexes. One, called RIDBit, uses the N-ary storage model to organize table rows, and implements a strategy that gracefully switches between the well-known B-tree RID-list structure and a bitmap structure. The other, called FastBit, is based on vertical organization of the table data, where all columns are individually stored. It implements a compressed bitmap index, with a linear organization of the bitmaps to optimize disk accesses. Through this comparison, we evaluate the pros and cons of various design choices. Our analysis adds a number of subtleties to the conventional indexing wisdom commonly quoted in the database community.
international conference on management of data | 2001
Denis Rinfret; Patrick E. O'Neil; Elizabeth J. O'Neil
The bit-sliced index (BSI) was originally defined in [ONQ97]. The current paper introduces the concept of BSI arithmetic. For any two BSIs X and Y on a table T, we show how to efficiently generate new BSIs Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min) (see [OO00, DB2SQL] for definitions of these clauses). Another contribution of the current paper is to generalize BSI range restrictions from [ONQ97] to a new non-Boolean form: to determine the top k BSI-valued rows, for any meaningful value k between one and the total number of rows in T. Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms. A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term-Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing we propose for native database operations.
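The addition Z = X + Y can be sketched as a ripple-carry adder over bitmaps, so that each bitwise operation processes every row at once. This is an illustrative sketch, not the paper's code, and it makes several assumptions: Python integers stand in for bitmaps (bit r represents row r), values are non-negative integers, and the helper names `build_bsi`, `bsi_add`, and `bsi_values` are hypothetical.

```python
def build_bsi(values):
    """Build slices so that slices[i] has bit r set iff bit i of values[r] is 1."""
    nbits = max(values).bit_length() if values else 0
    slices = []
    for i in range(nbits):
        bitmap = 0
        for row, value in enumerate(values):
            if (value >> i) & 1:
                bitmap |= 1 << row
        slices.append(bitmap)
    return slices


def bsi_add(x, y):
    """Return the BSI Z = X + Y using only bitmap AND, OR and XOR."""
    result = []
    carry = 0  # bitmap of rows that carry into the next slice
    for i in range(max(len(x), len(y))):
        xi = x[i] if i < len(x) else 0
        yi = y[i] if i < len(y) else 0
        result.append(xi ^ yi ^ carry)              # sum bit for every row
        carry = (xi & yi) | (carry & (xi ^ yi))     # carry bit for every row
    if carry:
        result.append(carry)                        # overflow into a new slice
    return result


def bsi_values(slices, nrows):
    """Decode a BSI back into one integer per row (for checking only)."""
    return [
        sum(((bitmap >> row) & 1) << i for i, bitmap in enumerate(slices))
        for row in range(nrows)
    ]


x = build_bsi([3, 5, 0, 7])
y = build_bsi([1, 2, 4, 9])
print(bsi_values(bsi_add(x, y), 4))  # [4, 7, 4, 16]
```

The work is proportional to the number of slices, not the number of rows, provided the bitmap operations themselves are cheap.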
international conference on data engineering | 2011
Stephen Revilak; Patrick E. O'Neil; Elizabeth J. O'Neil
Many popular database management systems provide snapshot isolation (SI) for concurrency control, either in addition to or in place of full serializability based on locking. Snapshot isolation was introduced in 1995 [2], with noted anomalies that can lead to serializability violations. Full serializability was provided in 2008 [4] and improved in 2009 [5] by aborting transactions in dangerous structures, which had been shown in 2005 [9] to be precursors to potential SI anomalies. This approach resulted in a runtime environment guaranteeing a serializable form of snapshot isolation (which we call SSI [4] or ESSI [5]) for arbitrary applications. But transactions in a dangerous structure frequently do not cause true anomalies so, as the authors point out, their method is conservative: it can cause unnecessary aborts. In the current paper, we demonstrate our PSSI algorithm to detect cycles in a snapshot isolation dependency graph and abort transactions to break the cycle. This algorithm provides a much more precise criterion to perform aborts. We have implemented our algorithm in an open source production database system (MySQL/InnoDB), and our performance study shows that PSSI throughput improves on ESSI, with significantly fewer aborts.
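The core idea of finding a cycle in a transaction dependency graph can be illustrated with a depth-first search. This is a generic sketch and not the PSSI implementation: it assumes the graph is already available as a dictionary mapping each transaction to the transactions it has a dependency edge to, and the function name `find_cycle` is hypothetical.

```python
def find_cycle(graph):
    """Return a list of nodes forming a cycle in a directed graph, or None."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}
    path = []  # nodes on the current depth-first path

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for successor in graph.get(node, ()):
            successor_state = state.get(successor, UNSEEN)
            if successor_state == ACTIVE:
                # Back edge: the cycle is the path from successor onward.
                return path[path.index(successor):]
            if successor_state == UNSEEN:
                cycle = visit(successor)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Two transactions that each depend on the other form a cycle.
print(find_cycle({"T1": ["T2"], "T2": ["T1"]}))              # ['T1', 'T2']
print(find_cycle({"T1": ["T2"], "T2": ["T3"], "T3": []}))    # None
```

A scheduler built on this idea would abort one transaction on the returned cycle; how the victim is chosen and when the check runs are design choices this sketch leaves open.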
Information & Computation | 1973
Patrick E. O'Neil; Elizabeth J. O'Neil
A probabilistic algorithm is presented to calculate the Boolean product of two n × n Boolean matrices using an expected number of elementary operations of O(n²). Asymptotically in n, almost all pairs of matrices may be multiplied using this algorithm in O(n^(2+ε)) elementary operations for any ε > 0.