Publications


Featured research published by Lisa Spainhower.


IBM Journal of Research and Development | 1999

IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective

Lisa Spainhower; Thomas A. Gregg

Fault tolerance in IBM S/390® systems during the 1980s and 1990s had three distinct phases, each characterized by a different uptime improvement rate. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging disallowed on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex® in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.
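
The dynamic CPU sparing mentioned above can be pictured as checkpointing each processor's architectural state and transparently resuming it on a spare when a fault is detected. The Python sketch below illustrates only that idea; all names and structure are hypothetical, not taken from the G5 design.

```python
# Hypothetical sketch of dynamic CPU sparing: each active CPU checkpoints
# its architectural state; on an unrecoverable error, that state is moved
# to a spare and execution resumes there, invisibly to the workload.
# Names and structure are illustrative, not taken from the G5 design.

from dataclasses import dataclass, field

@dataclass
class Cpu:
    cpu_id: int
    healthy: bool = True
    state: dict = field(default_factory=dict)   # architectural state

class SparingManager:
    def __init__(self, active, spares):
        self.active = {c.cpu_id: c for c in active}
        self.spares = list(spares)

    def checkpoint(self, cpu_id, state):
        """Record the last known-good architectural state of a CPU."""
        self.active[cpu_id].state = dict(state)

    def on_fatal_error(self, cpu_id):
        """Transparently move the failed CPU's work to a spare."""
        failed = self.active.pop(cpu_id)
        failed.healthy = False
        if not self.spares:
            raise RuntimeError("no spare CPU available; work is lost")
        spare = self.spares.pop(0)
        spare.state = failed.state           # resume from the checkpoint
        self.active[spare.cpu_id] = spare
        return spare.cpu_id                  # workload continues here

# Usage: CPU 0 fails mid-run; its checkpointed state lands on the spare.
mgr = SparingManager(active=[Cpu(0), Cpu(1)], spares=[Cpu(9)])
mgr.checkpoint(0, {"pc": 0x4000, "regs": [1, 2, 3]})
print("resumed on CPU", mgr.on_fatal_error(0))
```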


Integrated Network Management | 2003

Generic on-line discovery of quantitative models for service level management

Yixin Diao; Frank Eskesen; Steven E. Froehlich; Joseph L. Hellerstein; Alexander Keller; Lisa Spainhower; Maheswaran Surendra

Quantitative models are needed for a variety of management tasks, including (a) identification of critical variables to use for health monitoring, (b) anticipating service level violations by using predictive models, and (c) on-going optimization of configurations. Unfortunately, constructing quantitative models requires specialized skills that are in short supply. Even worse, rapid changes in provider configurations and the evolution of business demands mean that quantitative models must be updated on an on-going basis. This paper describes an architecture and algorithms for on-line discovery of quantitative models without prior knowledge of the managed elements. The architecture makes use of an element schema that describes managed elements using the common information model (CIM). Algorithms are presented for selecting a subset of the element metrics to use as explanatory variables in a quantitative model and for constructing the quantitative model itself. We further describe a prototype system based on this architecture that incorporates these algorithms. We apply the prototype to on-line estimation of response times for DB2 Universal Database under a TPC-W workload. Of the approximately 500 metrics available from the DB2 performance monitor, our system chooses 3 to construct a model that explains 72% of the variability of response time.
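
The paper's two algorithmic pieces are metric selection and model construction. As a hedged stand-in (the abstract does not give the exact algorithms), the sketch below uses greedy forward selection of explanatory variables scored by R², the same "fraction of variance explained" figure quoted above.

```python
# Hedged sketch of the two steps named in the abstract: (1) pick a small
# subset of metrics as explanatory variables, and (2) fit a quantitative
# model on them. The paper's exact procedure is not reproduced here; this
# uses greedy forward selection with ordinary least squares, scored by
# R^2 (fraction of variance explained).

import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on X (with an intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def forward_select(metrics, y, k=3):
    """Greedily add the metric that most improves R^2, up to k metrics."""
    chosen = []
    remaining = list(range(metrics.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: r_squared(metrics[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen, r_squared(metrics[:, chosen], y)

# Synthetic stand-in for "~500 monitor metrics vs. response time":
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2 * X[:, 3] - X[:, 17] + 0.5 * X[:, 42] + rng.normal(scale=0.5, size=200)
cols, score = forward_select(X, y, k=3)
print(f"chose metrics {cols}; model explains {score:.0%} of the variance")
```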


Distributed Systems Operations and Management | 2003

Generic Online Optimization of Multiple Configuration Parameters with Application to a Database Server

Yixin Diao; Frank Eskesen; Steven E. Froehlich; Joseph L. Hellerstein; Lisa Spainhower; Maheswaran Surendra

Optimizing configuration parameters is time-consuming and skills-intensive. This paper proposes a generic approach to automating this task. By generic, we mean that the approach is relatively independent of the target system for which the optimization is done. Our approach uses online adjustment of configuration parameters to discover the system’s performance characteristics. Doing so creates two challenges: (1) handling interdependencies between configuration parameters and (2) minimizing the deleterious effects on production workload while the optimization is underway. Our approach addresses (1) by including in the architecture a rule-based component that handles interdependencies between configuration parameters. For (2), we use a feedback mechanism for online optimization that searches the parameter space in a way that generally avoids poor performance at intermediate steps. Our studies of a DB2 Universal Database Server under an e-commerce workload indicate that our approach can be effective in practice.
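
The abstract's feedback mechanism can be approximated by a cautious coordinate-wise hill climber that changes one parameter by a small step and keeps the change only if measured performance improves, so intermediate settings never stray far from a known-good point. The sketch below is that generic stand-in, not the paper's algorithm; it omits the rule-based handling of parameter interdependencies, and measure_throughput is a hypothetical hook.

```python
# Hedged sketch of online configuration tuning in the spirit described
# above: adjust one knob at a time by a small step and keep the change
# only if measured performance improves. This is a generic hill climber,
# not the paper's algorithm, and measure_throughput() is a hypothetical
# hook into the live system.

def tune_online(params, bounds, measure_throughput, step=0.1, rounds=20):
    """params: dict name -> value; bounds: dict name -> (lo, hi)."""
    best = measure_throughput(params)
    for _ in range(rounds):
        improved = False
        for name in params:                      # one knob at a time
            lo, hi = bounds[name]
            for delta in (+step, -step):
                trial = dict(params)
                trial[name] = min(hi, max(lo, params[name] + delta * (hi - lo)))
                score = measure_throughput(trial)
                if score > best:                 # keep only improvements
                    params, best = trial, score
                    improved = True
                    break
        if not improved:
            break                                # local optimum reached
    return params, best

# Toy stand-in for a DB server whose throughput peaks at buffer=0.7, io=0.3.
def fake_db(p):
    return -((p["buffer_pool"] - 0.7) ** 2) - (p["io_cleaners"] - 0.3) ** 2

cfg, perf = tune_online({"buffer_pool": 0.5, "io_cleaners": 0.5},
                        {"buffer_pool": (0, 1), "io_cleaners": (0, 1)},
                        fake_db)
print(cfg, perf)
```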


IEEE International Symposium on Fault-Tolerant Computing | 1998

G4: a fault-tolerant CMOS mainframe

Lisa Spainhower; Thomas A. Gregg

G4 is IBM's fourth-generation CMOS microprocessor-based S/390 mainframe, but the first whose fault tolerance equals or surpasses that of its predecessor ECL mainframes. CMOS technology provides much greater density and integration, assuring superior fault-avoidance characteristics. The reduced power of CMOS makes bulk power redundancy and battery backup practical. However, the high density and circuit properties of CMOS pose new challenges for detection, recovery, and online repair. G4 implements an innovative design for a high-performance, fault-tolerant, single-chip microprocessor. Microprocessor sparing is used as a concurrent repair mechanism. Increased memory density requires a new (76,64) S4EC/DED error-correcting code so that all single-chip failures are correctable. As many as four I/O interfaces are packaged on an individual card, requiring both configuration management and automated maintenance procedures to assure that all devices maintain connectivity during online repair.
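
Reproducing a (76,64) S4EC/DED symbol code is beyond a short example, but the SEC/DED principle it generalizes fits in a few lines. The sketch below is a textbook Hamming(12,8) code plus an overall parity bit, offered purely as a bit-level analogue; the G4 code works over 4-bit symbols so that an entire failed memory chip is correctable.

```python
# Bit-level SEC/DED sketch: a textbook Hamming(12,8) code plus an overall
# parity bit, giving single-error correction and double-error detection
# over 8 data bits. Purely an analogue of the principle; it is not the
# G4 memory code.

DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]

def secded_encode(data_bits):
    """Encode 8 data bits into a 13-bit SEC/DED codeword."""
    assert len(data_bits) == 8
    code = [0] * 13                       # code[0] is the overall parity
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in (1, 2, 4, 8):                # Hamming parity at powers of two:
        for i in range(1, 13):            # each covers the positions whose
            if i != p and (i & p):        # index has that bit set
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2           # overall parity of the 12 bits
    return code

def secded_decode(code):
    """Return (data_bits, status)."""
    syndrome = 0
    for i in range(1, 13):                # syndrome = XOR of indices of all
        if code[i]:                       # set bits; for one flipped bit it
            syndrome ^= i                 # names the flipped position
    overall = sum(code) % 2               # 0 for a clean or doubly-bad word
    status = "ok"
    if syndrome and overall:              # one flipped bit: repair it
        if syndrome >= len(code):         # >=3 errors: beyond this code
            return None, "uncorrectable"
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome and not overall:        # two flipped bits: detect only
        return None, "double-error-detected"
    elif overall:                         # the overall parity bit itself
        status = "corrected"              # flipped; data is untouched
    return [code[i] for i in DATA_POSITIONS], status

# One flipped bit comes back corrected; two flipped bits are detected.
word = secded_encode([1, 0, 1, 1, 0, 0, 1, 0])
word[6] ^= 1
print(secded_decode(word))    # ([1, 0, 1, 1, 0, 0, 1, 0], 'corrected')
```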


IEEE International Symposium on Fault-Tolerant Computing | 1992

Design for fault-tolerance in system ES/9000 Model 900

Lisa Spainhower; Jack Isenberg; Ram Chillarege; Joseph Berding

The authors present the design for fault tolerance in the IBM ES/9000 Model 900 high-end commercial processor. The design combines circuit-level concurrent error detection, fault identification, and reconfiguration with system-level techniques when multiple functional resources are available. It provides true graceful degradation during central-processor or channel reconfiguration and repair. The authors discuss the design point for this processor and the trade-offs involved; show the error detection and online repair process of a central processor, with the work recovered on an alternate central processor transparently to the application; describe dynamic path selection and the hot-pluggable channels; and illustrate the fault-tolerance techniques used in the level 1 cache and the central store.
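
The dynamic path selection mentioned above can be pictured as fencing a failed channel path and retrying the I/O on an alternate, so that a hot-pluggable channel can be repaired without losing device connectivity. The sketch below is a minimal hypothetical illustration, not the ES/9000 implementation.

```python
# Hypothetical sketch of dynamic path selection: a device is reachable
# over several channel paths; when one path errors out, it is fenced and
# the I/O retried on an alternate, so a channel can be pulled for repair
# without losing connectivity. Illustrative only.

class Device:
    def __init__(self, name, paths):
        self.name = name
        self.online = set(paths)             # channel paths believed healthy
        self.fenced = set()

    def submit_io(self, do_io):
        """do_io(path) performs the transfer, raising IOError on failure."""
        for path in sorted(self.online):
            try:
                return do_io(path)
            except IOError:
                self.online.discard(path)    # fence the bad path ...
                self.fenced.add(path)        # ... and try an alternate
        raise IOError(f"{self.name}: no usable path remains")

def do_io(path):
    if path == 1:
        raise IOError("channel check")       # simulate a failed channel
    return f"ok via channel {path}"

# Usage: channel 1 has failed (or was pulled); the I/O still completes.
disk = Device("disk0", paths=[1, 2, 3])
print(disk.submit_io(do_io))                 # -> "ok via channel 2"
```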


Operating Systems Review | 2008

Vigilant: out-of-band detection of failures in virtual machines

Dan Pelleg; Muli Ben-Yehuda; Rick Harper; Lisa Spainhower; Tokunbo O. S. Adeshiyan

What do our computer systems do all day? How do we make sure they continue doing it when failures occur? Traditional approaches to answering these questions often involve in-band monitoring agents. However, in-band agents suffer from several drawbacks: they need to be written or customized for every workload (operating system, and possibly also application); they are potential security liabilities; and they are themselves affected by adverse conditions in the monitored systems. Virtualization technology makes it possible to encapsulate an entire operating system or application instance within a virtual object that can then be easily monitored and manipulated without any knowledge of the contents or behavior of that object. This can be done out-of-band, using general-purpose agents that do not reside inside the object and hence are not affected by its behavior. This paper describes Vigilant, a novel way of monitoring virtual machines for problems. Vigilant requires no specialized agents inside the virtual objects it monitors. Instead, it uses the hypervisor to directly monitor the resource requests and utilization of an object, and machine-learning methods to analyze the readings. Our experimental results show that problems can be detected out-of-band with high accuracy: Vigilant identifies faults in the guest OS while avoiding the many pitfalls associated with in-band monitoring.
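
As a rough illustration of the out-of-band approach, the sketch below learns a baseline over per-guest resource readings sampled at the hypervisor and flags readings that deviate sharply. The paper applies machine-learning methods; the simple per-feature z-score here is a self-contained stand-in, and all names are hypothetical.

```python
# Hedged sketch of out-of-band detection: the hypervisor samples each
# guest's resource vector (CPU, I/O, network) without any in-guest agent,
# a baseline is learned from healthy behavior, and readings far from the
# baseline are flagged. A simple per-feature z-score stands in for the
# paper's machine-learning classifiers.

import numpy as np

class OutOfBandDetector:
    def fit(self, healthy):
        """healthy: (samples, features) readings from a known-good guest."""
        self.mean = healthy.mean(axis=0)
        self.std = healthy.std(axis=0) + 1e-9    # avoid divide-by-zero
        return self

    def is_anomalous(self, reading, threshold=4.0):
        """Flag a reading whose worst feature deviates beyond threshold."""
        z = np.abs((reading - self.mean) / self.std)
        return z.max() > threshold

# Hypothetical usage: baseline readings of [cpu, disk_iops, net_kbps] from
# the hypervisor, then a hung-guest reading trips the detector.
rng = np.random.default_rng(1)
baseline = rng.normal(loc=[0.30, 120.0, 45.0], scale=[0.05, 10.0, 5.0],
                      size=(500, 3))
detector = OutOfBandDetector().fit(baseline)
print(detector.is_anomalous(np.array([0.99, 0.0, 0.0])))     # True
print(detector.is_anomalous(np.array([0.28, 115.0, 47.0])))  # False
```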


Availability, Reliability and Security | 2008

A Case for High Availability in a Virtualized Environment (HAVEN)

Erin M. Farr; Richard E. Harper; Lisa Spainhower; Jimi Xenidis

The cost and operational complexity of traditional high-availability solutions have limited their widespread adoption. Virtualization allows availability properties to be associated with the system architecture rather than depending on the intrinsic reliability of individual components. This paper introduces an extensible grammar that classifies the states and transitions of virtual machine images. From this grammar, recovery and high-availability rules can be created that define how virtualization enables simplified fault tolerance, making HAVENs accessible to the mainstream user.
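
The grammar-plus-rules idea can be pictured as a table-driven state machine: states and legal transitions form the grammar, and recovery rules map observed (state, event) pairs to actions. The sketch below invents its own states and rules purely for illustration; it is not the paper's grammar.

```python
# Hypothetical illustration of the idea above: virtual machine images are
# classified by state, legal transitions form a small grammar, and recovery
# rules map a bad (state, event) pair to an action. State names and rules
# are invented for the sketch, not taken from the paper.

LEGAL = {                                  # the "grammar": state -> states
    "defined":  {"running"},
    "running":  {"paused", "stopped", "crashed"},
    "paused":   {"running", "stopped"},
    "stopped":  {"defined"},
    "crashed":  set(),                     # only recovery rules leave here
}

RECOVERY_RULES = [                         # (state, event) -> action
    (("running", "heartbeat_lost"), "restart_on_same_host"),
    (("crashed", "host_up"),        "restart_on_same_host"),
    (("crashed", "host_down"),      "restart_on_alternate_host"),
]

def recover(state, event):
    """Pick the first recovery rule matching the observed (state, event)."""
    for (s, e), action in RECOVERY_RULES:
        if (s, e) == (state, event):
            return action
    return "escalate_to_operator"          # no rule: a human decides

def transition(state, new_state):
    if new_state not in LEGAL[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# A VM crashes on a dead host: the rules choose failover, not a page-out.
print(recover("crashed", "host_down"))     # restart_on_alternate_host
```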


IBM Journal of Research and Development | 2009

Using virtualization for high availability and disaster recovery

Tokunbo O. S. Adeshiyan; C. R. Attanasio; Erin M. Farr; Richard E. Harper; Dan Pelleg; Charles O. Schulz; Lisa Spainhower; Paula Ta-Shma; Lorrie A. Tomek

Traditional high-availability and disaster recovery solutions require proprietary hardware, complex configurations, application-specific logic, highly skilled personnel, and a rigorous and lengthy testing process. The resulting high costs have limited their adoption to environments with the most critical applications. However, high availability and disaster recovery are becoming increasingly important in many environments that cannot bear the complexity and the expense involved. In this paper, we show that virtualization can be used to develop solutions that meet this market demand. We describe the recently released Virtual Availability Manager (VAM) product offering, which provides simplified availability solutions using Xen-based virtualization and which is available as part of the IBM Systems Director product. We present key design principles of VAM, explain its architecture and current capabilities, and describe the way it is being extended to enable recovery in case of disaster.


IEEE Transactions on Network and Service Management | 2004

Generic On-Line Discovery of Quantitative Models

Alexander Keller; Yixin Diao; Frank Eskesen; Steven E. Froehlich; Joseph L. Hellerstein; Maheswaran Surendra; Lisa Spainhower

Quantitative models are needed for a variety of management tasks, including identification of critical variables to use for health monitoring, anticipating service-level violations by using predictive models, and ongoing optimization of configurations. Unfortunately, constructing quantitative models requires specialized skills that are in short supply. Even worse, rapid changes in provider configurations and the evolution of business demands mean that quantitative models must be updated on an ongoing basis. This paper describes an architecture and algorithms for online discovery of quantitative models without prior knowledge of the managed elements. The architecture makes use of an element schema that describes managed elements using the Common Information Model (CIM). Algorithms are presented for selecting a subset of the element metrics to use as explanatory variables in a quantitative model and for constructing the quantitative model itself. We further describe a prototype system based on this architecture that incorporates these algorithms. We apply the prototype to online estimation of response times for DB2 Universal Database under a TPC-W workload. Of the approximately 500 metrics available from the DB2 performance monitor, our system chooses three to construct a model that explains 72 percent of the variability of response time.


International Symposium on Microarchitecture | 1994

IBM's ES/9000 Model 982's fault-tolerant design for consolidation

Lisa Spainhower; Thomas A. Gregg; Ram Chillarege

Consolidated workloads running around the clock mean that today's large, general-purpose computers must meet high availability demands. The authors argue that the Model 982 meets these demands by combining enhanced circuit-level error detection and failure-isolation techniques with system-level techniques that exploit inherent redundancy.
