Bianca Schroeder
University of Toronto
Publications
Featured research published by Bianca Schroeder.
Dependable Systems and Networks | 2006
Bianca Schroeder; Garth A. Gibson
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
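The distributional findings above are straightforward to reproduce on one's own failure logs. The sketch below is illustrative only; the synthetic data and parameters are assumptions, not the paper's. It fits a Weibull to times between failures and a lognormal to repair times with SciPy; a Weibull shape parameter below 1 corresponds to the decreasing hazard rate the paper reports.

```python
# A minimal sketch, assuming synthetic stand-in data: fit the two
# distributions named in the abstract with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for field data, in hours.
time_between_failures = rng.weibull(0.7, size=1000) * 500.0
repair_times = rng.lognormal(mean=1.0, sigma=1.2, size=1000)

# Two-parameter Weibull fit (location pinned at 0, as usual for lifetimes).
shape, _, scale = stats.weibull_min.fit(time_between_failures, floc=0)
print(f"Weibull shape k = {shape:.2f}")  # k < 1 => decreasing hazard rate

# Lognormal fit for repair times (scale equals the median).
sigma, _, med = stats.lognorm.fit(repair_times, floc=0)
print(f"lognormal sigma = {sigma:.2f}, median repair = {med:.1f} h")
```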
Journal of Physics: Conference Series | 2007
Bianca Schroeder; Garth A. Gibson
With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer's resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing. The need for a public repository of detailed failure and interruption records is particularly pressing, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.

One of the hardest problems in future high-performance computing (HPC) installations will be avoiding, coping with, and recovering from failures. The coming PetaFLOPS clusters will require the simultaneous use and control of hundreds of thousands or even millions of processing, storage, and networking elements. With this large number of elements involved, element failure will be frequent, making it increasingly difficult for applications to make forward progress. The success of petascale computing will depend on the ability to provide reliability and availability at scale. While researchers and practitioners have spent decades investigating approaches for avoiding, coping with, and recovering from failures, progress in this area has been hindered by the lack of publicly available failure data from real large-scale systems. We have collected and analyzed a number of large data sets on failures in high-performance computing (HPC) systems. These data sets cover node outages in HPC clusters, as well as failures in storage systems. Using these data sets and the large-scale trends and assumptions commonly applied to future computing systems design, we project onto the potential machines of the next decade our expectations for failure rates, mean time to application interruption, and the consequent application utilization of the full machine, based on checkpoint/restart fault tolerance and the balanced system design method of matching storage bandwidth and memory size to aggregate computing power [14]. Not surprisingly, if the growth in aggregate computing power continues to outstrip the growth in per-chip computing power, more and more of the computer's resources may be spent on conventional fault recovery methods. We envision applications being denied as much as half of the system's resources in five years, for example. We then discuss alternative actions that may compensate for this unacceptable loss of application utilization.
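The utilization projection described above can be approximated with a standard checkpoint/restart model. The sketch below uses Young's approximation for the checkpoint interval rather than the paper's balanced-system method, and the per-node MTBF and checkpoint cost are illustrative assumptions, not figures from the paper.

```python
# A back-of-the-envelope sketch of utilization under checkpoint/restart,
# using Young's approximation for the checkpoint interval. The per-node
# MTBF (25,000 h) and checkpoint cost (0.1 h) are illustrative assumptions.
import math

def utilization(mtti_h: float, ckpt_h: float) -> float:
    """Fraction of machine time spent on useful work."""
    tau = math.sqrt(2.0 * ckpt_h * mtti_h)          # Young's interval
    overhead = ckpt_h / tau + tau / (2.0 * mtti_h)  # checkpoints + lost work
    return max(0.0, 1.0 - overhead)

for nodes in (1_000, 10_000, 100_000):
    mtti = 25_000.0 / nodes  # system MTTI shrinks with node count
    print(f"{nodes:>7} nodes: MTTI {mtti:6.2f} h -> "
          f"utilization {utilization(mtti, 0.1):.0%}")
```

Under these assumptions, utilization falls from roughly 90% at a thousand nodes to about 10% at a hundred thousand, which is the qualitative trend the paper warns about.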
ACM Transactions on Computer Systems | 2003
Mor Harchol-Balter; Bianca Schroeder; Nikhil Bansal; Mukesh Agrawal
Is it possible to reduce the expected response time of every request at a web server, simply by changing the order in which we schedule the requests? That is the question we ask in this paper. This paper proposes a method for improving the performance of web servers servicing static HTTP requests. The idea is to give preference to requests for small files or requests with short remaining file size, in accordance with the SRPT (Shortest Remaining Processing Time) scheduling policy. The implementation is at the kernel level and involves controlling the order in which socket buffers are drained into the network. Experiments are executed both in a LAN and a WAN environment. We use the Linux operating system and the Apache and Flash web servers. Results indicate that SRPT-based scheduling of connections yields significant reductions in delay at the web server. These result in a substantial reduction in mean response time and mean slowdown for both the LAN and WAN environments. Significantly, and counter to intuition, the requests for large files are only negligibly penalized or not at all penalized as a result of SRPT-based scheduling.
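The core scheduling rule is simple enough to sketch. The toy Python below is a hypothetical illustration of SRPT ordering, not the authors' kernel-level implementation (which reorders the draining of socket buffers): it always transmits a chunk from the connection with the fewest bytes remaining.

```python
# A toy SRPT scheduler (the SRPTScheduler class is hypothetical): the
# connection with the fewest bytes left to send is always served first.
import heapq

class SRPTScheduler:
    def __init__(self):
        self._heap = []  # entries are (remaining_bytes, conn_id)

    def add(self, conn_id: int, size_bytes: int) -> None:
        heapq.heappush(self._heap, (size_bytes, conn_id))

    def send_next_chunk(self, chunk: int = 1460):
        """Transmit one chunk from the shortest job; return its conn_id."""
        if not self._heap:
            return None
        remaining, conn_id = heapq.heappop(self._heap)
        remaining -= chunk
        if remaining > 0:  # not done yet: requeue with the smaller remainder
            heapq.heappush(self._heap, (remaining, conn_id))
        return conn_id

sched = SRPTScheduler()
sched.add(1, 1_000_000)  # request for a large file
sched.add(2, 3_000)      # a small request: served to completion first
```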
IEEE Transactions on Dependable and Secure Computing | 2010
Bianca Schroeder; Garth A. Gibson
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
Architectural Support for Programming Languages and Operating Systems | 2012
Andy A. Hwang; Ioan A. Stefanovici; Bianca Schroeder
Main memory is one of the leading hardware causes of machine crashes in today's datacenters. Designing, evaluating and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of errors in DRAM in the field. While there have recently been a few first studies on DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions on DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, or the analysis of typical patterns of hard errors. In this paper, we study data on DRAM errors collected on a diverse range of production systems, in total covering nearly 300 terabyte-years of main memory. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we provide a detailed analytical study of their characteristics. As a second contribution, the paper uses the results from the measurement study to identify a number of promising directions for designing more resilient systems, and evaluates the potential of different protection mechanisms in the light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems, while sacrificing only a negligible fraction of the total DRAM in the system.
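A retire-on-first-error policy, one simple policy of the kind the paper evaluates, can be sketched in a few lines. The simulation below is illustrative; the log format and policy details are assumptions, not the paper's methodology.

```python
# A minimal sketch of "simple page retirement": once a physical page has
# shown an error, stop using it, on the theory that many field errors are
# hard (repeating) errors. Error-log format here is an assumption.
def simulate_retirement(error_log: list[tuple[int, int]]) -> tuple[int, int]:
    """error_log: (timestamp, physical_page) per observed error.
    Returns (errors_masked, pages_retired)."""
    retired: set[int] = set()
    masked = 0
    for _ts, page in sorted(error_log):
        if page in retired:
            masked += 1        # a repeat error on a retired page is masked
        else:
            retired.add(page)  # retire-on-first-error policy
    return masked, len(retired)

# Hard errors repeat on the same page, so most of them get masked:
log = [(1, 7), (2, 7), (3, 7), (4, 42), (5, 99), (6, 42)]
print(simulate_retirement(log))  # -> (3, 3): 3 of 6 errors masked, 3 pages lost
```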
International Test Conference | 2003
Antonio Nucci; Bianca Schroeder; Supratik Bhattacharyya; Nina Taft; Christophe Diot
Intra-domain routing in IP backbone networks relies on link-state protocols such as IS-IS or OSPF. These protocols associate a weight (or cost) with each network link, and compute traffic routes based on these weights. However, proposed methods for selecting link weights largely ignore the issue of failures which arise as part of everyday network operations (maintenance, accidental, etc.). Changing link weights during a short-lived failure is impractical. However, such failures are frequent enough to impact network performance. We propose a Tabu-search heuristic for choosing link weights which allow a network to function almost optimally during short link failures. The heuristic takes into account possible link failure scenarios when choosing weights, thereby mitigating the effect of such failures. We find that the weights chosen by the heuristic can reduce link overload during transient link failures by as much as 40% at the cost of a small performance degradation in the absence of failures (10%).
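A generic Tabu-search skeleton for this kind of weight selection is sketched below. It is illustrative only: the cost function is a black box, and the paper's heuristic additionally evaluates candidate weight settings under sampled link-failure scenarios, which would live inside cost().

```python
# A minimal, generic Tabu-search skeleton for link-weight selection. The
# move set, tabu-list length, and weight range are placeholder assumptions.
import random

def tabu_search(n_links: int, cost, iters: int = 500, tabu_len: int = 30,
                w_max: int = 20, seed: int = 0):
    """Search integer link weights in [1, w_max] minimizing cost(weights).

    For the paper's setting, cost() would route the traffic demands over
    the weighted graph, under each considered failure scenario, and return
    the resulting worst-case (or expected) link overload.
    """
    rng = random.Random(seed)
    current = [rng.randint(1, w_max) for _ in range(n_links)]
    best, best_cost = list(current), cost(current)
    tabu: list[tuple[int, int]] = []        # forbidden (link, weight) moves

    for _ in range(iters):
        # Neighborhood: re-weight one randomly chosen link.
        candidates = []
        for _ in range(20):                 # sample 20 candidate moves
            link, w = rng.randrange(n_links), rng.randint(1, w_max)
            if (link, w) not in tabu and w != current[link]:
                trial = list(current)
                trial[link] = w
                candidates.append((cost(trial), link, w))
        if not candidates:
            continue
        c, link, w = min(candidates)        # best non-tabu neighbor (may be worse)
        current[link] = w
        tabu.append((link, w))
        if len(tabu) > tabu_len:
            tabu.pop(0)                     # fixed-length tabu list
        if c < best_cost:
            best, best_cost = list(current), c
    return best, best_cost
```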
Measurement and Modeling of Computer Systems | 2012
Nosayba El-Sayed; Ioan A. Stefanovici; George Amvrosiadis; Andy A. Hwang; Bianca Schroeder
The energy consumed by data centers is starting to make up a significant fraction of the world's energy consumption and carbon emissions. A large fraction of the consumed energy is spent on data center cooling, which has motivated a large body of work on temperature management in data centers. Interestingly, a key aspect of temperature management has not been well understood: controlling the setpoint temperature at which to run a data center's cooling system. Most data centers set their thermostat based on (conservative) suggestions by manufacturers, as there is limited understanding of how higher temperatures will affect the system. At the same time, studies suggest that increasing the temperature setpoint by just one degree could save 2-5% of the energy consumption. This paper provides a multi-faceted study of temperature management in data centers. We use a large collection of field data from different production environments to study the impact of temperature on hardware reliability, including the reliability of the storage subsystem, the memory subsystem, and server reliability as a whole. We also use an experimental testbed based on a thermal chamber and a large array of benchmarks to study two other potential issues with higher data center temperatures: the effect on server performance and power. Based on our findings, we make recommendations for temperature management in data centers that create the potential for saving energy, while limiting negative effects on system reliability and performance.
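As a rough illustration of the cited setpoint figure, the snippet below compounds a per-degree saving over several degrees. The compounding assumption is ours; the 2-5% range is simply the figure quoted in the abstract.

```python
# Illustrative arithmetic only: compound a per-degree cooling-energy saving
# (the abstract's quoted 2-5% per degree) over a few degrees of setpoint.
for per_degree in (0.02, 0.05):
    for degrees in (1, 3, 5):
        saved = 1.0 - (1.0 - per_degree) ** degrees
        print(f"+{degrees} deg C at {per_degree:.0%}/deg -> {saved:.1%} saved")
```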
International Conference on Data Engineering | 2006
Bianca Schroeder; Mor Harchol-Balter; Arun Iyengar; Erich M. Nahum; Adam Wierman
Scheduling/prioritization of DBMS transactions is important for many applications that rely on database backends. A convenient way to achieve scheduling is to limit the number of transactions within the database, maintaining most of the transactions in an external queue, which can be ordered as desired by the application. While external scheduling has many advantages in that it doesn't require changes to internal resources, it is also difficult to get right in that its performance depends critically on the particular multiprogramming limit used (the MPL), i.e., the number of transactions allowed into the database. If the MPL is too low, throughput will suffer, since not all DBMS resources will be utilized. On the other hand, if the MPL is too high, there is insufficient control on scheduling. The question of how to adjust the MPL to achieve both goals simultaneously is an open problem, not just for databases but in system design in general. Herein we study this problem in the context of transactional workloads, both via extensive experimentation and queueing-theoretic analysis. We find that the two most critical factors in adjusting the MPL are the number of resources that the workload utilizes and the variability of the transactions' service demands. We develop a feedback-based controller, augmented by queueing-theoretic models, for automatically adjusting the MPL. Finally, we apply our methods to the specific problem of external prioritization of transactions. We find that external prioritization can be nearly as effective as internal prioritization, without any negative consequences, when the MPL is set appropriately.
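A minimal version of such a feedback controller can be sketched as a hill-climbing rule on measured throughput. The form and thresholds below are illustrative assumptions; the paper's controller is additionally informed by queueing-theoretic models.

```python
# A minimal sketch of a feedback loop for the external-scheduling MPL knob:
# raise the MPL while throughput keeps improving, lower it once added
# concurrency stops paying off (to retain scheduling control).
def adjust_mpl(mpl: int, tput_now: float, tput_prev: float,
               min_mpl: int = 2, gain_threshold: float = 0.02) -> int:
    if tput_prev <= 0:
        return mpl + 1                   # bootstrap: probe upward
    gain = (tput_now - tput_prev) / tput_prev
    if gain > gain_threshold:
        return mpl + 1                   # more concurrency still helps
    if gain < -gain_threshold:
        return max(min_mpl, mpl - 1)     # overshot: back off
    return mpl                           # throughput plateau: hold
```

Called once per measurement interval, this converges to the smallest MPL that still saturates the bottleneck resource, which is the balance point the abstract describes.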
ACM Transactions on Storage | 2010
Bianca Schroeder; Sotirios Damouras; Phillipa Gill
Latent sector errors (LSEs) refer to the situation where particular sectors on a drive become inaccessible. LSEs are a critical factor in data reliability, since a single LSE can lead to data loss when encountered during RAID reconstruction after a disk failure or in systems without redundancy. LSEs happen at a significant rate in the field [Bairavasundaram et al. 2007], and are expected to grow more frequent with new drive technologies and increasing drive capacities. While two approaches, data scrubbing and intra-disk redundancy, have been proposed to reduce data loss due to LSEs, neither of these approaches has been evaluated on real field data. This article makes two contributions. We provide an extended statistical analysis of latent sector errors in the field, specifically from the viewpoint of how to protect against LSEs. In addition to providing interesting insights into LSEs, we hope the results (including parameters for models we fit to the data) will help researchers and practitioners without access to data in driving their simulations or analysis of LSEs. Our second contribution is an evaluation of five different scrubbing policies and five different intra-disk redundancy schemes and their potential in protecting against LSEs. Our study includes schemes and policies that have been suggested before, but have never been evaluated on field data, as well as new policies that we propose based on our analysis of LSEs in the field.
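For intuition on the simplest of these policies, the sketch below computes the detection delay of an LSE under a plain sequential scrubber; the sector count and scan rate are illustrative assumptions, not parameters from the article.

```python
# A minimal sketch of detection delay under a sequential scrubber that
# reads sectors 0..n-1 in order, forever. Sector count and scan rate are
# illustrative assumptions (here: a 168-hour, i.e. weekly, full pass).
def detection_delay(error_sector: int, error_time_h: float,
                    n_sectors: int, sectors_per_h: float) -> float:
    """Hours between an LSE appearing and the scrubber reading its sector."""
    period = n_sectors / sectors_per_h          # duration of one full pass
    first_visit = error_sector / sectors_per_h  # first read of this sector
    k = 0
    while first_visit + k * period < error_time_h:
        k += 1                                  # skip passes before the error
    return first_visit + k * period - error_time_h

# Error in the middle of the disk, 100 h into a weekly scrub cycle:
print(detection_delay(500_000, 100.0, 1_000_000, 1_000_000 / 168))  # ~152 h
```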
ACM Transactions on Storage | 2007
Bianca Schroeder; Garth A. Gibson
Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million. This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007] and presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. More than 110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2--4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. In other words, the replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
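The "0.88%" nominal rate quoted above is just datasheet arithmetic: under a constant failure rate, the annual failure rate is the hours in a year divided by the MTTF, as the snippet below shows.

```python
# Datasheet arithmetic behind the nominal AFR quoted in the abstract:
# a constant-rate MTTF of 1,000,000-1,500,000 hours implies at most
# 8760 / 1,000,000 ~ 0.88% failures per year.
for mttf_h in (1_000_000, 1_500_000):
    afr = 8760 / mttf_h  # hours in a year divided by MTTF
    print(f"MTTF {mttf_h:,} h -> nominal AFR {afr:.2%}")
```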