Ioan A. Stefanovici
University of Toronto
Publications
Featured research published by Ioan A. Stefanovici.
Architectural Support for Programming Languages and Operating Systems | 2012
Andy A. Hwang; Ioan A. Stefanovici; Bianca Schroeder
Main memory is one of the leading hardware causes for machine crashes in today's datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of errors in DRAM in the field. While there have recently been a few first studies on DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions on DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, or the analysis of typical patterns of hard errors. In this paper, we study data on DRAM errors collected on a diverse range of production systems, in total covering nearly 300 terabyte-years of main memory. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we provide a detailed analytical study of their characteristics. As a second contribution, the paper uses the results from the measurement study to identify a number of promising directions for designing more resilient systems and evaluates the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.
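The page-retirement direction mentioned in the abstract lends itself to a short illustration. Below is a minimal sketch, not the policy evaluated in the paper, of threshold-based retirement: a physical page is taken out of use once it accumulates a chosen number of corrected errors, on the assumption that repeated errors on the same page indicate a hard fault. The threshold and page size are illustrative assumptions.

```python
# Minimal sketch of a threshold-based page retirement policy.
# Parameters are hypothetical; this is not the paper's implementation.

PAGE_SIZE_BYTES = 4096   # assumed page size
RETIRE_THRESHOLD = 2     # retire after this many errors on one page

class PageRetirementPolicy:
    def __init__(self, threshold=RETIRE_THRESHOLD):
        self.threshold = threshold
        self.error_counts = {}   # physical page frame number -> error count
        self.retired = set()     # retired page frame numbers

    def record_error(self, page_frame):
        """Record a corrected DRAM error reported for a physical page."""
        if page_frame in self.retired:
            return
        self.error_counts[page_frame] = self.error_counts.get(page_frame, 0) + 1
        # Repeat errors on the same page suggest a hard fault: retire the page.
        if self.error_counts[page_frame] >= self.threshold:
            self.retired.add(page_frame)

    def memory_sacrificed(self):
        """Bytes of DRAM taken out of use by retirement."""
        return len(self.retired) * PAGE_SIZE_BYTES

# Example: two errors on the same page trigger retirement.
policy = PageRetirementPolicy()
policy.record_error(0x1A2B)
policy.record_error(0x1A2B)
print(policy.retired, policy.memory_sacrificed())
```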
Measurement and Modeling of Computer Systems | 2012
Nosayba El-Sayed; Ioan A. Stefanovici; George Amvrosiadis; Andy A. Hwang; Bianca Schroeder
The energy consumed by data centers is starting to make up a significant fraction of the world's energy consumption and carbon emissions. A large fraction of the consumed energy is spent on data center cooling, which has motivated a large body of work on temperature management in data centers. Interestingly, a key aspect of temperature management has not been well understood: controlling the setpoint temperature at which to run a data center's cooling system. Most data centers set their thermostat based on (conservative) suggestions by manufacturers, as there is limited understanding of how higher temperatures will affect the system. At the same time, studies suggest that increasing the temperature setpoint by just one degree could save 2-5% of the energy consumption. This paper provides a multi-faceted study of temperature management in data centers. We use a large collection of field data from different production environments to study the impact of temperature on hardware reliability, including the reliability of the storage subsystem, the memory subsystem, and server reliability as a whole. We also use an experimental testbed based on a thermal chamber and a large array of benchmarks to study two other potential issues with higher data center temperatures: the effect on server performance and power. Based on our findings, we make recommendations for temperature management in data centers that create the potential for saving energy while limiting negative effects on system reliability and performance.
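As a rough back-of-the-envelope illustration of the setpoint claim cited above, the sketch below estimates what the 2-5% per-degree range could mean for a hypothetical facility; the facility size and electricity price are assumptions, not figures from the paper.

```python
# Back-of-the-envelope estimate of savings from raising the cooling setpoint,
# using the 2-5% per-degree range cited above.
# Facility size and electricity price are illustrative assumptions.

FACILITY_POWER_MW = 1.0   # assumed total facility draw
PRICE_PER_KWH = 0.10      # assumed electricity price (USD)
HOURS_PER_YEAR = 24 * 365

annual_kwh = FACILITY_POWER_MW * 1000 * HOURS_PER_YEAR
annual_cost = annual_kwh * PRICE_PER_KWH

for savings_fraction in (0.02, 0.05):   # 2% and 5% per degree of setpoint increase
    saved = annual_cost * savings_fraction
    print(f"{savings_fraction:.0%} savings: ~${saved:,.0f} per year")
```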
Symposium on Cloud Computing | 2015
Ioan A. Stefanovici; Eno Thereska; Greg O'Shea; Bianca Schroeder; Hitesh Ballani; Thomas Karagiannis; Antony I. T. Rowstron; Tom Talpey
In data centers, caches work both to provide low IO latencies and to reduce the load on the back-end network and storage. But they are not designed for multi-tenancy; system-level caches today cannot be configured to match tenant or provider objectives. Exacerbating the problem is the increasing number of un-coordinated caches on the IO data plane. The lack of global visibility on the control plane to coordinate this distributed set of caches leads to inefficiencies, increasing cloud provider cost. We present Moirai, a tenant- and workload-aware system that allows data center providers to control their distributed caching infrastructure. Moirai can help ease the management of the cache infrastructure and achieve various objectives, such as improving overall resource utilization or providing tenant isolation and QoS guarantees, as we show through several use cases. A key benefit of Moirai is that it is transparent to applications or VMs deployed in data centers. Our prototype runs unmodified OSes and databases, providing immediate benefit to existing applications.
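To make the idea of a centralized, tenant-aware cache control plane concrete, here is a minimal sketch of how a controller might express per-tenant cache rules and hand them to individual caches on the data plane. The rule format and API below are hypothetical illustrations, not Moirai's actual interface.

```python
# Minimal sketch of centralized, tenant-aware cache policy rules.
# The rule format and controller API are hypothetical, not Moirai's interface.

from dataclasses import dataclass

@dataclass
class CacheRule:
    tenant: str        # tenant the rule applies to
    cache_stage: str   # which cache on the IO path, e.g. "hypervisor" or "storage-server"
    capacity_mb: int   # capacity reserved for this tenant at that stage
    write_policy: str  # e.g. "write-through" or "write-back"

class CacheController:
    """Central control plane: holds rules and exposes them to distributed caches."""
    def __init__(self):
        self.rules = []

    def add_rule(self, rule: CacheRule):
        self.rules.append(rule)

    def rules_for_stage(self, cache_stage: str):
        """Rules that a given cache on the data plane should enforce."""
        return [r for r in self.rules if r.cache_stage == cache_stage]

# Example: isolate a latency-sensitive tenant at the hypervisor cache and
# cap a batch tenant at the storage server.
ctrl = CacheController()
ctrl.add_rule(CacheRule("tenant-A", "hypervisor", capacity_mb=2048, write_policy="write-back"))
ctrl.add_rule(CacheRule("tenant-B", "storage-server", capacity_mb=512, write_policy="write-through"))
print(ctrl.rules_for_stage("hypervisor"))
```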
IEEE Spectrum | 2015
Ioan A. Stefanovici; Andy A. Hwang; Bianca Schroeder
Not long after the first personal computers started entering people's homes, Intel fell victim to a nasty kind of memory error. The company, which had commercialized the very first dynamic random-access memory (DRAM) chip in 1971 with a 1,024-bit device, was continuing to increase data densities. A few years later, Intel's then cutting-edge 16-kilobit DRAM chips were sometimes storing bits differently from the way they were written. Indeed, they were making these mistakes at an alarmingly high rate. The cause was ultimately traced to the ceramic packaging for these DRAM devices. Trace amounts of radioactive material that had gotten into the chip packaging were emitting alpha particles and corrupting the data.
ACM Transactions on Storage | 2017
Ioan A. Stefanovici; Bianca Schroeder; Greg O'Shea; Eno Thereska
In a data center, an IO from an application to distributed storage traverses not only the network but also several software stages with diverse functionality. This set of ordered stages is known as the storage or IO stack. Stages include caches, hypervisors, IO schedulers, file systems, and device drivers. Indeed, in a typical data center, the number of these stages is often larger than the number of network hops to the destination. Yet, while packet routing is fundamental to networks, no notion of IO routing exists on the storage stack. The path of an IO to an endpoint is predetermined and hard-coded. This forces IO with different needs (e.g., requiring different caching or replica selection) to flow through a one-size-fits-all IO stack structure, resulting in an ossified IO stack. This article proposes sRoute, an architecture that provides a routing abstraction for the storage stack. sRoute comprises a centralized control plane and “sSwitches” on the data plane. The control plane sets the forwarding rules in each sSwitch to route IO requests at runtime based on application-specific policies. A key strength of our architecture is that it works with unmodified applications and Virtual Machines (VMs). This article shows significant benefits of customized IO routing to data center tenants: for example, a factor of 10 improvement in tail IO latency, more than 60% better throughput for a customized replication protocol, a factor of 2 improvement in throughput for customized caching, and enabling live performance debugging in a running system.
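To make the routing abstraction concrete, below is a minimal sketch of match-action forwarding for IO requests in the spirit of the sSwitch rules described above. The rule fields, endpoint names, and example policies are hypothetical stand-ins, not sRoute's actual forwarding state.

```python
# Minimal sketch of match-action IO forwarding, in the spirit of the sSwitch
# rules described above. Rule fields and endpoints are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class IORequest:
    tenant: str
    path: str   # file or object being accessed
    op: str     # "read" or "write"

@dataclass
class ForwardingRule:
    tenant: Optional[str]   # None matches any tenant
    path_prefix: str        # match IOs whose path starts with this prefix
    op: Optional[str]       # None matches both reads and writes
    endpoint: str           # next stage: e.g. a cache, a specific replica, a logger

    def matches(self, io: IORequest) -> bool:
        return ((self.tenant is None or self.tenant == io.tenant)
                and io.path.startswith(self.path_prefix)
                and (self.op is None or self.op == io.op))

class SSwitch:
    """Data-plane element: forwards each IO to the endpoint of the first matching rule."""
    def __init__(self, rules, default_endpoint):
        self.rules = rules
        self.default_endpoint = default_endpoint

    def route(self, io: IORequest) -> str:
        for rule in self.rules:
            if rule.matches(io):
                return rule.endpoint
        return self.default_endpoint

# Example policy: send tenant-A reads of hot data to a nearby cache,
# send all writes to the primary replica, everything else to the backend.
switch = SSwitch(
    rules=[
        ForwardingRule("tenant-A", "/hot/", "read", endpoint="cache-1"),
        ForwardingRule(None, "/", "write", endpoint="replica-primary"),
    ],
    default_endpoint="storage-backend",
)
print(switch.route(IORequest("tenant-A", "/hot/index.db", "read")))  # -> cache-1
```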
File and Storage Technologies | 2016
Ioan A. Stefanovici; Bianca Schroeder; Greg O'Shea; Eno Thereska
USENIX Annual Technical Conference | 2017
Farzaneh Mahdisoltani; Ioan A. Stefanovici; Bianca Schroeder
HotStorage | 2018
Patrick Anderson; Richard Black; Ausra Cerkauskaite; Andromachi Chatzieleftheriou; James Clegg; Christopher Dainty; Raluca Diaconu; Rokas Drevinskas; Austin Donnelly; Alexander L. Gaunt; Andreas Georgiou; Ariel Gomez Diaz; Peter G. Kazansky; David Lara; Sergey Legtchenko; Sebastian Nowozin; Aaron W. Ogus; Douglas Phillips; Antony I. T. Rowstron; Masaaki Sakakura; Ioan A. Stefanovici; Benn Thomsen; Lei Wang; Hugh E. Williams; Mengyang Yang
Archive | 2017
Farzaneh Mahdisoltani; Ioan A. Stefanovici; Bianca Schroeder
Archive | 2016
Ioan A. Stefanovici