Bryan N. Mills | Researchain

Publication

Featured research published by Bryan N. Mills.

International Workshop on Energy Efficient Supercomputing | 2013

Evaluating energy savings for checkpoint/restart

Bryan N. Mills; Ryan E. Grant; Kurt Brian Ferreira; Rolf Riesen

The U.S. Department of Energy has identified resilience and energy consumption as key challenges for future extreme-scale systems. All checkpoint/restart methods require I/O to local or remote storage. Efforts are under way to minimize the amount of data movement and increase scalability. Nevertheless, the energy consumed by fault resilience methods will increase with system size. It is therefore important to understand the performance overhead in conjunction with the energy consumption of each fault resilience method. In this paper we explore throttling CPU power consumption during I/O-intensive checkpoint operations of real applications. We find that total energy savings of 10% are possible with little impact on application time to solution.

2014 International Conference on Computing, Networking and Communications (ICNC) | 2014

Shadow Computing: An energy-aware fault tolerant computing model

Bryan N. Mills; Taieb Znati; Rami G. Melhem

The current response to fault tolerance relies upon either time or hardware redundancy in order to mask faults. Time redundancy implies a re-execution of the failed computation after the failure has been detected; although this can be further optimized by the use of checkpoints, these solutions still impose a significant delay. In many mission-critical systems, hardware redundancy has traditionally been deployed in the form of process replication to provide fault tolerance, avoiding delay and maintaining tight deadlines. Both approaches have drawbacks: re-execution requires additional time, and replication requires additional resources, especially energy. This forces the systems engineer to choose between time and hardware redundancy; cloud computing environments have largely chosen replication because response time is often critical. In this paper we propose a new computational model called shadow computing, which provides goal-based adaptive resilience through the use of dynamic execution. Using this general model, we develop shadow replication, which enables a parameterized tradeoff between time and hardware redundancy to provide fault tolerance. We then build an analytical model to predict the expected energy savings and provide an analysis using that model.

Parallel, Distributed and Network-Based Processing | 2014

Energy Consumption of Resilience Mechanisms in Large Scale Systems

Bryan N. Mills; Taieb Znati; Rami G. Melhem; Kurt Brian Ferreira; Ryan E. Grant

As HPC systems continue to grow to meet the requirements of tomorrow's exascale-class systems, two of the biggest challenges are power consumption and system resilience. On current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods, such as process replication, have been suggested to augment checkpoint/restart. In this paper we address resilience and power together, in contrast to much of the completed work, which addresses them independently. Using an analytical model that accounts for both power consumption and failures, we study the performance of checkpoint and replication-based techniques on current and future systems, and we use power measurements from current systems to validate our findings. Lastly, in an attempt to optimize power consumption for replication, we introduce a new protocol termed shadow replication, which not only reduces energy consumption but also produces faster response times than checkpoint/restart and traditional replication when operating under system power constraints.

Network and Distributed System Security Symposium | 2007