Publication


Featured research published by Scott Levy.


International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Scott Levy; Bryan Topp; Kurt Brian Ferreira; Dorian C. Arnold; Torsten Hoefler; Patrick M. Widener

Fault-tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales—allowing the simulator to run 4x faster and use over 100x less memory.
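
The paper evaluates resilience strategies with a modified large-scale simulator, which is not reproduced here. As a minimal sketch of the kind of question such a tool answers, the C program below applies the standard Young/Daly first-order checkpoint model; it is a textbook approximation, not the paper's method, and the checkpoint cost and MTBF values are purely hypothetical.

```c
#include <math.h>
#include <stdio.h>

/* First-order (Young/Daly) model of checkpoint/restart, NOT the simulator
 * described in the paper: given the cost of writing one checkpoint and the
 * system mean time between failures, estimate a near-optimal checkpoint
 * interval and the fraction of machine time left for useful work. */
static double optimal_interval(double ckpt_cost, double mtbf)
{
    return sqrt(2.0 * ckpt_cost * mtbf);   /* Young's approximation */
}

int main(void)
{
    double ckpt_cost = 300.0;            /* hypothetical: 5 min per checkpoint */
    double mtbf      = 4.0 * 3600.0;     /* hypothetical: 4 h system MTBF      */
    double tau       = optimal_interval(ckpt_cost, mtbf);

    /* Rough efficiency: lose ckpt_cost per interval, plus ~tau/2 of rework
     * per failure, with one failure every mtbf seconds on average. */
    double eff = 1.0 - ckpt_cost / tau - (tau / 2.0) / mtbf;
    printf("checkpoint every %.0f s, estimated efficiency %.1f%%\n",
           tau, 100.0 * eff);
    return 0;
}
```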


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Understanding the effects of communication and coordination on checkpointing at scale

Kurt Brian Ferreira; Patrick M. Widener; Scott Levy; Dorian C. Arnold; Torsten Hoefler

Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
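
The abstract's central observation is that a purely local checkpoint delays every process that later waits on a message from the delayed one. The toy model below is an illustration of that mechanism only, not the paper's simulator or its workloads; the process count, checkpoint cost, and checkpoint frequency are all assumed numbers. Each step ends with a nearest-neighbor exchange, so a 5% local pause propagates to neighbors and accumulates into a larger end-to-end slowdown.

```c
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 64
#define STEPS  1000

/* Toy model (not the paper's simulator): each step, every process computes,
 * occasionally pauses for a local checkpoint, and then cannot leave a
 * nearest-neighbor exchange until both neighbors have also arrived. Delays
 * from local checkpoints therefore propagate through messaging relations. */
int main(void)
{
    double t[NPROCS] = {0.0};          /* local clock of each process        */
    const double work = 1.0;           /* compute time per step              */
    const double ckpt = 0.05;          /* local checkpoint cost (5% of work) */

    for (int s = 0; s < STEPS; s++) {
        double arrive[NPROCS];
        for (int i = 0; i < NPROCS; i++) {
            arrive[i] = t[i] + work;
            if (rand() % 20 == 0)      /* a local checkpoint on ~5% of steps */
                arrive[i] += ckpt;
        }
        for (int i = 0; i < NPROCS; i++) {
            double left  = arrive[(i + NPROCS - 1) % NPROCS];
            double right = arrive[(i + 1) % NPROCS];
            double m = arrive[i];
            if (left  > m) m = left;
            if (right > m) m = right;
            t[i] = m;                  /* wait for both neighbors */
        }
    }

    double tmax = 0.0;
    for (int i = 0; i < NPROCS; i++)
        if (t[i] > tmax) tmax = t[i];
    printf("ideal: %.0f  simulated: %.1f  overhead: %.2f%%\n",
           STEPS * work, tmax, 100.0 * (tmax / (STEPS * work) - 1.0));
    return 0;
}
```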


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Understanding performance interference in next-generation HPC systems

Oscar H. Mondragon; Patrick G. Bridges; Scott Levy; Kurt Brian Ferreira; Patrick M. Widener

Next-generation systems face a wide range of new potential sources of application interference, including resilience actions, system software adaptation, and in situ analytics programs. In this paper, we present a new model for analyzing the performance of bulk-synchronous HPC applications based on the use of extreme value theory. After validating this model against both synthetic and real applications, the paper then uses both simulation and modeling techniques to profile next-generation interference sources and characterize their behavior and performance impact on a selection of HPC benchmarks, mini-applications, and applications. Lastly, this work shows how the model can be used to understand how current interference mitigation techniques in multi-processors work.
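
A hedged, textbook version of the extreme value argument (not necessarily the exact model developed in the paper, and with generic symbols w, X_i, and n rather than quantities from the paper) is that a bulk-synchronous step cannot complete until the slowest process arrives, so per-step interference is governed by the maximum of many independent delays:

```latex
% One bulk-synchronous step across n processes, each delayed by an i.i.d.
% interference term X_i on top of a common work time w:
t_{\mathrm{step}} \;=\; w \;+\; \max_{1 \le i \le n} X_i
% For exponential interference with mean \mu, the expected maximum grows
% with the harmonic number H_n, i.e. roughly logarithmically in n:
\qquad
\mathbb{E}\Big[\max_{1 \le i \le n} X_i\Big] \;=\; \mu\, H_n \;\approx\; \mu \ln n
```

For broad classes of noise, the distribution of this maximum converges to a generalized extreme value (e.g. Gumbel) distribution as n grows, which is why extreme value theory is a natural tool for modeling interference in bulk-synchronous applications.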


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

On noise and the performance benefit of nonblocking collectives

Patrick M. Widener; Scott Levy; Kurt Brian Ferreira; Torsten Hoefler

Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.
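
As a minimal sketch of the mechanism the paper evaluates (the helper function name below is a placeholder; only the MPI calls are real API), a nonblocking collective lets each rank start the operation, continue with work that does not depend on its result, and block only when the result is actually needed, so a delay on one rank need not immediately stall the others:

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for computation that does not depend on the reduction result. */
static void do_independent_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction, but do not wait for it yet. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* Work that does not need 'global'; if another rank is delayed by OS
     * noise or a local checkpoint, this rank keeps making progress. */
    do_independent_work();

    /* Block only when the result is actually required. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("global sum: %f\n", global);
    MPI_Finalize();
    return 0;
}
```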


Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale | 2013

Using unreliable virtual hardware to inject errors in extreme-scale systems

Scott Levy; Matthew G. F. Dosanjh; Patrick G. Bridges; Kurt Brian Ferreira

Fault tolerance is a key obstacle to next generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. To mitigate this, significant research has focused on developing and validating fault tolerance techniques. However, evaluating techniques for withstanding hardware failures at large scale is challenging because replicating those failures on small-scale testbeds is difficult. In this paper, we propose a virtualization-based framework for creating testbeds with unreliable virtual hardware. Our proposed approach allows for comprehensive evaluation of fault tolerance techniques in a broad range of failure regimes. Although there are many other approaches for mimicking unreliable hardware, none of them offer the breadth, scalability, and performance that a virtualization-based solution does.
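
The paper's framework injects faults from virtual hardware, underneath an unmodified software stack. As a much simpler, hedged illustration of the kind of fault being mimicked (the function name is hypothetical, and this is user-level injection, not the virtualization-based approach the paper describes), the sketch below flips one randomly chosen bit in an application buffer:

```c
#include <stdlib.h>
#include <stddef.h>

/* Flip one randomly chosen bit in a buffer, mimicking a silent memory
 * corruption. User-level injection like this is simple but limited; the
 * paper argues for injecting the equivalent fault in virtual hardware so
 * that the entire software stack, OS included, is exposed to it. */
void inject_bit_flip(unsigned char *buf, size_t len)
{
    if (len == 0)
        return;
    size_t byte = (size_t)rand() % len;
    int    bit  = rand() % 8;
    buf[byte] ^= (unsigned char)(1u << bit);
}
```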


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Improving application resilience to memory errors with lightweight compression

Scott Levy; Kurt Brian Ferreira; Patrick G. Bridges

In next-generation extreme-scale systems, application performance will be limited by memory performance characteristics. The first exascale system is projected to contain many petabytes of memory. In addition to the sheer volume of the memory required, device trends, such as shrinking feature sizes and reduced supply voltages, have the potential to increase the frequency of memory errors. As a result, resilience to memory errors is a key challenge. In this paper, we evaluate the viability of using memory compression to repair detectable uncorrectable errors (DUEs) in memory. We develop a software library, evaluate its performance and demonstrate that it is able to significantly compress memory of HPC applications. Further, we show that exploiting compressed memory pages to correct memory errors can significantly improve application performance on next-generation systems.
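
A hedged sketch of the core idea, not the software library developed in the paper: if an HPC memory page compresses well, a compressed shadow copy can be kept cheaply and used to reconstruct the page when a detectable uncorrectable error is reported in it. The function name below is hypothetical; the sketch only measures how well a page compresses, using zlib at a low-cost setting (link with -lz).

```c
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE_SIZE 4096

/* Compress one memory page with a lightweight zlib setting and report the
 * compressed size. A page that compresses well below PAGE_SIZE could be
 * shadowed by its compressed copy and restored after a detectable
 * uncorrectable error. Illustration only, not the paper's library. */
long compressed_page_size(const unsigned char *page)
{
    uLongf dst_len = compressBound(PAGE_SIZE);
    unsigned char *dst = malloc(dst_len);
    if (!dst)
        return -1;

    int rc = compress2(dst, &dst_len, page, PAGE_SIZE, Z_BEST_SPEED);
    free(dst);
    return (rc == Z_OK) ? (long)dst_len : -1;
}

int main(void)
{
    unsigned char page[PAGE_SIZE] = {0};   /* an all-zero page compresses well */
    long n = compressed_page_size(page);
    if (n >= 0)
        printf("page compressed from %d to %ld bytes\n", PAGE_SIZE, n);
    return 0;
}
```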


Cluster Computing and the Grid | 2016

Scheduling In-Situ Analytics in Next-Generation Applications

Oscar H. Mondragon; Patrick G. Bridges; Scott Levy; Kurt Brian Ferreira; Patrick M. Widener

Next-generation applications increasingly rely on in situ analytics to guide computation, reduce the amount of I/O performed, and perform other important tasks. Scheduling where and when to run analytics is challenging, however. This paper quantifies the costs and benefits of different approaches to scheduling applications and analytics on nodes in large-scale applications, including space sharing, uncoordinated time sharing, and gang scheduled time sharing.


Proceedings of the 21st European MPI Users' Group Meeting on | 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce

Patrick M. Widener; Kurt Brian Ferreira; Scott Levy; Torsten Hoefler

Relaxed synchronization offers the potential of maintaining application scalability by allowing many processes to make independent progress when some processes suffer delays. Yet, the benefits of this approach in important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise mitigation effects of nonblocking allreduce in workloads where allreduce is a major contributor to total execution time. Although a nonblocking allreduce is unlikely to provide significant benefit to applications in the low-OS-noise environments expected in next-generation HPC systems, we show that it can potentially improve application runtime with respect to other noise types.


Archive | 2014

Using simulation to evaluate the performance of resilience strategies and process failures

Scott Levy; Bryan Embry Topp; Dorian C. Arnold; Kurt Brian Ferreira; Patrick M. Widener; Torsten Hoefler

Fault-tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales, allowing the simulator to run 4x faster and use over 100x less memory.


International Workshop on Runtime and Operating Systems for Supercomputers | 2013

Evaluating the feasibility of using memory content similarity to improve system resilience

Scott Levy; Patrick G. Bridges; Kurt Brian Ferreira; Aidan P. Thompson; Christian Robert Trott

Building the next-generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory errors. In this paper, we propose a novel run-time for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
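
A hedged sketch of the feasibility question the paper studies (the function names are hypothetical helpers, not the paper's runtime): if many memory pages have identical contents, a page damaged by an uncorrectable error could be repaired from an identical twin elsewhere. The code below hashes fixed-size pages and estimates what fraction have at least one duplicate; a real implementation would confirm candidate matches byte-for-byte rather than trusting the hash alone.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* 64-bit FNV-1a hash of one page. */
static uint64_t hash_page(const unsigned char *p)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Estimate the fraction of pages in a memory region that have at least one
 * identical twin elsewhere in the region. Hash collisions are ignored here;
 * a real runtime would verify matches byte-for-byte. */
double duplicate_page_fraction(const unsigned char *mem, size_t npages)
{
    uint64_t *h = malloc(npages * sizeof *h);
    if (!h)
        return -1.0;
    for (size_t i = 0; i < npages; i++)
        h[i] = hash_page(mem + i * PAGE_SIZE);
    qsort(h, npages, sizeof *h, cmp_u64);

    size_t dup = 0;
    for (size_t i = 0; i < npages; i++) {
        int twin = (i > 0 && h[i] == h[i - 1]) ||
                   (i + 1 < npages && h[i] == h[i + 1]);
        dup += twin;
    }
    free(h);
    return (double)dup / (double)npages;
}
```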

Collaboration


Dive into Scott Levy's collaborations.

Top Co-Authors

Kurt Brian Ferreira, Sandia National Laboratories

Patrick M. Widener, Sandia National Laboratories

Aidan P. Thompson, Sandia National Laboratories