Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sean Blanchard is active.

Publication


Featured research published by Sean Blanchard.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults

Vilas Sridharan; Jon Stearley; Nathan DeBardeleben; Sean Blanchard; Sudhanva Gurumurthi

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.
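The per-vendor comparison described above reduces to normalizing distinct fault counts by device-hours. A minimal sketch of how such a FIT-style rate could be computed from field fault records; the record fields, device counts, and observation window below are hypothetical placeholders, not the paper's data:

```python
# Illustrative sketch (hypothetical data): per-vendor DRAM fault rates.
from collections import Counter

# One record per distinct fault: (vendor, dimm_serial, fault_type)
fault_records = [
    ("vendor_A", "D0001", "transient"),
    ("vendor_A", "D0042", "permanent"),
    ("vendor_B", "D1107", "transient"),
    # ... thousands more in a real field study
]

# Assumed device population and observation window (placeholders).
devices_per_vendor = {"vendor_A": 40_000, "vendor_B": 35_000}
observation_hours = 2 * 365 * 24  # two years of operation

faults_by_vendor = Counter(vendor for vendor, _, _ in fault_records)

for vendor, n_faults in faults_by_vendor.items():
    device_hours = devices_per_vendor[vendor] * observation_hours
    fit = n_faults / device_hours * 1e9  # faults per 10^9 device-hours
    print(f"{vendor}: {n_faults} faults, {fit:.1f} FIT per device")
```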


Architectural Support for Programming Languages and Operating Systems | 2015

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Vilas Sridharan; Nathan DeBardeleben; Sean Blanchard; Kurt Brian Ferreira; Jon Stearley; John Shalf; Sudhanva Gurumurthi

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
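The methodological point about counting errors versus faults can be made concrete: a single faulty DRAM location that is read repeatedly produces many corrected-error reports but only one underlying fault. A minimal sketch, with a hypothetical log format, of collapsing error reports into distinct faults before computing rates:

```python
# Illustrative sketch (hypothetical log format): many error reports may map
# to a single underlying fault at one physical location.
error_reports = [
    # (node, dimm, row, column)
    ("nid00012", "DIMM3", 0x1A2B, 0x40),
    ("nid00012", "DIMM3", 0x1A2B, 0x40),  # same location reported again
    ("nid00107", "DIMM0", 0x0FFE, 0x08),
]

n_errors = len(error_reports)
n_faults = len(set(error_reports))  # distinct faulty locations

print(f"{n_errors} error reports, {n_faults} underlying faults")
# Rates computed from n_errors overweight frequently accessed faulty locations;
# the study argues reliability conclusions should be drawn from n_faults.
```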


IEEE Transactions on Device and Materials Reliability | 2012

Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer

Sarah Michalak; Andrew J. DuBois; Curtis B. Storlie; Heather Quinn; William N. Rust; David H. DuBois; David G. Modl; Andrea Manuzzato; Sean Blanchard

Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.
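Accelerated beam tests of this kind typically estimate a per-device cross-section from the number of observed events and the delivered neutron fluence, then scale by the terrestrial neutron flux to obtain a failure-in-time (FIT) estimate. A minimal sketch of that arithmetic with hypothetical numbers (not the paper's data); the sea-level flux value is an approximate, commonly cited order of magnitude:

```python
# Illustrative FIT estimate from an accelerated neutron-beam test.
# All numbers are hypothetical placeholders, not results from the paper.
observed_events = 25           # crashes or SDC events seen under beam
fluence = 1.0e11               # delivered neutron fluence (n/cm^2)

cross_section = observed_events / fluence   # cm^2 per device

# Approximate sea-level high-energy neutron flux (~13 n/cm^2/hr); the flux
# at the machine's actual altitude and location would be used in practice.
terrestrial_flux = 13.0 / 3600.0            # n/cm^2/s

event_rate = cross_section * terrestrial_flux   # events per second per device
fit = event_rate * 3600.0 * 1e9                 # events per 10^9 device-hours
print(f"cross-section = {cross_section:.2e} cm^2, ~{fit:.0f} FIT per device")
```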


International Parallel and Distributed Processing Symposium | 2014

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability

Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Song Fu

As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. HPC applications of today are only affected by soft errors to a small degree but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control what application, which sub-function, when and how to inject soft errors with different granularities, without interfering with other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
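The core operation behind this style of injection is a targeted bit flip in a value produced by an emulated instruction. The sketch below shows only that conceptual step in plain Python; F-SEFI itself performs the flip inside QEMU's instruction emulation, which is not reproduced here:

```python
import random

def flip_random_bit(value: int, width: int = 64) -> int:
    """Return `value` with one randomly chosen bit inverted.

    Mimics a single-bit soft error in the destination operand of an emulated
    instruction; the real tool applies this inside the emulator.
    """
    bit = random.randrange(width)
    return value ^ (1 << bit)

# Example: corrupt the result of one operation at a chosen point in the
# target application's execution.
original = 0x00000000DEADBEEF
corrupted = flip_random_bit(original)
print(hex(original), "->", hex(corrupted))
```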


International Conference on Parallel Processing | 2011

Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience

Nathan DeBardeleben; Sean Blanchard; Qiang Guan; Ziming Zhang; Song Fu

As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature size and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers of today have to deal with these faults to a small degree, and it is expected that this will only become a larger problem as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using the open source virtual machine and processor emulator (QEMU), we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors in the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.


European Conference on Parallel Processing | 2013

GPU Behavior on a Large HPC Cluster

Nathan DeBardeleben; Sean Blanchard; Laura Monroe; Philip Romero; Daryl Grunau; Craig Idler; Cornell Wright

We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory. GPUs offer a high theoretical FLOPS rate and are reasonably inexpensive and widely available, but they are relatively new to HPC, which demands both consistently high performance across nodes and a consistently low error rate.


Pacific Rim International Symposium on Dependable Computing | 2013

Exploring Time and Frequency Domains for Accurate and Automated Anomaly Detection in Cloud Computing Systems

Qiang Guan; Song Fu; Nathan DeBardeleben; Sean Blanchard

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is crucial for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. For different types of failures, the data display different correlations with the performance metrics. In this paper, we present a wavelet-based multi-scale anomaly identification mechanism that can analyze profiled cloud performance metrics in both time and frequency domains and identify anomalous cloud behaviors. Learning technologies are exploited to adapt the selection of mother wavelets, and a sliding detection window is employed to handle cloud dynamicity and improve anomaly detection accuracy. We have implemented a prototype of the anomaly identification system and conducted experiments on an on-campus cloud computing environment. Experimental results show the proposed mechanism can achieve 93.3% detection sensitivity while keeping the false positive rate as low as 6.1%, outperforming the other anomaly detection schemes tested.
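As a rough illustration of the multi-scale idea (not the paper's mechanism, which also learns the mother wavelet and uses a sliding detection window), a window of a performance metric can be decomposed with a discrete wavelet transform and flagged when its detail-coefficient energies deviate strongly from a baseline. Assumes the PyWavelets package and synthetic data:

```python
import numpy as np
import pywt  # PyWavelets

def detail_energies(window, wavelet="db4", level=3):
    """Energy of the detail coefficients at each decomposition scale."""
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs[1:]])  # skip approximation

# Baseline: energies of windows seen during normal operation (synthetic here).
rng = np.random.default_rng(0)
normal_windows = [rng.normal(50, 2, 256) for _ in range(200)]
baseline = np.array([detail_energies(w) for w in normal_windows])
mean, std = baseline.mean(axis=0), baseline.std(axis=0)

def is_anomalous(window, threshold=4.0):
    z = (detail_energies(window) - mean) / std
    return bool(np.any(np.abs(z) > threshold))

# A window with an injected spike should be flagged.
test = rng.normal(50, 2, 256)
test[100:110] += 30
print(is_anomalous(test))  # True for a strong enough deviation
```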


International Conference on Machine Learning and Applications | 2016

Relational Synthesis of Text and Numeric Data for Anomaly Detection on Computing System Logs

Elisabeth Baseman; Sean Blanchard; Zongze Li; Song Fu

Monitoring high performance computing systems has become increasingly difficult as researchers and system analysts face the challenge of synthesizing a wide range of monitoring information in order to detect system problems on ever larger machines. We present a method for anomaly detection on syslog data, one of the most important data streams for determining system health. Syslog messages are difficult to analyze because they combine structured natural language text with numeric values. We present an anomaly detection framework that combines graph analysis, relational learning, and kernel density estimation to detect unusual syslog messages. We design an event block detector, which finds groups of related syslog messages, to retrieve the entire section of syslog messages associated with a single anomalous line. Our novel approach successfully retrieves anomalous behaviors inserted into syslog files from a virtual machine, including messages indicating serious system problems. We also test our approach on syslog messages from the Trinity supercomputer and find that our methods do not generate significant false positives.
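One ingredient named above, kernel density estimation over features derived from log lines, can be sketched as follows. This uses scikit-learn and a made-up two-feature representation of each message (token count and count of numeric tokens); it is not the paper's full graph/relational pipeline:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def featurize(line: str):
    """Toy numeric representation of a syslog line (illustrative only)."""
    tokens = line.split()
    numeric = sum(t.isdigit() for t in tokens)
    return [float(len(tokens)), float(numeric)]

train_lines = [
    "kernel: eth0 link up 1000 Mbps",
    "sshd[1021]: session opened for user alice",
    "systemd: Started Daily apt upgrade and clean activities",
] * 50  # stand-in for a large corpus of normal messages

X = np.array([featurize(line) for line in train_lines])
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(X)

def anomaly_score(line: str) -> float:
    # Lower log-density means more unusual relative to the training corpus.
    return -kde.score_samples(np.array([featurize(line)]))[0]

print(anomaly_score("mce: Hardware Error: CPU 3 Machine Check 0 Bank 4"))
```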


High Performance Distributed Computing | 2016

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra

Panruo Wu; Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Dingwen Tao; Xin Liang; Jieyang Chen; Zizhong Chen

Algorithm based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper we seek to close the gap by directly using a comprehensive architectural fault model and devise a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in High Performance Linpack (HPL) to demonstrate its feasibility in a large-scale, high-performance benchmark. We conduct architectural fault injection experiments and large scale experiments to empirically validate its fault tolerance and demonstrate the overhead of error handling, respectively.
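The foundation of ABFT for dense linear algebra is the classic checksum encoding: augment the operands with row and column sums, carry them through the computation, and compare against sums recomputed from the result. A minimal numpy sketch of that idea for matrix multiplication; the paper's scheme for HPL/LU factorization is considerably more involved:

```python
import numpy as np

def checksum_matmul(A, B):
    """Carry Huang/Abraham-style checksums through C = A @ B."""
    Ac = np.vstack([A, A.sum(axis=0)])                 # column-checksum row on A
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row-checksum column on B
    return Ac @ Br                                     # (m+1) x (n+1) encoded product

def verify(Cf, tol=1e-8):
    """Compare the carried checksums against sums recomputed from the result."""
    C = Cf[:-1, :-1]
    row_ok = np.allclose(Cf[:-1, -1], C.sum(axis=1), atol=tol)
    col_ok = np.allclose(Cf[-1, :-1], C.sum(axis=0), atol=tol)
    return row_ok and col_ok

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))

Cf = checksum_matmul(A, B)
print("clean result passes:", verify(Cf))        # True

Cf[2, 1] += 1.0                                  # simulate a silent corruption
print("corrupted result passes:", verify(Cf))    # False: checksum mismatch
```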


Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale | 2015

Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection

Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Song Fu

Soft errors are becoming an important issue in computing systems. Near threshold voltage (NTV), reduced circuit sizes, high performance computing (HPC), and high altitude computing all present interesting challenges in this area. Much of the existing literature has focused on hardware techniques to mitigate and measure soft errors at the hardware level. Instead, in this paper we explore the soft error susceptibility of three common sorting algorithms at the software layer. We focus on the comparison operator and use our software fault injection tool to place faults with fine precision during the execution of these algorithms. We explore how the algorithm susceptibilities vary based on input and bit position and relate these faults back to the source code to study how algorithmic decisions impact the reliability of the codes. Finally, we look at the question of the number of fault injections required for statistical significance. Using the standard sample-size equations from hardware fault injection practice, we calculate the number of injections that should be required to achieve confidence in our results. Then we show, empirically, that more fault injections are required before we gain confidence in our experiments.
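The sample-size calculation referred to above is usually the finite-population formula commonly used in statistical fault injection studies: n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))), with N the size of the fault space, e the margin of error, t the cut-off for the desired confidence level, and p = 0.5 as the conservative choice. A small sketch of that calculation, with illustrative parameter values:

```python
import math

def required_injections(N, confidence_t=1.96, margin_e=0.01, p=0.5):
    """Finite-population sample size for statistical fault injection.

    confidence_t ~= 1.96 for 95% confidence, margin_e = 1% margin of error,
    p = 0.5 is the conservative (worst-case) proportion estimate.
    """
    return math.ceil(
        N / (1 + margin_e**2 * (N - 1) / (confidence_t**2 * p * (1 - p)))
    )

# Example: a fault space of 10^9 possible (bit, cycle) injection targets.
print(required_injections(10**9))   # ~9604 injections for 95% conf., 1% margin
```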

Collaboration


Dive into Sean Blanchard's collaborations.

Top Co-Authors

Nathan DeBardeleben, Los Alamos National Laboratory
Qiang Guan, Los Alamos National Laboratory
Song Fu, University of North Texas
Heather Quinn, Los Alamos National Laboratory
Panruo Wu, University of California
Curtis B. Storlie, Los Alamos National Laboratory
David G. Modl, Los Alamos National Laboratory
Elisabeth Baseman, Los Alamos National Laboratory
Jon Stearley, Sandia National Laboratories