Sarah Michalak
Los Alamos National Laboratory
Publications
Featured research published by Sarah Michalak.
IEEE Transactions on Device and Materials Reliability | 2005
Sarah Michalak; Kevin W. Harris; Nicolas W. Hengartner; Bruce E. Takala; S.A. Wender
Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures.
IEEE Transactions on Device and Materials Reliability | 2012
Sarah Michalak; Andrew J. DuBois; Curtis B. Storlie; Heather Quinn; William N. Rust; David H. DuBois; David G. Modl; Andrea Manuzzato; Sean Blanchard
Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.
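As background for how such rates are commonly reported (a standard first-order conversion, not necessarily the exact procedure used in the paper), a failure-in-time (FIT) estimate can be obtained from an accelerated beam test by scaling the measured cross section to the terrestrial neutron flux:

\[
\hat{\sigma} = \frac{N_{\text{events}}}{\Phi_{\text{beam}}}, \qquad
\widehat{\mathrm{FIT}} = \hat{\sigma}\,\phi_{\text{atm}} \times 10^{9}\ \text{h},
\]

where $N_{\text{events}}$ is the number of observed crashes or SDC events, $\Phi_{\text{beam}}$ is the accumulated beam fluence in neutrons/cm$^2$, and $\phi_{\text{atm}}$ is a reference atmospheric neutron flux in neutrons/cm$^2$/h, so that FIT counts expected failures per $10^9$ device-hours.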
high performance graphics | 2011
Laura Monroe; Joanne Wendelberger; Sarah Michalak
We implement a fast, memory-sparing probabilistic top-k selection algorithm on the GPU. The algorithm proceeds via an iterative probabilistic guess-and-check process on pivots for a three-way partition. When the guess is correct, the problem is reduced to selection on a much smaller set. This probabilistic algorithm always gives a correct result and always terminates. Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
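A minimal sketch of the guess-and-check idea follows (a serial Python version for exposition, not the paper's GPU implementation; the function and variable names are hypothetical):

    import random

    def probabilistic_top_k(values, k):
        """Return the k largest elements via iterative random-pivot three-way
        partitioning: a Las Vegas scheme, so the answer is always correct and
        only the running time is random. Assumes 1 <= k <= len(values)."""
        selected, remaining, need = [], list(values), k
        while need > 0:
            pivot = random.choice(remaining)              # probabilistic guess
            greater = [x for x in remaining if x > pivot]
            equal = [x for x in remaining if x == pivot]
            if len(greater) >= need:
                remaining = greater                       # guess too low: keep only items above the pivot
            else:
                selected.extend(greater)                  # everything above the pivot is in the top k
                need -= len(greater)
                take = min(need, len(equal))
                selected.extend(equal[:take])
                need -= take
                remaining = [x for x in remaining if x < pivot]
        return selected

    # e.g. probabilistic_top_k([5, 1, 9, 3, 7, 7, 2], 3) returns [9, 7, 7] in some order

The GPU algorithm in the paper is built around the same three-way-partition idea, executed in parallel under tight fast-memory limits.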
Concurrency and Computation: Practice and Experience | 2016
Scott Pakin; Curtis B. Storlie; Michael Lang; Robert E. Fields; Eloy E. Romero; Craig Idler; Sarah Michalak; Hugh Greenberg; Josip Loncaric; Randal Rheinheimer; Gary Grider; Joanne Wendelberger
Power is becoming an increasingly important concern for large supercomputer centers. However, to date, there has been a dearth of studies of power usage ‘in the wild’—on production supercomputers running production workloads. In this paper, we present the initial results of a project to characterize the power usage of the three Top500 supercomputers at Los Alamos National Laboratory: Cielo, Roadrunner, and Luna (#15, #19, and #47, respectively, on the June 2012 Top500 list). Power measurements taken both at the switchboard level and within the compute racks are presented and discussed. Some noteworthy results of this study are that (1) variability in power consumption differs across architectures, even when running a similar workload, and (2) Los Alamos National Laboratory's scientific workload draws, on average, only 70–75% of LINPACK power and only 40–55% of nameplate power, implying that power capping may enable a substantial reduction in power and cooling infrastructure while impacting comparatively few applications.
IEEE Transactions on Nuclear Science | 2014
Daniel Oliveira; Paolo Rech; Heather Quinn; Thomas D. Fairbanks; Laura Monroe; Sarah Michalak; Christine M. Anderson-Cook; Philippe Olivier Alexandre Navaux; Luigi Carro
Graphics processing units (GPUs) are increasingly common in both safety-critical and high-performance computing (HPC) applications. Some current supercomputers are composed of thousands of GPUs, so the probability of device corruption becomes very high. Moreover, GPUs' parallel capabilities are very attractive for the automotive and aerospace markets, where reliability is a serious concern. In this paper, the neutron sensitivity of modern GPU caches and internal resources is experimentally evaluated. Various Duplication With Comparison strategies to reduce GPU radiation sensitivity are then presented and validated through radiation experiments. Threads should be carefully duplicated to avoid undesired errors on shared resources and to avoid the exacerbation of errors in critical resources such as the scheduler.
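A host-side sketch of the duplication-with-comparison idea is given below (in Python rather than CUDA, and not one of the specific strategies evaluated in the paper; the kernel and names are hypothetical):

    def run_with_comparison(kernel, data):
        """Execute the same computation on two independent copies of the input
        and accept the result only when the outputs agree; a mismatch signals
        a likely soft error, so the work is redone rather than accepted."""
        while True:
            out_a = kernel(list(data))   # first replica
            out_b = kernel(list(data))   # second replica
            if out_a == out_b:
                return out_a             # agreement: accept the result
            # disagreement: retry instead of silently propagating corruption

    # e.g. run_with_comparison(lambda xs: [x * x for x in xs], [1, 2, 3])

As the abstract emphasizes, a real GPU implementation must duplicate threads so that the replicas do not share corruptible resources (for example, the scheduler); otherwise a single fault can affect both copies and defeat the comparison.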
Journal of Computational and Graphical Statistics | 2012
Sarah Michalak; Andrew J. DuBois; David H. DuBois; Scott Vander Wiel; John Hogden
Sources of streaming data are proliferating and so are the demands to analyze and mine such data in real time. Statistical methods frequently form the core of real-time analysis, and therefore, statisticians increasingly encounter the challenges and implicit requirements of real-time systems. This work recommends a comprehensive strategy for development and implementation of streaming algorithms, beginning with exploratory data analysis in a flexible computing environment, leading to specification of a computational algorithm for the streaming setting and its initial implementation, and culminating in successive improvements to computational efficiency and throughput. This sequential development relies on a collaboration between statisticians, domain scientists, and the computer engineers developing the real-time system. This article illustrates the process in the context of a radio astronomy challenge to mitigate adverse impacts of radio frequency interference (noise) in searches for high-energy impulses from distant sources. The radio astronomy application motivates discussion of system design, code optimization, and the use of hardware accelerators such as graphics processing units, field-programmable gate arrays, and IBM Cell processors. Supplementary materials, available online, detail the computing systems typically used for streaming systems with real-time constraints and the process of optimizing code for high efficiency and throughput.
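As a toy illustration of the kind of computation a streaming implementation must support (a generic running-threshold detector in Python; this is not the algorithm developed for the radio astronomy system, and all names are hypothetical):

    def detect_impulses(samples, k=5.0, alpha=0.01):
        """Flag samples that exceed a running mean by k standard deviations,
        maintaining the statistics with exponential smoothing so per-sample
        state and work stay O(1), as a real-time streaming setting requires."""
        mean, var, flagged = 0.0, 1.0, []
        for t, x in enumerate(samples):
            if abs(x - mean) > k * var ** 0.5:
                flagged.append(t)                       # candidate impulse (or interference)
            else:
                mean = (1 - alpha) * mean + alpha * x   # update statistics only on quiet samples
                var = (1 - alpha) * var + alpha * (x - mean) ** 2
        return flagged

The point of the example is structural: an exploratory version might compute batch statistics over the full dataset, whereas the streaming version must be reorganized around constant-time updates before it can be mapped onto accelerators.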
IEEE Transactions on Nuclear Science | 2010
Eugene Normand; Jerry L. Wert; Heather Quinn; Thomas D. Fairbanks; Sarah Michalak; Gary Grider; Paul N Iwanchuk; John Morrison; S.A. Wender; Steve Johnson
Records of bit flips in the Cray-1 computer installed at Los Alamos, NM, in 1976 yield an upset rate in the Cray-1's bipolar SRAMs that correlates with single-event upsets (SEUs) induced by atmospheric neutrons.
Journal of the American Statistical Association | 2013
Curtis B. Storlie; Sarah Michalak; Heather Quinn; Andrew J. DuBois; Steven A. Wender; David H. DuBois
A soft error is an undesired change in an electronic device's state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high-performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated, and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.
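For readers unfamiliar with interval censoring, the basic likelihood underlying such analyses can be sketched as follows (a generic form, not the paper's full model, which additionally handles mixed effects and uncertainty in the interval endpoints):

\[
L(\theta) \;=\; \prod_{i=1}^{n} \bigl[ F_\theta(R_i) - F_\theta(L_i) \bigr],
\]

where the $i$th failure is known only to have occurred within the interval $(L_i, R_i]$ and $F_\theta$ is the assumed failure-time distribution.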
radiation effects data workshop | 2010
Sarah Michalak; Andrew J. DuBois; Curtis B. Storlie; Heather Quinn; William N. Rust; David H. DuBois; David G. Modl; Andrea Manuzzato; Sean Blanchard
Microprocessor-based systems are the most common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that could take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results. In recent years, neutron-induced failures in HPC hardware have been observed, and researchers have started to study how neutron radiation affects microprocessor-based scientific computations. This paper presents results from an accelerated neutron test focusing on two microprocessors used in Roadrunner, the first petaflop system.
Technometrics | 2008
Nicolas W. Hengartner; Sarah Michalak; Bruce E. Takala; S.A. Wender
Cosmic ray–induced neutrons can cause bit flips in silicon-based electronic devices, such as computer memory. To estimate the frequency of occurrence, a device may be tested by placing it in a neutron beam at a testing facility, such as the Irradiation of Chips and Electronics facility at the Los Alamos Neutron Science Center, Los Alamos National Laboratory. The bit failure cross-section of a silicon-based electronic device describes the probability of causing a bit flip as a function of neutron energy. This article discusses estimation of the bit failure cross-section based on neutron beam testing. We show that this is a severely ill-posed inverse problem. We present a general methodology for evaluating, before the experiment, the extent to which the experimental protocol permits estimation of the bit failure cross-section through nonparametric penalized maximum likelihood.
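The underlying measurement model can be stated informally as follows (a standard formulation, written with notation that may differ from the article's):

\[
\mathbb{E}[\text{number of bit flips}] \;=\; \int \sigma(E)\,\phi(E)\,dE,
\]

where $\sigma(E)$ is the bit failure cross-section at neutron energy $E$ and $\phi(E)$ is the fluence spectrum delivered by the beam. Recovering the function $\sigma(\cdot)$ from counts observed under only a few available spectra $\phi(\cdot)$ is what makes the estimation severely ill-posed and motivates the penalized maximum likelihood approach.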