Andrew J. DuBois | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Andrew J. DuBois is active.

Explore More

Publication

Featured researches published by Andrew J. DuBois.

IEEE Transactions on Device and Materials Reliability | 2012

Assessment of the Impact of Cosmic-Ray-Induced Neutrons on Hardware in the Roadrunner Supercomputer

Sarah Michalak; Andrew J. DuBois; Curtis B. Storlie; Heather Quinn; William N. Rust; David H. DuBois; David G. Modl; Andrea Manuzzato; Sean Blanchard

Microprocessor-based systems are a common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that may take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e., computationally incorrect results that originate from an undetected fault. In recent years, neutron-induced effects in HPC hardware have been observed, and researchers have started to study how neutrons impact microprocessor-based computations. This paper presents results from an accelerated neutron-beam test focusing on two microprocessors used in Roadrunner, which is the first petaflop supercomputer. Research questions of interest include whether the application running affects neutron susceptibility and whether different replicates of the hardware under test have different susceptibilities to neutrons. Estimated failures in time for crashes and for SDC are presented for the hardware under test, for the Triblade servers used for computation in Roadrunner, and for Roadrunner.

IEEE Transactions on Visualization and Computer Graphics | 2007

NPU-Based Image Compositing in a Distributed Visualization System

David L. Pugmire; Laura Monroe; Carolyn Connor Davenport; Andrew J. DuBois; David H. DuBois; Stephen W. Poole

This paper describes the first use of a network processing unit (NPU) to perform hardware-based image composition in a distributed rendering system. The image composition step is a notorious bottleneck in a clustered rendering system. Furthermore, image compositing algorithms do not necessarily scale as data size and number of nodes increase. Previous researchers have addressed the composition problem via software and/or custom-built hardware. We used the heterogeneous multicore computation architecture of the Intel IXP28XX NPU, a fully programmable commercial off-the-shelf (COTS) technology, to perform the image composition step. With this design, we have attained a nearly four-times performance increase over traditional software-based compositing methods, achieving sustained compositing rates of 22-28 fps on a 1.021times1.024 image. This system is fully scalable with a negligible penalty in frame rate, is entirely COTS, and is flexible with regard to operating system, rendering software, graphics cards, and node architecture. The NPU-based compositor has the additional advantage of being a modular compositing component that is eminently suitable for integration into existing distributed software visualization packages.

ACM Transactions on Reconfigurable Technology and Systems | 2010

Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer with Application

David H. DuBois; Andrew J. DuBois; Thomas M Boorman; Carolyn Connor; Steve Poole

Double precision floating point Sparse Matrix-Vector Multiplication (SMVM) is a critical computational kernel used in iterative solvers for systems of sparse linear equations. The poor data locality exhibited by sparse matrices along with the high memory bandwidth requirements of SMVM result in poor performance on general purpose processors. Field Programmable Gate Arrays (FPGAs) offer a possible alternative with their customizable and application-targeted memory sub-system and processing elements.

Journal of Computational and Graphical Statistics | 2012

Developing Systems for Real-Time Streaming Analysis

Sarah Michalak; Andrew J. DuBois; David H. DuBois; Scott Vander Wiel; John Hogden

Sources of streaming data are proliferating and so are the demands to analyze and mine such data in real time. Statistical methods frequently form the core of real-time analysis, and therefore, statisticians increasingly encounter the challenges and implicit requirements of real-time systems. This work recommends a comprehensive strategy for development and implementation of streaming algorithms, beginning with exploratory data analysis in a flexible computing environment, leading to specification of a computational algorithm for the streaming setting and its initial implementation, and culminating in successive improvements to computational efficiency and throughput. This sequential development relies on a collaboration between statisticians, domain scientists, and the computer engineers developing the real-time system. This article illustrates the process in the context of a radio astronomy challenge to mitigate adverse impacts of radio frequency interference (noise) in searches for high-energy impulses from distant sources. The radio astronomy application motivates discussion of system design, code optimization, and the use of hardware accelerators such as graphics processing units, field-programmable gate arrays, and IBM Cell processors. Supplementary materials, available online, detail the computing systems typically used for streaming systems with real-time constraints and the process of optimizing code for high efficiency and throughput.

Journal of the American Statistical Association | 2013

A Bayesian Reliability Analysis of Neutron-Induced Errors in High Performance Computing Hardware

Curtis B. Storlie; Sarah Michalak; Heather Quinn; Andrew J. DuBois; Steven A. Wender; David H. DuBois

A soft error is an undesired change in an electronic devices state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first Petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.

field programmable custom computing machines | 2008

Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer

David H. DuBois; Andrew J. DuBois; Carolyn Connor; Stephen W. Poole

radiation effects data workshop | 2010

Neutron Beam Testing of High Performance Computing Hardware

Sarah Michalak; Andrew J. DuBois; Curtis B. Storlie; Heather Quinn; William N. Rust; David H. DuBois; David G. Modl; Andrea Manuzzato; Sean Blanchard

Microprocessor-based systems are the most common design for high-performance computing (HPC) platforms. In these systems, several thousands of microprocessors can participate in a single calculation that could take weeks or months to complete. When used in this manner, a fault in any of the microprocessors could cause the computation to crash or cause silent data corruption (SDC), i.e.~computationally incorrect results. In recent years, neutron-induced failures in HPC hardware have been observed, and researchers have started to study how neutron radiation affect microprocessor-based scientific computations. This paper presents results from an accelerated neutron test focusing on two microprocessors used in Roadrunner, the first Petaflop system.

ieee international conference on high performance computing data and analytics | 2014

Correctness field testing of production and decommissioned high performance computing platforms at los alamos national laboratory

Sarah Michalak; William N. Rust; John T. Daly; Andrew J. DuBois; David H. DuBois

Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially effect computational correctness.

field-programmable custom computing machines | 2009

Non-Preconditioned Conjugate Gradient on Cell and FPGA Based Hybrid Supercomputer Nodes

David H. DuBois; Andrew J. DuBois; Thomas M Boorman; Carolyn Connor

This work presents a detailed implementation of a double precision, non-preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture™ in conjunction with x86 Opteron™ processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.

Archive | 2013