Publication


Featured research published by Don Maxwell.


high-performance computer architecture | 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second-fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from domains such as astrophysics, fusion, climate, and combustion routinely use to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.
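
Field-data analyses of this kind typically reduce to aggregate statistics such as mean time between failures (MTBF) and per-node error counts computed from timestamped error logs. The following is a minimal sketch of that bookkeeping, assuming a simple list of (timestamp, node, error_type) records; the record format, node names, and helper functions are hypothetical and not taken from the paper:

```python
from datetime import datetime
from collections import Counter

# Hypothetical error-log records: (timestamp, node, error_type).
# Real field data (e.g., Titan's GPU logs) is far richer; this only
# illustrates the kind of aggregate statistics such studies report.
events = [
    (datetime(2014, 3, 1, 4, 10), "nid01023", "SBE"),   # single-bit error
    (datetime(2014, 3, 2, 11, 42), "nid04771", "DBE"),  # double-bit error
    (datetime(2014, 3, 5, 19, 3), "nid01023", "DBE"),
    (datetime(2014, 3, 9, 7, 55), "nid08812", "SBE"),
]

def mtbf(events):
    """Mean time between failures over the observed window."""
    times = sorted(t for t, _, _ in events)
    if len(times) < 2:
        return None
    return (times[-1] - times[0]) / (len(times) - 1)

def errors_per_node(events):
    """Count errors per node to spot disproportionately faulty GPUs."""
    return Counter(node for _, node, _ in events)

print("MTBF:", mtbf(events))
print("Errors by type:", Counter(kind for _, _, kind in events))
print("Errors by node:", errors_per_node(events))
```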


ieee international conference on high performance computing data and analytics | 2015

Reliability lessons learned from GPU experience with the Titan supercomputer at the Oak Ridge Leadership Computing Facility

Devesh Tiwari; Saurabh Gupta; George Gallarno; James H. Rogers; Don Maxwell

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage, since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of operating GPUs efficiently at scale. These experiences are helpful to HPC sites that already operate large-scale GPU clusters or plan to deploy GPUs in the future.


international supercomputing conference | 2013

TUE, a New Energy-Efficiency Metric Applied at ORNL’s Jaguar

Michael K. Patterson; Stephen W. Poole; Chung-Hsing Hsu; Don Maxwell; William Tschudi; Henry Coles; David Martinez; Natalie J. Bates

The Power Usage Effectiveness (PUE) metric has been successful in improving the energy efficiency of data centers, but it is not perfect. One challenge is that PUE does not account for the power distribution and cooling losses inside IT equipment. This is particularly problematic in the HPC (high performance computing) space, where system suppliers are moving cooling and power subsystems into or out of the cluster. This paper proposes two new metrics: ITUE (IT-power usage effectiveness), similar to PUE but “inside” the IT equipment, and TUE (total-power usage effectiveness), which combines the two for a total efficiency picture. We conclude with a demonstration of the method and a case study of measurements at ORNL’s Jaguar system. TUE is the ratio of total energy (internal and external support energy uses) to the specific energy used in the HPC system. TUE can also serve as a means of comparing one HPC site to another.
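
Read as ratios, the metrics described in the abstract relate roughly as follows. This is a sketch based on the standard PUE definition and the description above, not a quotation of the paper's exact formulas:

```latex
\[
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT}}},
\qquad
\mathrm{ITUE} = \frac{E_{\text{IT}}}{E_{\text{compute}}},
\qquad
\mathrm{TUE} = \mathrm{ITUE} \times \mathrm{PUE}
             = \frac{E_{\text{total facility}}}{E_{\text{compute}}},
\]
where $E_{\text{compute}}$ denotes the energy delivered to the compute
components themselves (processors, memory, interconnect), excluding fans
and power-conversion losses inside the IT equipment.
```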


ieee international conference on high performance computing data and analytics | 2008

New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors

Gonzalo Alvarez; Michael S. Summers; Don Maxwell; Markus Eisenbach; Jeremy S. Meredith; Jeffrey M. Larkin; John M. Levesque; Thomas A. Maier; Paul R. C. Kent; Eduardo F. D'Azevedo; Thomas C. Schulthess

Staggering computational and algorithmic advances in recent years now make possible systematic Quantum Monte Carlo (QMC) simulations of high-temperature (high-Tc) superconductivity in a microscopic model, the two-dimensional (2D) Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nano-scale inhomogeneities on pair formation and the superconducting transition temperature, which is necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single-/double-precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT4 systems of the Oak Ridge National Laboratory (ORNL), for example, we currently run production jobs on 31 thousand processors and thereby routinely achieve a sustained performance that exceeds 200 TFlop/s. On a system with 49 thousand processors we achieved a sustained performance of 409 TFlop/s. We present a study of how random disorder in the effective Coulomb interaction strength affects the superconducting transition temperature in the Hubbard model.
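
The "delayed updates" mentioned above rest on a general linear-algebra trick: rather than applying each accepted Monte Carlo move as its own rank-1 update to a large matrix, several updates are accumulated and applied at once as a single rank-k matrix multiply, trading memory-bound BLAS-2 work for compute-bound BLAS-3 work. A minimal NumPy sketch of the idea, with made-up matrix sizes; it illustrates the numerical equivalence of the two forms, not the actual QMC code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 512, 32          # matrix size and delay depth (illustrative values)

G = rng.standard_normal((n, n))      # stand-in for the Green's-function matrix
us = rng.standard_normal((n, k))     # k accumulated update column vectors
vs = rng.standard_normal((n, k))     # k corresponding row vectors

# Immediate updates: k separate rank-1 (BLAS-2) operations.
G_rank1 = G.copy()
for j in range(k):
    G_rank1 += np.outer(us[:, j], vs[:, j])

# Delayed updates: apply the same k updates as one rank-k (BLAS-3)
# matrix multiply, which runs far more efficiently on cache-based
# CPUs and accelerators.
G_delayed = G + us @ vs.T

assert np.allclose(G_rank1, G_delayed)
```

In the real QMC algorithm the update vectors themselves depend on earlier accepted moves, so additional bookkeeping is required; the sketch only shows why batching the updates pays off.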


Concurrency and Computation: Practice and Experience | 2018

Are we witnessing the spectre of an HPC meltdown?

Verónica G. Vergara Larrea; Michael J. Brim; Wayne Joubert; Swen Boehm; Matthew B. Baker; Oscar R. Hernandez; Sarp Oral; James A Simmons; Don Maxwell

We measure and analyze the performance observed when running applications and benchmarks before and after the Meltdown and Spectre fixes have been applied to the Cray supercomputers and supporting systems at the Oak Ridge Leadership Computing Facility (OLCF). Of particular interest is the effect of these fixes on applications selected from the OLCF portfolio when running at scale. This comprehensive study presents results from experiments run on the Titan, Eos, Cumulus, and Percival supercomputers at the OLCF. The results from this study are useful for HPC users running on Cray supercomputers and help to better understand the impact that these two vulnerabilities have on diverse HPC workloads at scale.
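
The core measurement in such a before/after study is the relative runtime change per benchmark once the patches are applied. A minimal sketch, with invented benchmark names and timings purely for illustration, not results from the paper:

```python
# Hypothetical wall-clock times (seconds) before and after the Meltdown/Spectre
# patches; names and numbers are illustrative only.
before = {"benchmark_a": 120.0, "benchmark_b": 300.0, "benchmark_c": 95.0}
after  = {"benchmark_a": 126.5, "benchmark_b": 301.2, "benchmark_c": 104.3}

for name in before:
    slowdown = (after[name] - before[name]) / before[name] * 100.0
    print(f"{name}: {slowdown:+.1f}% change in runtime after patches")
```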


ieee international conference on high performance computing, data, and analytics | 2017

Experiences Evaluating Functionality and Performance of IBM POWER8+ Systems

Verónica G. Vergara Larrea; Wayne Joubert; M. Berrill; Swen Boehm; Arnold N. Tharrington; Wael R. Elwasif; Don Maxwell

In preparation for Summit, Oak Ridge National Laboratory’s next-generation supercomputer, two IBM Power-based systems were deployed in late 2016 at the Oak Ridge Leadership Computing Facility (OLCF). This paper presents a detailed description of the acceptance of the first IBM Power-based early-access systems installed at the OLCF. The two systems, Summitdev and Tundra, contain IBM POWER8+ processors with NVIDIA Pascal GPUs and were acquired to provide researchers with a platform to optimize codes for the Power architecture. In addition, this paper presents early functional and performance results obtained on Summitdev with the latest software stack available.


dependable systems and networks | 2015

Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

Saurabh Gupta; Devesh Tiwari; Christopher Jantzi; James H. Rogers; Don Maxwell


Archive | 2010

Reducing Application Runtime Variability on Jaguar XT5

Sarp Oral; Feiyi Wang; David A Dillow; Ross Miller; Galen M. Shipman; Don Maxwell; Dave Henseler; Jeff Becklehimer; Jeff Larkin


Archive | 2010

Monitoring Tools for Large Scale Systems

Ross Miller; Jason J Hill; David A Dillow; Raghul Gunasekaran; Don Maxwell


Archive | 2014

I/O Router Placement and Fine-Grained Routing on Titan to Support Spider II

Matthew A Ezell; David A Dillow; H Sarp Oral; Feiyi Wang; Devesh Tiwari; Don Maxwell; Dustin B Leverman; Jason J Hill

Collaboration


Dive into Don Maxwell's collaborations.

Top Co-Authors

David A Dillow (Oak Ridge National Laboratory)
Devesh Tiwari (Oak Ridge National Laboratory)
James H. Rogers (Oak Ridge National Laboratory)
Jason J Hill (Oak Ridge National Laboratory)
Saurabh Gupta (Oak Ridge National Laboratory)
Feiyi Wang (Oak Ridge National Laboratory)
Galen M. Shipman (Oak Ridge National Laboratory)
Matthew A Ezell (Oak Ridge National Laboratory)