Don Maxwell
Oak Ridge National Laboratory
Publication
Featured research published by Don Maxwell.
High-Performance Computer Architecture | 2015
Devesh Tiwari; Saurabh Gupta; James H. Rogers; Don Maxwell; Paolo Rech; Sudharshan S. Vazhkudai; Daniel Oliveira; Dave Londo; Nathan DeBardeleben; Philippe Olivier Alexandre Navaux; Luigi Carro; Arthur S. Bland
Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose computing device. Titan, the world's second-fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.
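As a rough illustration of the kind of field-data analysis such a study involves, the sketch below computes a per-GPU error rate and a FIT figure (failures in time, errors per 10^9 device-hours) from a list of error-event timestamps. The event list, fleet size, and observation window are hypothetical values invented for the example, not results from the paper.

import datetime as dt

# Hypothetical error-event timestamps extracted from system logs (not real data).
events = [
    dt.datetime(2014, 3, 1, 4, 12),
    dt.datetime(2014, 3, 9, 17, 40),
    dt.datetime(2014, 3, 25, 2, 5),
]

num_gpus = 18_688                  # Titan's GPU count, from the abstract
window_hours = 31 * 24             # hypothetical one-month observation window
device_hours = num_gpus * window_hours

errors_per_device_hour = len(events) / device_hours
fit = errors_per_device_hour * 1e9  # FIT: expected errors per 10^9 device-hours

print(f"errors per device-hour: {errors_per_device_hour:.3e}")
print(f"FIT (errors per 1e9 device-hours): {fit:.1f}")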
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Devesh Tiwari; Saurabh Gupta; George Gallarno; James H. Rogers; Don Maxwell
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second-fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of operating GPUs efficiently at scale. These experiences are helpful to HPC sites that already have large-scale GPU clusters or plan to deploy GPUs in the future.
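One operational implication of per-component error rates at this scale can be sketched with a simple reliability model: assuming independent, exponentially distributed failures, the mean time between failures seen by a job shrinks in proportion to the number of GPUs it uses. The numbers below are hypothetical and the model is a deliberate simplification, not the paper's analysis.

# Simple scaling model: with independent exponential failures, a job using n GPUs
# sees an aggregate failure rate roughly n times the per-GPU rate.
def job_mtbf_hours(per_gpu_mtbf_hours: float, gpus_in_job: int) -> float:
    return per_gpu_mtbf_hours / gpus_in_job

per_gpu_mtbf = 500_000.0   # hypothetical per-GPU MTBF in hours
for n in (1, 1_000, 18_688):
    print(f"{n:>6} GPUs -> job-level MTBF ~ {job_mtbf_hours(per_gpu_mtbf, n):,.1f} hours")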
International Supercomputing Conference | 2013
Michael K. Patterson; Stephen W. Poole; Chung-Hsing Hsu; Don Maxwell; William Tschudi; Henry Coles; David Martinez; Natalie J. Bates
The metric Power Usage Effectiveness (PUE) has been successful in improving the energy efficiency of data centers, but it is not perfect. One challenge is that PUE does not account for the power distribution and cooling losses inside IT equipment. This is particularly problematic in the HPC (high performance computing) space, where system suppliers are moving cooling and power subsystems into or out of the cluster. This paper proposes two new metrics: ITUE (IT-power usage effectiveness), similar to PUE but “inside” the IT equipment, and TUE (total-power usage effectiveness), which combines the two for a total efficiency picture. We conclude with a demonstration of the method and a case study of measurements at ORNL’s Jaguar system. TUE is the ratio of total energy (internal and external support energy uses) to the specific energy used for HPC computation. TUE can also serve as a means of comparing one HPC site to another.
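As a sketch of how the two proposed metrics relate to PUE (the precise energy boundaries are defined in the paper itself):

\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}, \qquad
\mathrm{ITUE} = \frac{E_{\text{IT equipment}}}{E_{\text{compute-specific}}}, \qquad
\mathrm{TUE} = \mathrm{ITUE} \times \mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{compute-specific}}}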
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Gonzalo Alvarez; Michael S. Summers; Don Maxwell; Markus Eisenbach; Jeremy S. Meredith; Jeffrey M. Larkin; John M. Levesque; Thomas A. Maier; Paul R. C. Kent; Eduardo F. D'Azevedo; Thomas C. Schulthess
Staggering computational and algorithmic advances in recent years now make possible systematic Quantum Monte Carlo (QMC) simulations of high-temperature (high-Tc) superconductivity in a microscopic model, the two-dimensional (2D) Hubbard model, with parameters relevant to the cuprate materials. Here we report the algorithmic and computational advances that enable us to study the effect of disorder and nano-scale inhomogeneities on pair formation and the superconducting transition temperature, which is necessary to understand real materials. The simulation code is written with a generic and extensible approach and is tuned to perform well at scale. Significant algorithmic improvements have been made to make effective use of current supercomputing architectures. By implementing delayed Monte Carlo updates and a mixed single-/double-precision mode, we are able to dramatically increase the efficiency of the code. On the Cray XT4 systems of the Oak Ridge National Laboratory (ORNL), for example, we currently run production jobs on 31 thousand processors and thereby routinely achieve a sustained performance that exceeds 200 TFlop/s. On a system with 49 thousand processors we achieved a sustained performance of 409 TFlop/s. We present a study of how random disorder in the effective Coulomb interaction strength affects the superconducting transition temperature in the Hubbard model.
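The delayed-update idea mentioned above can be illustrated in isolation: instead of applying each accepted rank-1 correction to a matrix immediately (a memory-bandwidth-bound operation), several corrections are buffered and applied together as a single rank-k product that maps to an efficient GEMM. The sketch below shows only this linear-algebra pattern; in the actual QMC algorithm later update vectors depend on earlier ones and require additional correction terms, which are omitted here.

import numpy as np

def apply_immediately(G, updates):
    # Baseline: one rank-1 (outer-product) update per accepted move.
    for a, b in updates:
        G += np.outer(a, b)
    return G

def apply_delayed(G, updates, k=32):
    # Buffer up to k update vectors, then apply them as one rank-k GEMM.
    buf_a, buf_b = [], []
    for a, b in updates:
        buf_a.append(a)
        buf_b.append(b)
        if len(buf_a) == k:
            G += np.column_stack(buf_a) @ np.vstack(buf_b)
            buf_a, buf_b = [], []
    if buf_a:
        G += np.column_stack(buf_a) @ np.vstack(buf_b)
    return G

# Both paths give the same result (up to floating-point ordering).
rng = np.random.default_rng(0)
n = 256
G0 = rng.standard_normal((n, n))
updates = [(rng.standard_normal(n), rng.standard_normal(n)) for _ in range(100)]
assert np.allclose(apply_immediately(G0.copy(), updates),
                   apply_delayed(G0.copy(), updates))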
Concurrency and Computation: Practice and Experience | 2018
Verónica G. Vergara Larrea; Michael J. Brim; Wayne Joubert; Swen Boehm; Matthew B. Baker; Oscar R. Hernandez; Sarp Oral; James A. Simmons; Don Maxwell
We measure and analyze the performance observed when running applications and benchmarks before and after the Meltdown and Spectre fixes were applied to the Cray supercomputers and supporting systems at the Oak Ridge Leadership Computing Facility (OLCF). Of particular interest is the effect of these fixes on applications selected from the OLCF portfolio when running at scale. This comprehensive study presents results from experiments run on the Titan, Eos, Cumulus, and Percival supercomputers at the OLCF. The results from this study are useful for HPC users running on Cray supercomputers and help to better understand the impact that these two vulnerabilities have on diverse HPC workloads at scale.
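One common way to summarize such before-and-after measurements (not necessarily the methodology used in the paper) is the geometric mean of per-benchmark runtime ratios, sketched below with invented application names and wall-clock times.

from math import prod

# Hypothetical wall-clock times in seconds, before and after the patches.
before = {"app_a": 120.0, "app_b": 340.0, "app_c": 75.0}
after  = {"app_a": 126.0, "app_b": 351.0, "app_c": 74.0}

ratios = [after[name] / before[name] for name in before]
geomean_slowdown = prod(ratios) ** (1.0 / len(ratios))
print(f"geometric-mean runtime ratio (after/before): {geomean_slowdown:.3f}")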
IEEE International Conference on High Performance Computing, Data, and Analytics | 2017
Verónica G. Vergara Larrea; Wayne Joubert; M. Berrill; Swen Boehm; Arnold N. Tharrington; Wael R. Elwasif; Don Maxwell
In preparation for Summit, Oak Ridge National Laboratory’s next-generation supercomputer, two IBM Power-based systems were deployed in late 2016 at the Oak Ridge Leadership Computing Facility (OLCF). This paper presents a detailed description of the acceptance of the first IBM Power-based early-access systems installed at the OLCF. The two systems, Summitdev and Tundra, contain IBM POWER8+ processors with NVIDIA Pascal GPUs and were acquired to provide researchers with a platform to optimize codes for the Power architecture. In addition, this paper presents early functional and performance results obtained on Summitdev with the latest software stack available.
Dependable Systems and Networks | 2015
Saurabh Gupta; Devesh Tiwari; Christopher Jantzi; James H. Rogers; Don Maxwell
Archive | 2010
Sarp Oral; Feiyi Wang; David A. Dillow; Ross Miller; Galen M. Shipman; Don Maxwell; Dave Henseler; Jeff Becklehimer; Jeff Larkin
Archive | 2010
Ross Miller; Jason J. Hill; David A. Dillow; Raghul Gunasekaran; Don Maxwell
Archive | 2014
Matthew A. Ezell; David A. Dillow; H. Sarp Oral; Feiyi Wang; Devesh Tiwari; Don Maxwell; Dustin B. Leverman; Jason J. Hill