
Publication


Featured research published by John A. Gunnels.


ACM Transactions on Mathematical Software | 2001

FLAME: Formal Linear Algebra Methods Environment

John A. Gunnels; Fred G. Gustavson; Greg Henry; Robert A. van de Geijn

Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the production of an abundance of anfractuous code that is difficult to maintain and almost impossible to upgrade. Having struggled with these issues for more than a decade, we have concluded that a solution is to apply a technique from theoretical computer science, formal derivation, to the development of high-performance linear algebra libraries. We think the resulting approach results in aesthetically pleasing, coherent code that greatly facilitates intelligent modularity and high performance while enhancing confidence in its correctness. Since the technique is language-independent, it lends itself equally well to a wide spectrum of programming languages (and paradigms) ranging from C and Fortran to C++ and Java. In this paper, we illustrate our observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures. This environment demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world. We present performance experiments on the Intel Pentium III processor that demonstrate that high performance can be attained by coding at a high level of abstraction.
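
As an illustration of the coding style the abstract argues for, here is a minimal sketch of forward substitution written in a FLAME-like partitioned fashion. It is not the actual FLAME or libflame API; the function name and comments are illustrative only, and numpy is assumed.

```python
# A minimal sketch (not the FLAME/libflame API) of the partitioned coding
# style the paper advocates: the algorithm sweeps a partitioning through the
# matrix instead of exposing raw index arithmetic to the reader.
import numpy as np

def trsv_lower(L, b):
    """Solve L x = b for lower-triangular L by forward substitution."""
    x = b.astype(float).copy()
    n = L.shape[0]
    for k in range(n):
        # Invariant: x[:k] already holds the final solution entries;
        # this step exposes row k and computes its entry.
        x[k] = (x[k] - L[k, :k] @ x[:k]) / L[k, k]
    return x

L = np.tril(np.random.rand(5, 5)) + 5 * np.eye(5)
b = np.random.rand(5)
print(np.allclose(L @ trsv_lower(L, b), b))  # True
```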


ACM Transactions on Mathematical Software | 2005

The science of deriving dense linear algebra algorithms

Paolo Bientinesi; John A. Gunnels; Margaret E. Myers; Enrique S. Quintana-Ortí; Robert A. van de Geijn

In this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation a constructive proof of the correctness of the algorithm is generated. The article is structured so that it can be used as a tutorial for novices. However, the method has been shown to yield new high-performance algorithms for well-studied linear algebra operations and should also be of interest to those who wish to produce best-in-class high-performance codes.
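
A generic example of the kind of loop invariant on which such a derivation rests (the article's own worked examples may differ). For the operation y := Ax + ŷ, where ŷ denotes the original contents of y and A is partitioned by row blocks:

```latex
% Loop invariant for y := A x + \hat{y}, with A and y partitioned into a
% top part (already processed) and a bottom part (not yet processed).
\[
\begin{pmatrix} y_T \\ y_B \end{pmatrix}
=
\begin{pmatrix} A_T x + \hat{y}_T \\ \hat{y}_B \end{pmatrix}
\]
% Each iteration exposes one more row block, updates the corresponding part
% of y, and moves the partitioning downward; when the bottom part is empty,
% the invariant reduces to y = A x + \hat{y}, which proves correctness.
```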


conference on high performance computing (supercomputing) | 2007

Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability

James N. Glosli; David F. Richards; Kyle Caspersen; Robert E. Rudd; John A. Gunnels; Frederick H. Streitz

We report the computational advances that have enabled the first micron-scale simulation of a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in three key areas for massively parallel computation such as on BlueGene/L (BG/L): fault tolerance, application kernel optimization, and highly efficient parallel I/O. In particular, we have developed novel capabilities for handling hardware parity errors and improving the speed of interatomic force calculations, while achieving near optimal I/O speeds on BG/L, allowing us to achieve excellent scalability and improve overall application performance. As a result we have successfully conducted a 2-billion atom KH simulation amounting to 2.8 CPU-millennia of run time, including a single, continuous simulation run in excess of 1.5 CPU-millennia. We have also conducted 9-billion and 62.5-billion atom KH simulations. The current optimized ddcMD code is benchmarked at 115.1 TFlop/s in our scaling study and 103.9 TFlop/s in a sustained science run, with additional improvements ongoing. These improvements enabled us to run the first MD simulations of micron-scale systems developing the KH instability.
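
The abstract does not spell out the parity-error handling mechanism, so the sketch below is hypothetical: one common pattern consistent with its description is to keep a periodic in-memory snapshot and roll back to it when an error is detected. The names run_with_rollback, advance, and error_detected are stand-ins, not ddcMD functions.

```python
# Hypothetical rollback-on-error loop; advance and error_detected stand in
# for the MD timestep kernel and the hardware parity-error check.
def run_with_rollback(state, n_steps, advance, error_detected,
                      snapshot_interval=100):
    saved_step, saved_state = 0, list(state)       # in-memory snapshot
    step = 0
    while step < n_steps:
        if error_detected():
            step, state = saved_step, list(saved_state)   # roll back, redo
            continue
        state = advance(state)                     # one (simulated) timestep
        step += 1
        if step % snapshot_interval == 0:
            saved_step, saved_state = step, list(state)   # refresh snapshot
    return state

# Toy usage: a single coordinate advances by 1 per step, no errors injected.
print(run_with_rollback([0.0], 250, lambda s: [x + 1 for x in s],
                        lambda: False))            # -> [250.0]
```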


acm symposium on parallel algorithms and architectures | 2007

An experimental comparison of cache-oblivious and cache-conscious programs

Kamen Yotov; Thomas Roeder; Keshav Pingali; John A. Gunnels; Fred G. Gustavson

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
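
As a rough sketch of the two styles being compared (not the highly tuned kernels from the study), the following contrasts a cache-oblivious divide-and-conquer matrix multiply with a simple cache-conscious blocked multiply whose block size must be chosen for a known cache; numpy is assumed.

```python
import numpy as np

def matmul_oblivious(A, B, C, base=64):
    """C += A @ B by recursively halving the largest dimension (cache-oblivious)."""
    m, k = A.shape
    n = B.shape[1]
    if max(m, n, k) <= base:
        C += A @ B                                   # base case: small enough
    elif m >= n and m >= k:                          # split rows of A and C
        matmul_oblivious(A[:m//2], B, C[:m//2], base)
        matmul_oblivious(A[m//2:], B, C[m//2:], base)
    elif n >= k:                                     # split columns of B and C
        matmul_oblivious(A, B[:, :n//2], C[:, :n//2], base)
        matmul_oblivious(A, B[:, n//2:], C[:, n//2:], base)
    else:                                            # split the inner dimension
        matmul_oblivious(A[:, :k//2], B[:k//2], C, base)
        matmul_oblivious(A[:, k//2:], B[k//2:], C, base)

def matmul_blocked(A, B, C, nb=64):
    """C += A @ B with an explicit block size tuned for a known cache (cache-conscious)."""
    m, k = A.shape
    n = B.shape[1]
    for i in range(0, m, nb):
        for p in range(0, k, nb):
            for j in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, p:p+nb] @ B[p:p+nb, j:j+nb]

A, B = np.random.rand(200, 150), np.random.rand(150, 180)
C1, C2 = np.zeros((200, 180)), np.zeros((200, 180))
matmul_oblivious(A, B, C1)
matmul_blocked(A, B, C2)
print(np.allclose(C1, A @ B), np.allclose(C2, A @ B))  # True True
```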


IBM Journal of Research and Development | 2005

Design and implementation of message-passing services for the Blue Gene/L supercomputer

George S. Almasi; Charles J. Archer; José G. Castaños; John A. Gunnels; C. Christopher Erway; Philip Heidelberger; Xavier Martorell; José E. Moreira; Kurt Walter Pinnow; Joe Ratterman; Burkhard Steinmacher-Burow; William Gropp; Brian R. Toonen

The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
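
The BG/L message layer itself is not reproduced here; as a generic stand-in, the sketch below is an mpi4py ping-pong microbenchmark of the kind used to measure point-to-point bandwidth. It uses only standard MPI calls and can be run with two ranks, e.g. mpirun -np 2 python pingpong.py.

```python
# Generic MPI ping-pong bandwidth microbenchmark (mpi4py); it does not use
# the BG/L message layer, only standard MPI point-to-point operations.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nbytes = 1 << 20                          # 1 MiB messages
buf = np.zeros(nbytes, dtype=np.uint8)
reps = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # two messages of nbytes move per iteration
    print("bandwidth: %.1f MB/s" % (2 * reps * nbytes / elapsed / 1e6))
```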


conference on high performance computing (supercomputing) | 2006

Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform

Francois Gygi; Erik W. Draeger; Martin Schulz; Bronis R. de Supinski; John A. Gunnels; Vernon Austel; James C. Sexton; Franz Franchetti; Stefan Kral; Christoph W. Ueberhuber; Juergen Lorenz

First-principles simulations of high-Z metallic systems using the Qbox code on the BlueGene/L supercomputer demonstrate unprecedented performance and scaling for a quantum simulation code. Specifically designed to take advantage of massively-parallel systems like BlueGene/L, Qbox demonstrates excellent parallel efficiency and peak performance. A sustained peak performance of 207.3 TFlop/s was measured on 65,536 nodes, corresponding to 56.5% of the theoretical full machine peak using all 128k CPUs.
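
The quoted 56.5%-of-peak figure can be sanity-checked against commonly quoted BlueGene/L parameters (700 MHz clock, 4 floating-point operations per cycle per core, 131,072 cores); these parameters are assumptions here, not stated in the abstract.

```latex
% Peak estimate under the assumed BG/L parameters, and the implied fraction.
\[
R_{\mathrm{peak}} \approx 0.7\,\mathrm{GHz}\times 4\,\frac{\mathrm{flops}}{\mathrm{cycle\cdot core}}\times 131{,}072\ \mathrm{cores}
\approx 367\ \mathrm{TFlop/s},
\qquad
\frac{207.3}{367} \approx 0.565 .
\]
```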


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
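
As a purely conceptual illustration of the predicated and gather-scatter vector operations the abstract attributes to the AMC computational elements (this is numpy, not the AMC toolchain or instruction set; all names below are illustrative):

```python
import numpy as np

memory = np.arange(16, dtype=np.float64)      # stand-in for DRAM in the cube
indices = np.array([0, 3, 7, 12])             # non-contiguous addresses
pred = np.array([True, False, True, True])    # per-element predicate mask

vec = memory[indices]                         # gather
vec = np.where(pred, vec * 2.0, vec)          # predicated multiply
memory[indices] = vec                         # scatter results back

print(memory[:13])   # elements 0, 7, 12 doubled; element 3 left unchanged
```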


conference on high performance computing (supercomputing) | 2005

Large-Scale First-Principles Molecular Dynamics simulations on the BlueGene/L Platform using the Qbox code

Francois Gygi; Robert Kim Yates; Juergen Lorenz; Erik W. Draeger; Franz Franchetti; Christoph W. Ueberhuber; Bronis R. de Supinski; Stefan Kral; John A. Gunnels; James C. Sexton

We demonstrate that the Qbox code supports unprecedented large-scale First-Principles Molecular Dynamics (FPMD) applications on the BlueGene/L supercomputer. Qbox is an FPMD implementation specifically designed for large-scale parallel platforms such as BlueGene/L. Strong scaling tests for a Materials Science application show an 86% scaling efficiency between 1024 and 32,768 CPUs. Measurements of performance by means of hardware counters show that 36% of the peak FPU performance can be attained.
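
For context, the quoted 86% strong-scaling efficiency over a 32-fold increase in CPU count implies a speedup of roughly 27.5x:

```latex
% Strong-scaling efficiency relative to the 1,024-CPU run.
\[
E = \frac{T_{1024}/T_{32768}}{32768/1024} = 0.86
\quad\Longrightarrow\quad
\frac{T_{1024}}{T_{32768}} \approx 27.5 .
\]
```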


international conference on supercomputing | 2009

MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations

Ahmad Faraj; Sameer Kumar; Brian E. Smith; Amith R. Mamidala; John A. Gunnels; Philip Heidelberger



IBM Journal of Research and Development | 2005

Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

Siddhartha Chatterjee; L. R. Bachega; Peter Bergner; K. A. Dockser; John A. Gunnels; Manish Gupta; Fred G. Gustavson; Christopher A. Lapkowski; G. K. Liu; Mark P. Mendell; Ravi Nair; C. D. Wait; T. J. C. Ward; Philip T. Wu


Collaboration


Dive into John A. Gunnels's collaborations.

Top Co-Authors

Erik W. Draeger

Lawrence Livermore National Laboratory


Greg Morrow

University of Texas at Austin
