
Publications


Featured research published by Richard F. Barrett.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Abstract machine models and proxy architectures for exascale computing

James A. Ang; Richard F. Barrett; R.E. Benner; D. Burke; Cy P. Chan; Jeanine Cook; David Donofrio; Simon D. Hammond; Karl Scott Hemmert; Suzanne M. Kelly; H. Le; Vitus J. Leung; David Resnick; Arun Rodrigues; John Shalf; Dylan T. Stark; Didem Unat; Nicholas J. Wright

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they support application performance analysis, system software development, and the identification of hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
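The notion of a parameterized proxy architecture can be made concrete with a small sketch. The C++ fragment below is illustrative only and is not taken from the paper; the struct, parameter names, and numbers are assumptions, used to show how a handful of machine parameters (core count, per-core peak rate, memory and injection bandwidth) can feed a crude roofline-style performance bound.

```cpp
// Hypothetical sketch: a proxy architecture as a parameterized abstract
// machine model. Field names and numbers are illustrative assumptions.
#include <cstdio>

struct ProxyArchitecture {
    int    cores_per_node;   // homogeneous cores in this variant
    double core_flops;       // peak FLOP/s per core
    double mem_bandwidth;    // bytes/s of on-node memory bandwidth
    double nic_bandwidth;    // bytes/s of network injection bandwidth
};

// A crude roofline-style bound: attainable FLOP/s for a kernel with a given
// arithmetic intensity (FLOPs per byte moved from memory).
double attainable_flops(const ProxyArchitecture& a, double intensity) {
    double peak = a.cores_per_node * a.core_flops;
    double bw_bound = a.mem_bandwidth * intensity;
    return peak < bw_bound ? peak : bw_bound;
}

int main() {
    ProxyArchitecture node{64, 16e9, 400e9, 25e9};   // made-up node parameters
    std::printf("stencil-like kernel (0.2 flops/byte): %.2e FLOP/s\n",
                attainable_flops(node, 0.2));
    std::printf("dense kernel (10 flops/byte):         %.2e FLOP/s\n",
                attainable_flops(node, 10.0));
}
```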


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Investigating the Impact of the Cielo Cray XE6 Architecture on Scientific Application Codes

Mahesh Rajan; Richard F. Barrett; Douglas Doerfler; Kevin Pedretti

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus providing a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, supplemented with micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.
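For context, the kind of micro-benchmark referred to above can be as simple as an MPI ping-pong test between two ranks. The sketch below is not one of the benchmarks used in the study; the message size and iteration count are arbitrary choices for illustration (run with at least two ranks).

```cpp
// Minimal ping-pong sketch for point-to-point bandwidth between ranks 0 and 1.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int nbytes = 1 << 20;                 // 1 MiB messages
    std::vector<char> buf(nbytes, 0);

    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    const double elapsed = MPI_Wtime() - t0;

    if (rank == 0)                              // bytes moved both ways
        std::printf("bandwidth: %.2f GB/s\n",
                    2.0 * nbytes * iters / elapsed / 1e9);
    MPI_Finalize();
    return 0;
}
```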


Future Generation Computer Systems | 2014

Exascale design space exploration and co-design

Sudip S. Dosanjh; Richard F. Barrett; Douglas Doerfler; Simon D. Hammond; Karl Scott Hemmert; Michael A. Heroux; Paul Lin; Kevin Pedretti; Arun Rodrigues; Tim Trucano; Justin Luitjens

The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libraries. Mini-applications, which attempt to capture some key performance issues, can potentially reduce the order of the exploration by a factor of a thousand. However, we need to carefully understand how representative mini-applications are of the full application code. This paper describes a methodology for this comparison and applies it to a particularly challenging mini-application. A multi-faceted methodology for design space exploration is also described that includes measurements on advanced architecture testbeds, experiments that use supercomputers and system software to emulate future hardware, and hardware/software co-simulation tools to predict the behavior of applications on hardware that does not yet exist.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Poster: mini-applications: vehicles for co-design

Richard F. Barrett; Michael A. Heroux; Paul Lin; Alan B. Williams

Application performance is determined by a combination of many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, we find that the use of mini-applications - small self-contained proxies for real applications - is an excellent approach for rapidly exploring the parameter space of all these choices. Furthermore, use of mini-applications enriches the interaction between application, library and computer system developers by providing explicit functioning software and concrete performance results that lead to detailed, focused discussions of design trade-offs, algorithm choices and runtime performance issues. In this poster we discuss a collection of mini-applications and demonstrate how we use them to analyze and improve application performance on new and future computer platforms.
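As an illustration of what such a proxy isolates, the fragment below shows a compressed-row sparse matrix-vector product, the kind of small, self-contained kernel found in implicit finite-element mini-apps. It is a hedged sketch, not code from the poster; the CrsMatrix layout is a common convention assumed here.

```cpp
// Compressed row storage (CRS) sparse matrix-vector product: y = A * x.
#include <vector>
#include <cstddef>

struct CrsMatrix {
    std::vector<double>      vals;     // nonzero values
    std::vector<int>         cols;     // column index of each nonzero
    std::vector<std::size_t> row_ptr;  // start of each row in vals/cols
};

void spmv(const CrsMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    const std::size_t nrows = A.row_ptr.size() - 1;
    for (std::size_t r = 0; r < nrows; ++r) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            sum += A.vals[k] * x[A.cols[k]];
        y[r] = sum;
    }
}
```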


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Navigating an Evolutionary Fast Path to Exascale

Richard F. Barrett; Simon D. Hammond; Douglas Doerfler; Michael A. Heroux; Justin Luitjens; Duncan Roweth

The computing community is in the midst of a disruptive architectural change. The advent of manycore and heterogeneous computing nodes forces us to reconsider every aspect of the system software and application stack. To address this challenge there is a broad spectrum of approaches, which we roughly classify as either revolutionary or evolutionary. With the former, the entire code base is re-written, perhaps using a new programming language or execution model. The latter, which is the focus of this work, seeks a piecewise path of effective incremental change. The end effect of our approach will be revolutionary in that the control structure of the application will be markedly different in order to utilize single-instruction multiple-data/thread (SIMD/SIMT), manycore and heterogeneous nodes, but the physics code fragments will be remarkably similar. Our approach is guided by a set of mission driven applications and their proxies, focused on balancing performance potential with the realities of existing application code bases. Although the specifics of this process have not yet converged, we find that there are several important steps that developers of scientific and engineering application programs can take to prepare for making effective use of these challenging platforms. Aiding an evolutionary approach is the recognition that the performance potential of the architectures is, in a meaningful sense, an extension of existing capabilities: vectorization, threading, and a re-visiting of node interconnect capabilities. Therefore, as architectures, programming models, and programming mechanisms continue to evolve, the preparations described herein will provide significant performance benefits on existing and emerging architectures.
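A minimal sketch of the evolutionary step described above, assuming OpenMP as the threading and SIMD mechanism: an existing loop is kept intact and a single directive is added so the compiler can thread and vectorize it. The kernel is hypothetical, not taken from the paper.

```cpp
// An existing physics-style inner loop (here, y = y + a*x) left unchanged
// except for one OpenMP directive that threads it across cores and requests
// SIMD lanes within each thread.
#include <vector>
#include <cstddef>

void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t n = x.size();
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```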


2014 Workshop on Exascale MPI at Supercomputing Conference | 2014

Early experiences co-scheduling work and communication tasks for hybrid MPI+X applications

Dylan T. Stark; Richard F. Barrett; Ryan E. Grant; Stephen L. Olivier; Kevin Pedretti

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increases in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication an increasingly critical aspect of future extreme-scale application performance. But the use of massive multithreading in combination with MPI is an open research area, with many proposed approaches requiring code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension of a massive multithreading runtime system supporting dynamic parallelism to interface with MPI to handle fine-grain parallel communication and communication-computation overlap. Our initial evaluation of the approach uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example that has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.
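The MPI-level idea behind the overlap can be sketched as follows, assuming a 1D domain decomposition and a simple averaging stencil rather than the paper's 3D halo exchange and threading runtime; neighbor ranks, tags, and the ghost-cell layout are illustrative assumptions.

```cpp
// Overlap halo exchange with interior computation using nonblocking MPI.
// u[0] and u[n-1] are ghost cells; left/right may be MPI_PROC_NULL.
#include <mpi.h>
#include <vector>

void exchange_and_update(std::vector<double>& u, std::vector<double>& unew,
                         int left, int right, MPI_Comm comm) {
    const int n = static_cast<int>(u.size());   // assume n >= 4
    MPI_Request req[4];

    // Post receives into the ghost cells and sends of the boundary cells.
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[n - 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[n - 2], 1, MPI_DOUBLE, right, 0, comm, &req[3]);

    // Interior points do not depend on the ghost cells: compute them while
    // the messages are in flight.
    for (int i = 2; i < n - 2; ++i)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    // Wait for the halos, then finish the two boundary points.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[1]     = 0.5 * (u[0] + u[2]);
    unew[n - 2] = 0.5 * (u[n - 3] + u[n - 1]);
}
```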


Journal of Parallel and Distributed Computing | 2015

Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications

Richard F. Barrett; Paul S. Crozier; Douglas Doerfler; Michael A. Heroux; Paul Lin; Heidi K. Thornquist; Tim Trucano

Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of miniapps has been created to serve as proxies for full-scale applications. Each miniapp is designed to represent a key performance characteristic that does or is expected to significantly impact the runtime performance of an application program. In this paper we introduce a methodology for assessing the ability of these miniapps to effectively represent these performance issues. We applied this methodology to three miniapps, examining the linkage between them and an application they are intended to represent. Herein we evaluate the fidelity of that linkage. This work represents the initial steps required to begin to answer the question, "Under what conditions does a miniapp represent a key performance characteristic in a full app?" Proxies are being used to examine the performance of full application codes. We present a methodology for showing the link between full application codes and their proxies. We demonstrate this methodology using four applications and their proxies.


Programming Models and Applications for Multicores and Manycores | 2015

Toward an evolutionary task parallel integrated MPI + X programming model

Richard F. Barrett; Dylan T. Stark; Ryan E. Grant; Stephen L. Olivier; Kevin Pedretti

The Bulk Synchronous Parallel programming model is showing performance limitations at high processor counts. We propose over-decomposition of the domain, operated on as tasks, to smooth out utilization of the computing resource, in particular the node interconnect and processing cores, and to hide intra- and inter-node data movement. Our approach maintains the existing coding style commonly employed in computational science and engineering applications. Although we show improved performance on existing computers, up to 131,072 processor cores, the effectiveness of this approach on expected future architectures will require the continued evolution of capabilities throughout the co-design stack. Success then will not only result in decreased time to solution, but will also make better use of the hardware capabilities and reduce power and energy requirements, while fundamentally maintaining the current code configuration strategy.
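A minimal sketch of over-decomposition, assuming OpenMP tasks as the "X" in MPI+X; the block count, kernel, and scheduling are illustrative assumptions rather than the paper's implementation. The local domain is split into many more blocks than cores so the runtime can keep cores busy while some blocks wait on data movement.

```cpp
// Over-decompose a local domain into num_blocks tasks (num_blocks >> cores).
#include <vector>
#include <algorithm>
#include <cstddef>

void update_block(std::vector<double>& u, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        u[i] *= 1.0001;                          // stand-in for a physics update
}

void over_decomposed_sweep(std::vector<double>& u, std::size_t num_blocks) {
    const std::size_t block = (u.size() + num_blocks - 1) / num_blocks;
    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t b = 0; b < num_blocks; ++b) {
            std::size_t begin = b * block;
            std::size_t end   = std::min(begin + block, u.size());
            // Each block becomes a task; all tasks finish at the implicit
            // barrier that ends the parallel region.
            #pragma omp task firstprivate(begin, end) shared(u)
            update_block(u, begin, end);
        }
    }
}
```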


Application-Specific Systems, Architectures and Processors | 2013

GPU acceleration of Data Assembly in Finite Element Methods and its energy implications

Li Tang; X. Sharon Hu; Danny Z. Chen; Michael Niemier; Richard F. Barrett; Simon D. Hammond; Genie Hsieh

The Finite Element Method (FEM) is a numerical technique widely used in finding approximate solutions for many scientific and engineering problems. The Data Assembly (DA) stage in FEM can take up to 50% of the total FEM execution time. Accelerating DA with Graphics Processing Units (GPUs) presents challenges due to DA's mixed compute-intensive and memory-intensive workloads. This paper uses a representative finite element mini-application to explore DA acceleration on CPU+GPU platforms. Implementations based on different thread, kernel and task design approaches are developed and compared. Their performance and energy consumption are measured on four CPU+GPU and two CPU-only platforms. The results show that (i) the performance and energy for different implementations on the same platform can vary significantly but the performance and energy trends are the same, and (ii) there exist performance and energy tradeoffs across some platforms if the best implementation is chosen for each of the platforms.
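To make the DA pattern concrete, the sketch below shows per-element contributions being computed and then scattered into a global vector, with atomics guarding nodes shared by neighboring elements. This is only a CPU/OpenMP analog of the GPU thread, kernel, and task designs compared in the paper; the element type, connectivity, and element kernel are hypothetical.

```cpp
// Finite element data assembly: compute a local contribution per element,
// then scatter it into the shared global right-hand-side vector.
#include <vector>
#include <array>
#include <cstddef>

struct Element {
    std::array<int, 4> nodes;                    // global node ids (tet element)
};

void assemble(const std::vector<Element>& elems, std::vector<double>& rhs) {
    #pragma omp parallel for
    for (std::size_t e = 0; e < elems.size(); ++e) {
        // Compute-intensive part: the local element contribution.
        std::array<double, 4> local;
        for (int a = 0; a < 4; ++a)
            local[a] = 0.25;                     // stand-in for real quadrature

        // Memory-intensive part: scatter into the global vector. Atomics
        // avoid races where elements share nodes; GPU variants face the same
        // conflict and typically use atomics or graph coloring.
        for (int a = 0; a < 4; ++a) {
            #pragma omp atomic
            rhs[elems[e].nodes[a]] += local[a];
        }
    }
}
```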


Concurrency and Computation: Practice and Experience | 2012

Application-driven analysis of two generations of capability computing: the transition to multicore processors

Mahesh Rajan; Douglas Doerfler; Richard F. Barrett; Paul Lin; Kevin Pedretti; K. Scott Hemmert

Multicore processors form the basis of most traditional high performance parallel processing architectures. Early experiences with these computers showed significant performance problems, both with regard to computation and inter-process communication. The transition from Purple, an IBM POWER5-based machine, to Cielo, a Cray XE6, as the main capability computing platform for the United States Department of Energy's Advanced Simulation and Computing campaign provides an opportunity to reexamine these issues after experiences with a few generations of multicore-based machines. Experiences with Purple identified some important characteristics that led to strong performance of complex scientific application programs at very large scales. Herein, we compare the performance of some Advanced Simulation and Computing mission-critical applications at capability scale across this transition to multicore processors.

Collaboration


Dive into Richard F. Barrett's collaborations.

Top Co-Authors

Michael A. Heroux, Sandia National Laboratories
Simon D. Hammond, Sandia National Laboratories
Douglas Doerfler, Sandia National Laboratories
Kevin Pedretti, Sandia National Laboratories
Paul Lin, Sandia National Laboratories
Mahesh Rajan, Sandia National Laboratories
Li Tang, University of Notre Dame
Arun Rodrigues, Sandia National Laboratories
Dylan T. Stark, Sandia National Laboratories