
Publication


Featured research published by H. Carter Edwards.


Sandia Report | 2009

Improving performance via mini-applications

Michael A. Heroux; Douglas W. Doerfler; Paul S. Crozier; James M. Willenbring; H. Carter Edwards; Alan B. Williams; Mahesh Rajan; Eric R. Keiter; Heidi K. Thornquist; Robert W. Numrich

Application performance is determined by a combination of many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, we find that the use of mini-applications - small self-contained proxies for real applications - is an excellent approach for rapidly exploring the parameter space of all these choices. Furthermore, use of mini-applications enriches the interaction between application, library and computer system developers by providing explicit functioning software and concrete performance results that lead to detailed, focused discussions of design trade-offs, algorithm choices and runtime performance issues. In this paper we discuss a collection of mini-applications and demonstrate how we use them to analyze and improve application performance on new and future computer platforms.


Parallel, Distributed and Network-Based Processing | 2010

A Light-weight API for Portable Multicore Programming

Christopher G. Baker; Michael A. Heroux; H. Carter Edwards; Alan B. Williams

Multicore nodes have become ubiquitous in just a few years. At the same time, writing portable parallel software for multicore nodes is extremely challenging. Widely available programming models such as OpenMP and Pthreads are not useful for devices such as graphics cards, and more flexible programming models such as RapidMind are only available commercially. OpenCL represents the first truly portable standard, but its availability is limited. In the midst of this transition, we have developed a minimal application programming interface (API) for multicore nodes that allows us to write portable parallel linear algebra software that can use any of the aforementioned programming models and any future standard models. We utilize C++ template meta-programming to enable users to write parallel kernels that can be executed on a variety of node types, including Cell, GPUs, and multicore CPUs. Support for a parallel node is provided by implementing a Node object according to the requirements specified by the API. This ability to provide custom support for particular node types gives developers a level of control not allowed by the current slate of proprietary parallel programming APIs. We demonstrate implementations of the API for a simple vector dot-product on sequential CPU, multicore CPU, and GPU nodes.
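
The Node abstraction the abstract describes can be sketched in a few lines. The following is a minimal illustration of the pattern only, not the actual Trilinos API: the names (DotKernel, SerialNode, generate, reduce) are hypothetical, chosen to show how a kernel written once can be dispatched to any node implementation that satisfies the same compile-time contract.

```cpp
#include <cstddef>

// Hypothetical kernel functor in the style the paper describes: the body is
// written once and is agnostic to which node type executes it.
struct DotKernel {
  const double* x;
  const double* y;
  // Per-element contribution; the node implementation decides how elements
  // are partitioned among threads (or GPU work items).
  double generate(std::size_t i) const { return x[i] * y[i]; }
  // Reduction operator combining partial results.
  static double reduce(double a, double b) { return a + b; }
};

// Hypothetical sequential Node: fulfills the contract by running the kernel
// on one thread. A Pthreads, OpenMP, or CUDA node would provide the same
// member function with its own parallel implementation.
struct SerialNode {
  template <class Kernel>
  double parallel_reduce(std::size_t n, const Kernel& k) const {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
      acc = Kernel::reduce(acc, k.generate(i));
    return acc;
  }
};

// User code is templated on the node type, so the same dot product compiles
// against any node satisfying the API.
template <class Node>
double dot(const Node& node, std::size_t n, const double* x, const double* y) {
  DotKernel k{x, y};
  return node.parallel_reduce(n, k);
}
```

A GPU node would supply its own parallel_reduce with the same signature, which is the portability claim of the paper: the kernel body never changes, only the Node type does.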


Archive | 2003

The SIERRA Framework for Developing Advanced Parallel Mechanics Applications

James R. Stewart; H. Carter Edwards

SIERRA is an object-oriented computational framework currently under intense development at Sandia National Laboratories. The SIERRA framework is a set of software services and parallel data structures upon which many mechanics applications can be written. The main goal is to bring together distributed mesh management, field management (i.e., the variables), and mechanics algorithm support services to facilitate rapid code development and code reuse. The motivation behind SIERRA is that designing and implementing modern applications in a multiphysics, parallel environment is very difficult, and involves advanced computer science as well as engineering mechanics. By providing the common, mostly computer science related services in a framework, they can be developed once and then used by many different mechanics codes. Although SIERRA is written in C++, the mechanics codes using SIERRA can be written in any language such as Fortran or C.


Programming Models and Applications for Multicores and Manycores | 2012

Kokkos Array performance-portable manycore programming model

H. Carter Edwards; Daniel Sunderland

Large, complex scientific and engineering application codes have a significant investment in computational kernels that implement their mathematical models. Porting these computational kernels to multicore-CPU and manycore-accelerator (e.g., NVIDIA® GPU) devices is a major challenge given the diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach for implementing computational kernels that are performance-portable to multicore-CPU and manycore-accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel computational kernels, and (3) multidimensional arrays. Performance portability is achieved by decoupling computational kernels from device-specific data access performance requirements (e.g., NVIDIA coalesced memory access) through an intuitive multidimensional array API. The Kokkos Array API uses C++ template meta-programming to transparently insert device-optimal data access maps into computational kernels at compile time. With this programming model, computational kernels can be written once and, without modification, performance-portably compiled for multicore-CPU and manycore-accelerator devices.
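
For readers unfamiliar with the model, here is a minimal sketch of the "write once, compile anywhere" idea in code. It uses the API of present-day Kokkos, the descendant of the Kokkos Array library described here; the original 2012 API differed in detail, so treat this as illustrative rather than as the paper's exact interface.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Multidimensional arrays allocated in the default device's memory
    // space; the device-optimal index-to-memory map is chosen at compile
    // time through the View's template parameters.
    Kokkos::View<double*> x("x", n), y("y", n);

    // Data-parallel kernel written once; the same source compiles unchanged
    // for OpenMP, CUDA, or any other enabled backend.
    Kokkos::parallel_for("axpy_init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0 * x(i) + 1.0;
    });
  }
  Kokkos::finalize();
  return 0;
}
```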


Scientific Programming | 2012

Manycore performance-portability: Kokkos multidimensional array library

H. Carter Edwards; Daniel Sunderland; Vicki L. Porter; Chris Amsler; Sam Mish

Large, complex scientific and engineering application codes have a significant investment in computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
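
The separation of data access pattern from kernel code can be made concrete with array layouts. The sketch below uses modern Kokkos (which grew out of this library) and its LayoutLeft/LayoutRight types; the original Kokkos Array API spelled this differently, so the code approximates the concept, not the 2012 interface.

```cpp
#include <Kokkos_Core.hpp>

// The kernel body only ever writes A(i, j); it never encodes a memory
// layout. On a CPU, row-major (LayoutRight) keeps each thread's row
// contiguous; on an NVIDIA GPU, column-major (LayoutLeft) lets consecutive
// threads touch consecutive addresses (coalesced access). Kokkos selects
// the device-appropriate default, and the kernel never changes.
template <class Matrix, class Vector>
void row_sums(const Matrix& A, const Vector& s) {
  Kokkos::parallel_for("row_sums", A.extent(0), KOKKOS_LAMBDA(const int i) {
    double sum = 0.0;
    for (int j = 0; j < (int)A.extent(1); ++j) sum += A(i, j);
    s(i) = sum;
  });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // The layout can also be requested explicitly; the same kernel runs on
    // either layout, which is the "device-specific data access mapping":
    Kokkos::View<double**, Kokkos::LayoutLeft> A("A", 1000, 64);
    Kokkos::View<double*> s("s", 1000);
    row_sums(A, s);
  }
  Kokkos::finalize();
  return 0;
}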


Engineering With Computers | 2006

Managing complexity in massively parallel, adaptive, multiphysics applications

H. Carter Edwards

A new generation of scientific and engineering applications is being developed to support multiple coupled physics, adaptive meshes, and scaling in massively parallel environments. The capabilities required to support multiphysics, adaptivity, and massively parallel execution are individually complex and are especially challenging to integrate within a single application. Sandia National Laboratories has managed this challenge by consolidating these complex, physics-independent capabilities into the Sierra Framework, which is shared among a diverse set of application codes. The success of the Sierra Framework has been predicated on managing the integration of complex capabilities through a conceptual model based upon formal mathematical abstractions. Set theory is used to express and analyze the data structures, operations, and interactions of these complex capabilities. This mathematically based, conceptual modeling approach to managing complexity is not specific to the Sierra Framework; it is generally applicable to any scientific and engineering application framework.
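
To give a flavor of what "formal mathematical abstractions" means here, the following is a notional set-theoretic sketch in the spirit the abstract describes; the symbols and relations are illustrative, not reproduced from the paper.

```latex
% Notional sketch (not the paper's exact formalism): a distributed mesh as
% a set of entities partitioned over processors, mesh parts as subsets, and
% fields as functions defined only on the parts they are registered on.
\begin{align*}
  M &= \bigsqcup_{p \in P} M_p
      && \text{mesh entities partitioned over processors } P \\
  \Omega_k &\subseteq M
      && \text{a mesh part, e.g.\ a material block or boundary surface} \\
  f_k &: \Omega_k \to \mathbb{R}^{n_k}
      && \text{a field variable defined on part } \Omega_k \\
  \Omega_i \cap \Omega_j &\neq \varnothing
      && \text{overlapping parts are where coupled physics interact}
\end{align*}
```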


IEEE Transactions on Parallel and Distributed Systems | 2017

Trends in Data Locality Abstractions for HPC Systems

Didem Unat; Anshu Dubey; Torsten Hoefler; John Shalf; Mark James Abraham; Mauro Bianco; Bradford L. Chamberlain; Romain Cledat; H. Carter Edwards; Hal Finkel; Karl Fuerlinger; Frank Hannig; Emmanuel Jeannot; Amir Kamil; Jeff Keasler; Paul H. J. Kelly; Vitus J. Leung; Hatem Ltaief; Naoya Maruyama; Chris J. Newburn; Miquel Pericàs

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages, and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.


International Journal of Computer Mathematics | 2014

Exploring emerging manycore architectures for uncertainty quantification through embedded stochastic Galerkin methods

Eric Todd Phipps; H. Carter Edwards; Jonathan Joseph Hu; Jakob T. Ostien

We explore approaches for improving the performance of intrusive or embedded stochastic Galerkin uncertainty quantification methods on emerging computational architectures. Our work is motivated by the trend of increasing disparity between floating-point throughput and memory access speed on emerging architectures, which requires the design of new algorithms with memory access patterns more commensurate with computer architecture capabilities. We first compare the traditional approach for implementing stochastic Galerkin methods to non-intrusive spectral projection methods employing high-dimensional sparse quadratures on relevant problems from computational mechanics, and demonstrate that the performance of stochastic Galerkin methods is reasonable. Several reorganizations of the algorithm with improved memory access patterns are described and their performance measured on contemporary manycore architectures. We demonstrate that these reorganizations can lead to improved performance for the matrix–vector products needed by iterative linear system solvers, and highlight further algorithm research that might lead to even greater performance.
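
For context, the textbook form of the method is below; this is the standard stochastic Galerkin formulation, not an excerpt from the paper. The coupled block system it produces is what makes the matrix–vector products mentioned above the performance-critical kernel.

```latex
% Standard stochastic Galerkin setup. The solution is expanded in
% polynomial chaos basis functions \psi_i of the random variables \xi:
u(x,\xi) \approx \sum_{i=0}^{P} u_i(x)\,\psi_i(\xi)
% Galerkin projection of A(\xi)\,u = b(\xi) onto each \psi_i yields one
% coupled block system over all coefficients u_j; its block matrix-vector
% products dominate iterative solve time, which is what the memory-access
% reorganizations in the paper target:
\sum_{j=0}^{P} \bigl\langle \psi_i\,\psi_j\,A(\xi) \bigr\rangle\, u_j
  = \bigl\langle b(\xi)\,\psi_i(\xi) \bigr\rangle ,
  \qquad i = 0,\dots,P
```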


arXiv: Mathematical Software | 2016

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Kyungjoo Kim; Sivasankaran Rajamanickam; George Stelle; H. Carter Edwards; Stephen L. Olivier

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization; these tasks are inter-related through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandy Bridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate the merits of the proposed task-based factorization. Experimental results demonstrate that, using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems, our task-parallel implementation delivers about a 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and a 19.2x speedup over a serial Cholesky implementation that carries no tasking overhead.
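
The "algorithms-by-blocks" structure can be sketched as the classical right-looking blocked Cholesky loop nest below. This is a generic dense-block illustration with stubbed kernels, not the paper's sparse 2D partitioned-block code or its Kokkos tasking API; the point is that the read/write sets of the four block kernels, noted in the comments, are exactly the dependences that induce the task graph.

```cpp
#include <vector>

struct Block { /* a tile of the matrix, e.g. column-major dense storage */ };

// Stub block kernels standing in for BLAS/LAPACK-style tile operations.
void potrf(Block&)                             { /* Cholesky-factor the tile */ }
void trsm (const Block&, Block&)               { /* triangular solve against it */ }
void syrk (const Block&, Block&)               { /* symmetric rank-k update */ }
void gemm (const Block&, const Block&, Block&) { /* general block update */ }

// Right-looking Cholesky-by-blocks over an nb x nb grid of tiles. Run
// serially here; run as tasks, the commented dependences define the DAG.
void chol_by_blocks(std::vector<std::vector<Block>>& A, int nb) {
  for (int k = 0; k < nb; ++k) {
    potrf(A[k][k]);                      // POTRF(k): after all updates to A[k][k]
    for (int i = k + 1; i < nb; ++i)
      trsm(A[k][k], A[i][k]);            // TRSM(i,k): after POTRF(k)
    for (int i = k + 1; i < nb; ++i) {
      syrk(A[i][k], A[i][i]);            // SYRK(i,k): after TRSM(i,k)
      for (int j = k + 1; j < i; ++j)
        gemm(A[i][k], A[j][k], A[i][j]); // GEMM(i,j,k): after TRSM(i,k), TRSM(j,k)
    }
  }
}
```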


International Conference on Cluster Computing | 2011

Multicore/GPGPU Portable Computational Kernels via Multidimensional Arrays

H. Carter Edwards; Daniel Sunderland; Chris Amsler; Sam Mish

Large, complex scientific and engineering application codes have a significant investment in computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides a library-based approach to implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exist one or more manycore compute devices, each with its own memory space, (2) data-parallel kernels are executed via parallel-for and parallel-reduce operations, and (3) kernels operate on multidimensional arrays. Kernel execution performance is, especially for NVIDIA® GPGPU devices, extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model supports performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API, device-specific mappings of multi-indices to device memory are introduced into a computational kernel through compile-time polymorphism, i.e., without modification of the kernel.
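
As a concrete example of such a kernel, here is a dot product written against present-day Kokkos (the successor of the Trilinos-Kokkos array package described here; the 2011 API differed in detail). The kernel body mentions no device: the View's compile-time layout supplies the multi-index-to-memory map, and parallel_reduce dispatches to whichever backend was enabled at build time.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);  // fill on the device
    Kokkos::deep_copy(y, 2.0);

    // parallel_reduce: written once, run on the enabled backend. Each
    // thread accumulates into its private 'partial'; Kokkos combines the
    // partials into 'result'.
    double result = 0.0;
    Kokkos::parallel_reduce("dot", n,
      KOKKOS_LAMBDA(const int i, double& partial) { partial += x(i) * y(i); },
      result);
  }
  Kokkos::finalize();
  return 0;
}
```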

Collaboration


Dive into H. Carter Edwards's collaborations.

Top Co-Authors

Daniel Sunderland (Sandia National Laboratories)
Alan B. Williams (Sandia National Laboratories)
Chris Amsler (Kansas State University)
James R. Stewart (Sandia National Laboratories)
Kyungjoo Kim (Sandia National Laboratories)
Michael A. Heroux (Sandia National Laboratories)
Sam Mish (California State University)
Stephen L. Olivier (Sandia National Laboratories)