Daniel Sunderland
Sandia National Laboratories
Publications
Featured research published by Daniel Sunderland.
programming models and applications for multicores and manycores | 2012
H. Carter Edwards; Daniel Sunderland
Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to multicore-CPU and manycore-accelerator (e.g., NVIDIA® GPU) devices is a major challenge given the diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach for implementing computational kernels that are performance-portable to multicore-CPU and manycore-accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel computational kernels, and (3) multidimensional arrays. Performance portability is achieved by decoupling computational kernels from device-specific data access performance requirements (e.g., NVIDIA coalesced memory access) through an intuitive multidimensional array API. The Kokkos Array API uses C++ template metaprogramming to transparently insert device-optimal data access maps into computational kernels at compile time. With this programming model, computational kernels can be written once and, without modification, compiled performance-portably for multicore-CPU and manycore-accelerator devices.
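A minimal sketch of the "write once" kernel style described above, using the present-day Kokkos API (the successor of the Kokkos Array model in this paper) rather than the 2012 API; the array names and sizes are illustrative. The kernel body is written once against a multidimensional View, and the library selects the device-appropriate data layout at compile time.

```cpp
// Sketch of a performance-portable kernel in the style described above:
// the kernel indexes a multidimensional array and never mentions the device.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000, M = 8;
    // Two-dimensional array; the memory layout is chosen per device
    // by the default execution space at compile time.
    Kokkos::View<double**> x("x", N, M);
    Kokkos::View<double*>  y("y", N);

    // Data-parallel kernel: one independent iteration per row.
    Kokkos::parallel_for("row_sum", N, KOKKOS_LAMBDA(const int i) {
      double s = 0.0;
      for (int j = 0; j < M; ++j) s += x(i, j);
      y(i) = s;
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```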
Scientific Programming | 2012
H. Carter Edwards; Daniel Sunderland; Vicki L. Porter; Chris Amsler; Sam Mish
Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach for implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data-parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
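As a hedged illustration of points (1) and (2) above, separating data access patterns from the kernel and introducing device-specific mappings at compile time, the following sketch uses the present-day Kokkos API rather than the 2012 Kokkos Array API; the function and view names are illustrative.

```cpp
// The kernel body indexes x(i, j) and never names a layout; the View type
// supplies the multi-index-to-memory mapping at compile time.
#include <Kokkos_Core.hpp>

template <class ViewType>
void scale_rows(const ViewType& x, double a) {
  using exec_space = typename ViewType::execution_space;
  const int ncol = static_cast<int>(x.extent(1));
  Kokkos::parallel_for(
      "scale_rows", Kokkos::RangePolicy<exec_space>(0, x.extent(0)),
      KOKKOS_LAMBDA(const int i) {
        for (int j = 0; j < ncol; ++j) x(i, j) *= a;
      });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Same kernel, two layouts: rows contiguous (typical for CPU caching)
    // versus columns contiguous (coalesced access for GPU threads).
    Kokkos::View<double**, Kokkos::LayoutRight> xr("xr", 1000, 16);
    Kokkos::View<double**, Kokkos::LayoutLeft>  xl("xl", 1000, 16);
    scale_rows(xr, 2.0);
    scale_rows(xl, 2.0);
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```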
international parallel and distributed processing symposium | 2016
Alan Humphrey; Daniel Sunderland; Todd Harman; Martin Berzins
Modeling thermal radiation is computationally challenging in parallel due to its all-to-all physical and resulting computational connectivity, and it is also the dominant mode of heat transfer in practical applications such as the next-generation clean coal boilers being modeled by the Uintah framework. However, a direct all-to-all treatment of radiation is prohibitively expensive on large computer systems, whether homogeneous or heterogeneous. DOE Titan and the planned DOE Summit and Sierra machines are examples of current and emerging GPU-based heterogeneous systems where the increased processing capability of GPUs over CPUs exacerbates this problem. These systems require that computational frameworks like Uintah leverage an arbitrary number of on-node GPUs, while simultaneously utilizing thousands of GPUs within a single simulation. We show that radiative heat transfer problems can be made to scale within Uintah on heterogeneous systems through a combination of reverse Monte Carlo ray tracing (RMCRT) techniques and adaptive mesh refinement (AMR), which reduces the amount of global communication. In particular, significant Uintah infrastructure changes, including a novel lock- and contention-free, thread-scalable data structure for managing MPI communication requests and improved memory allocation strategies, were necessary to achieve excellent strong-scaling results to 16,384 GPUs on Titan.
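The lock- and contention-free request management mentioned above is only named in the abstract, not specified. As a loose illustration only, and not Uintah's actual data structure, a thread-scalable pattern for collecting MPI requests without locks might look like the following sketch.

```cpp
// Illustrative sketch only: not Uintah's implementation. Each worker thread
// claims a slot with a single atomic fetch_add, so no lock is taken and
// threads never contend on a shared list node; one thread later waits on
// everything that was posted.
#include <mpi.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class RequestPool {
 public:
  explicit RequestPool(std::size_t capacity)
      : requests_(capacity), count_(0) {}

  // Reserve a slot for an MPI_Request; safe to call from many threads at once.
  // Returns nullptr if the pool is full.
  MPI_Request* claim() {
    const std::size_t i = count_.fetch_add(1, std::memory_order_relaxed);
    return (i < requests_.size()) ? &requests_[i] : nullptr;
  }

  // Wait on everything posted so far; intended to be called by one thread.
  void wait_all() {
    const std::size_t n = std::min(count_.load(), requests_.size());
    MPI_Waitall(static_cast<int>(n), requests_.data(), MPI_STATUSES_IGNORE);
    count_.store(0);
  }

 private:
  std::vector<MPI_Request> requests_;
  std::atomic<std::size_t> count_;
};
```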
international conference on cluster computing | 2011
H. Carter Edwards; Daniel Sunderland; Chris Amsler; Sam Mish
Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides a library-based approach for implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exist one or more manycore compute devices, each with its own memory space, (2) data-parallel kernels are executed via parallel-for and parallel-reduce operations, and (3) kernels operate on multidimensional arrays. Kernel execution performance is, especially for NVIDIA® GPGPU devices, extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model supports performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API, device-specific mappings of multi-indices to device memory are introduced into a computational kernel through compile-time polymorphism, i.e., without modification of the kernel.
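As a brief sketch of the two data-parallel operations named above, parallel-for and parallel-reduce, the following uses the present-day Kokkos API rather than the 2011 Trilinos-Kokkos array API; names and sizes are illustrative.

```cpp
// parallel_for fills an array with one iteration per entry; parallel_reduce
// combines per-iteration contributions into a single host-side result.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    Kokkos::View<double*> x("x", N);

    // Data-parallel fill.
    Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0 / (i + 1.0);
    });

    // Data-parallel sum reduction.
    double sum = 0.0;
    Kokkos::parallel_reduce("sum", N,
        KOKKOS_LAMBDA(const int i, double& local) { local += x(i); }, sum);

    std::printf("sum = %f\n", sum);
  }
  Kokkos::finalize();
}
```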
Proceedings of the Second International Workshop on Extreme Scale Programming Models and Middleware | 2016
Daniel Sunderland; Brad Peterson; John A. Schmidt; Alan Humphrey; Jeremy Thornock; Martin Berzins
The current diversity in nodal parallel computer architectures is seen in machines based upon multicore CPUs, GPUs, and the Intel Xeon Phi. A class of approaches for enabling scalability of complex applications on such architectures is based upon asynchronous many-task software architectures, such as that in the Uintah framework used for the parallel solution of solid and fluid mechanics problems. Uintah has both an applications layer with its own programming model and a separate runtime system. While Uintah scales well today, it is necessary to address nodal performance portability in order for it to continue to do so. Prototyping experiments that incrementally modify Uintah to use the Kokkos performance portability library improve kernel performance by more than a factor of two.
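As a hypothetical illustration of the kind of incremental port described, a triple-nested cell loop rewritten as a single Kokkos dispatch over a 3D index range might look like the following; the field type and stencil are illustrative, not Uintah code.

```cpp
// One parallel dispatch over a 3D range of interior cells, so the same
// source can target CPU threads or a GPU backend.
#include <Kokkos_Core.hpp>

// Hypothetical field type standing in for a per-patch grid variable.
using Field = Kokkos::View<double***>;

void diffuse(const Field& phiNew, const Field& phi, double alpha) {
  using Policy = Kokkos::MDRangePolicy<Kokkos::Rank<3>>;
  const int ni = static_cast<int>(phi.extent(0));
  const int nj = static_cast<int>(phi.extent(1));
  const int nk = static_cast<int>(phi.extent(2));
  Kokkos::parallel_for(
      "diffuse", Policy({1, 1, 1}, {ni - 1, nj - 1, nk - 1}),
      KOKKOS_LAMBDA(const int i, const int j, const int k) {
        // Simple 7-point stencil update on interior cells.
        phiNew(i, j, k) = phi(i, j, k) + alpha * (
            phi(i + 1, j, k) + phi(i - 1, j, k) +
            phi(i, j + 1, k) + phi(i, j - 1, k) +
            phi(i, j, k + 1) + phi(i, j, k - 1) - 6.0 * phi(i, j, k));
      });
}
```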
Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact | 2017
John K. Holmen; Alan Humphrey; Daniel Sunderland; Martin Berzins
The University of Utah's Carbon Capture Multidisciplinary Simulation Center (CCMSC) is using the Uintah Computational Framework to predict the performance of a 1000 MWe ultra-supercritical clean coal boiler. The center aims to utilize the Intel Xeon Phi-based DOE systems, Theta and Aurora, through the Aurora Early Science Program by using the Kokkos C++ library to enable node-level performance portability. This paper describes infrastructure advancements and portability improvements made possible by the integration of Kokkos within Uintah. This integration marks a step towards consolidating Uintah's MPI+PThreads and MPI+CUDA hybrid parallelism approaches into a single MPI+Kokkos approach. Scalability results are presented that compare serial and data-parallel task execution models for a challenging radiative heat transfer calculation, central to the center's predictive boiler simulations. These results demonstrate good strong-scaling characteristics to 256 Knights Landing (KNL) processors on the NSF Stampede system and show that the KNL-based calculation is competitive with prior GPU-based results for the same calculation.
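A minimal sketch of the MPI+Kokkos combination this consolidation points toward, illustrative rather than taken from Uintah's runtime: MPI handles parallelism across ranks while a single Kokkos dispatch covers on-node work, whether the backend is OpenMP threads on KNL or CUDA on a GPU.

```cpp
// Each MPI rank performs its node-local work through one Kokkos dispatch,
// then the ranks combine their partial results with an MPI reduction.
#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  Kokkos::initialize(argc, argv);
  {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double local = 0.0;
    Kokkos::parallel_reduce("local_work", n,
        KOKKOS_LAMBDA(const int i, double& acc) {
          acc += 1.0 / (i + 1.0 + rank);
        },
        local);

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global = %f\n", global);
  }
  Kokkos::finalize();
  MPI_Finalize();
}
```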
Journal of Computational Science | 2018
Brad Peterson; Alan Humphrey; John K. Holmen; Todd Harman; Martin Berzins; Daniel Sunderland; H. Carter Edwards
High performance computing frameworks utilizing CPUs, Nvidia GPUs, and/or Intel Xeon Phis necessitate portable and scalable solutions for application developers. Nvidia GPUs in particular present numerous portability challenges with a different programming model, additional memory hierarchies, and partitioned execution units among streaming multiprocessors. This work presents modifications to the Uintah asynchronous many-task runtime and the Kokkos portability library that enable a single codebase for complex multiphysics applications to run across different architectures. Scalability and performance results are shown on multiple architectures for a globally coupled radiation heat transfer simulation, ranging from a single node to 16,384 Titan compute nodes.
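A minimal sketch of the additional memory hierarchy mentioned above, illustrative and not the paper's modified runtime: a device-resident View, a host mirror, and explicit deep copies between the two memory spaces, expressed portably so the same code also builds for a CPU-only configuration.

```cpp
// Host/device memory spaces handled portably: the mirror view is allocated
// in host-accessible memory only when the device memory is not host-accessible.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1024;
    Kokkos::View<double*> d_x("d_x", N);              // device memory space
    auto h_x = Kokkos::create_mirror_view(d_x);       // host-accessible mirror

    for (int i = 0; i < N; ++i) h_x(i) = double(i);   // fill on the host
    Kokkos::deep_copy(d_x, h_x);                      // host -> device

    Kokkos::parallel_for("square", N, KOKKOS_LAMBDA(const int i) {
      d_x(i) = d_x(i) * d_x(i);                       // compute on the device
    });

    Kokkos::deep_copy(h_x, d_x);                      // device -> host
  }
  Kokkos::finalize();
}
```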
Journal of Parallel and Distributed Computing | 2014
H. Carter Edwards; Christian Robert Trott; Daniel Sunderland
Archive | 2014
H. Carter Edwards; Christian Robert Trott; Daniel Sunderland
Archive | 2011
Harold C Edwards; Todd S. Coffey; Daniel Sunderland; Alan B. Williams