Matthew R. Norman
Oak Ridge National Laboratory
Publications
Featured research published by Matthew R. Norman.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
I. Carpenter; Rick Archibald; Katherine J. Evans; Jeffrey M. Larkin; Paulius Micikevicius; Matthew R. Norman; J. Rosinski; Jim Schwarzmeier; Mark A. Taylor
The suitability of a spectral element based dynamical core (HOMME) within the Community Atmospheric Model (CAM) for GPU-based architectures is examined and initial performance results are reported. This work was done within a project to enable CAM to run at high resolution on next-generation, multi-petaflop systems. The dynamical core is the present focus because it dominates the performance profile of our target problem. HOMME enjoys good scalability due to its underlying cubed-sphere mesh with full two-dimensional decomposition and the localization of all computational work within each element. The thread blocking and code changes that allow HOMME to effectively use GPUs are described along with a rewritten vertical remapping scheme, which improves performance on both CPUs and GPUs. Validation of results in the full HOMME model is also described. We demonstrate that the most expensive kernel in the model executes more than three times faster on the GPU than the CPU. These improvements are expected to yield greater efficiency when incorporated into the full model as configured for the target problem. Remaining issues affecting performance include optimizing the boundary exchanges for the case of multiple spectral elements being computed on the GPU.
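The paper's actual port used CUDA Fortran, but the blocking idea it describes, one spectral element per thread block with all work kept element-local, can be sketched in directive-style C. Everything below (names like NELEM, NP, and advance, plus the dimensions) is a hypothetical illustration, not HOMME code:

```c
/* Hypothetical sketch of element-per-block GPU mapping; HOMME's real
   port was CUDA Fortran. Compile with an OpenACC compiler, e.g.
   `nvc -acc -c sketch.c`. Dimensions are illustrative only. */
enum { NELEM = 600, NLEV = 70, NP = 4 };

void advance(double q[NELEM][NLEV][NP][NP],
             const double tend[NELEM][NLEV][NP][NP], double dt)
{
    /* One spectral element per gang (thread block); the NLEV*NP*NP
       points inside the element map to vector lanes, so each block's
       work stays fully local to its element. */
    #pragma acc parallel loop gang copy(q[0:NELEM]) copyin(tend[0:NELEM])
    for (int ie = 0; ie < NELEM; ie++) {
        #pragma acc loop vector collapse(3)
        for (int k = 0; k < NLEV; k++)
            for (int j = 0; j < NP; j++)
                for (int i = 0; i < NP; i++)
                    q[ie][k][j][i] += dt * tend[ie][k][j][i];
    }
}
```

Keeping all computation inside an element is what makes this mapping natural; the boundary exchange mentioned at the end of the abstract is the one step that breaks this locality.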
Journal of Computational Science | 2015
Matthew R. Norman; Jeffrey M. Larkin; Aaron Vose; Katherine J. Evans
The porting of a key kernel in the tracer advection routines of the Community Atmosphere Model – Spectral Element (CAM-SE) to Graphics Processing Units (GPUs) using OpenACC is considered in comparison to an existing CUDA FORTRAN port. Developing the OpenACC kernel was substantially simpler than developing the CUDA port, though the OpenACC version ran about 1.5× slower than the optimized CUDA version. Particular focus is given to compiler maturity regarding OpenACC support for modern FORTRAN: the Cray implementation is found to be currently more mature than the PGI implementation, although for the one case that ran successfully under PGI, the PGI OpenACC runtime was slightly faster than Cray's. The results show encouraging performance for OpenACC relative to CUDA while also exposing issues that must be resolved before the implementations are suitable for porting all of CAM-SE: most notably, future OpenACC implementations should make use of GPU shared memory, and derived-type support should be expanded.
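As a rough picture of why the OpenACC development effort was smaller (a hedged C sketch with hypothetical names such as tracer_update; the actual kernels are Fortran), a directive-based port annotates the existing loop nest in place, whereas a CUDA port requires a separate kernel, explicit thread indexing, and hand-managed device memory:

```c
/* Hypothetical tracer-update loop. The OpenACC port is the original
   loop nest plus one directive; a CUDA port of the same loop would
   need a __global__ kernel, thread-index arithmetic, and explicit
   cudaMalloc/cudaMemcpy calls. */
void tracer_update(int ncell, int ntrac, double *restrict q,
                   const double *restrict dqdt, double dt)
{
    #pragma acc parallel loop collapse(2) \
                copy(q[0:ncell*ntrac]) copyin(dqdt[0:ncell*ntrac])
    for (int m = 0; m < ntrac; m++)
        for (int i = 0; i < ncell; i++)
            q[m * ncell + i] += dt * dqdt[m * ncell + i];
}
```

The shared-memory issue the abstract raises is the flip side of this simplicity: directives expose less control over on-chip memory than hand-written CUDA.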
Computers & Electrical Engineering | 2015
Wayne Joubert; Richard K Archibald; M. Berrill; W. Michael Brown; Markus Eisenbach; Ray W. Grout; Jeff Larkin; John M. Levesque; Bronson Messer; Matthew R. Norman; Bobby Philip; Ramanan Sankaran; Arnold N. Tharrington; John A. Turner
Highlights: lessons learned are given for moving applications to the GPU-based Titan system; a carefully managed readiness effort is essential to preparing for new hardware; applications typically require code restructuring to port to accelerators; exposing more parallelism and minimizing data traffic are common porting themes; performance gains of 2X-7X have been realized for application codes on Titan.

The use of computational accelerators such as NVIDIA GPUs and Intel Xeon Phi processors is now widespread in the high performance computing community, with many applications delivering impressive performance gains. However, programming these systems for high performance, performance portability, and software maintainability has been a challenge. In this paper we discuss experiences porting applications to the Titan system. Titan, planning for which began in 2009 and which was deployed for general use in 2013, was the first multi-petaflop system based on accelerator hardware. To ready applications for accelerated computing, a preparedness effort was undertaken prior to the delivery of Titan. We report experiences and lessons learned from this process and describe how users are currently making use of computational accelerators on Titan.
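Two of the highlighted themes, exposing parallelism and minimizing data traffic, can be illustrated with a small hedged sketch (hypothetical names, not from any Titan application): a structured data region keeps arrays resident on the accelerator across many kernel launches instead of copying them every step.

```c
/* Hypothetical sketch of the "minimize data traffic" porting theme. */
void integrate(int nstep, int n, double *restrict u, double *restrict f)
{
    /* Arrays move to the device once and stay there for all steps. */
    #pragma acc data copy(u[0:n]) create(f[0:n])
    {
        for (int step = 0; step < nstep; step++) {
            #pragma acc parallel loop present(u[0:n], f[0:n])
            for (int i = 0; i < n; i++)
                f[i] = -u[i];              /* stand-in for real physics  */
            #pragma acc parallel loop present(u[0:n], f[0:n])
            for (int i = 0; i < n; i++)
                u[i] += 1.0e-3 * f[i];     /* update with no host copies */
        }
    } /* u is copied back to the host exactly once, here */
}
```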
Journal of Computational Physics | 2012
Matthew R. Norman; Hal Finkel
A new integration method combining the ADER time discretization with a multi-moment finite-volume framework is introduced. ADER runtime is reduced by performing only one Cauchy-Kowalewski (C-K) procedure per cell per time step and by using the Differential Transform Method for high-order derivatives. Three methods are implemented: (1) single-moment WENO (WENO), (2) two-moment Hermite WENO (HWENO), and (3) entirely local multi-moment (MM-Loc). MM-Loc evolves all moments, sharing the locality of Galerkin methods yet with a constant time step during p-refinement.

Five 1-D experiments validate the methods: (1) linear advection, (2) Burgers equation shock, (3) transient shallow-water (SW), (4) steady-state SW simulation, and (5) SW shock. WENO and HWENO methods showed expected polynomial h-refinement convergence and successfully limited oscillations for the shock experiments. MM-Loc showed expected polynomial h-refinement and exponential p-refinement convergence for linear advection and showed sub-exponential (yet super-polynomial) convergence with p-refinement in the SW case.

HWENO accuracy was generally equal to or better than a five-moment MM-Loc scheme. MM-Loc was less accurate than RKDG at lower refinements, but with greater h- and p-convergence, RKDG accuracy is eventually surpassed. The ADER time integrator of MM-Loc also proved more accurate with p-refinement at a CFL of unity than a semi-discrete RK analog of MM-Loc. Being faster in serial and requiring less frequent inter-node communication than Galerkin methods, the ADER-based MM-Loc and HWENO schemes can be spatially refined while keeping the same runtime, making them a competitive option for further investigation.
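To make the Differential Transform Method concrete, here is a worked recurrence for 1-D linear advection only (an illustration, not the paper's full derivation):

```latex
% Local space-time Taylor expansion:
q(x,t) = \sum_{j,k} Q_{j,k}\, x^j t^k .
% Substituting into q_t + u q_x = 0 and matching coefficients of x^j t^k:
(k+1)\, Q_{j,k+1} = -u\,(j+1)\, Q_{j+1,k}
\quad\Longrightarrow\quad
Q_{j,k+1} = -\frac{u\,(j+1)}{k+1}\; Q_{j+1,k} .
```

Starting from the spatial coefficients Q_{j,0} supplied by the reconstruction, all time coefficients follow from this cheap recurrence, which is how a single C-K procedure per cell per step stays affordable.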
Journal of Computational Physics | 2014
Matthew R. Norman
The novel ADER-DT time discretization is applied to two-dimensional transport in a quadrature-free, WENO- and FCT-limited, Finite-Volume context. Emphasis is placed on (1) the serial and parallel computational properties of ADER-DT and this framework and (2) the flexibility of ADER-DT and this framework in efficiently balancing accuracy with other constraints important to transport applications. This study demonstrates a range of choices for the user when approaching their specific application while maintaining good parallel properties. In this method, genuine multi-dimensionality, single-step and single-stage time stepping, strict positivity, and a flexible range of limiting are all achieved with only one parallel synchronization and data exchange per time step. In terms of parallel data transfers per simulated time interval, this improves upon multi-stage time stepping and post-hoc filtering techniques such as hyperdiffusion. This method is evaluated with standard transport test cases over a range of limiting options to demonstrate quantitatively and qualitatively what a user should expect when employing this method in their application.
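The communication pattern claimed here can be pictured with a schematic driver (a hypothetical skeleton with stub functions, not the paper's code): the single halo exchange supplies everything the one-step ADER-DT update needs.

```c
#include <stddef.h>

/* Hypothetical skeleton; stubs stand in for real MPI and ADER-DT code. */
static void halo_exchange(double *q, size_t n) { (void)q; (void)n; /* MPI_Sendrecv, etc. */ }
static void ader_dt_update(double *q, size_t n, double dt) {
    for (size_t i = 0; i < n; i++) q[i] += dt * 0.0;  /* placeholder update */
}

void run(double *q, size_t n, double dt, int nstep)
{
    for (int step = 0; step < nstep; step++) {
        halo_exchange(q, n);        /* the only synchronization this step */
        ader_dt_update(q, n, dt);   /* single-stage, single-step update,
                                       limiting and positivity included */
    }
    /* A three-stage Runge-Kutta method would call halo_exchange() three
       times per step, tripling exchanges per simulated time interval. */
}
```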
Journal of Computational Physics | 2015
Matthew R. Norman
Highlights: this is among the first applications of Hermite WENO limiting to an ADER time discretization; a single HWENO-limited polynomial evolves both the state values and derivatives; these HWENO methods do not relax to a first-order Godunov scheme for steep discontinuities; these HWENO limiters are much more flexible than most WENO limiters, able to be derived from any set of polynomials.

New Hermite Weighted Essentially Non-Oscillatory (HWENO) interpolants are developed and investigated within the Multi-Moment Finite-Volume (MMFV) formulation using the ADER-DT time discretization. Whereas traditional WENO methods interpolate pointwise, function-based WENO methods explicitly form a non-oscillatory, high-order polynomial over the cell in question. This study chooses a function-based approach and details how fast convergence to optimal weights for smooth flow is ensured. Methods of sixth-, eighth-, and tenth-order accuracy are developed and compared against traditional single-moment WENO methods of fifth-, seventh-, ninth-, and eleventh-order accuracy, which are more familiar from the literature. The new HWENO methods improve upon existing HWENO methods (1) by giving better resolution of unreinforced contact discontinuities and (2) by needing only a single HWENO polynomial to update both the cell mean value and cell mean derivative.

Test cases to validate and assess these methods include 1-D linear transport, the 1-D inviscid Burgers equation, and the 1-D inviscid Euler equations, using both smooth and non-smooth flows. The HWENO methods performed better than comparable literature-standard WENO methods for all regimes of discontinuity and smoothness in all tests herein. They exhibit improved optimal accuracy due to the use of derivatives, and they collapse to solutions similar to typical WENO methods when limiting is required. The study concludes that the new HWENO methods are robust and effective when used in the ADER-DT MMFV framework. These results are intended to demonstrate capability rather than exhaust all possible implementations.
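For orientation, the standard WENO weighting machinery the abstract leans on looks as follows (textbook form; the paper's HWENO construction generalizes the candidate polynomials, not this formula):

```latex
\omega_m = \frac{\alpha_m}{\sum_l \alpha_l}, \qquad
\alpha_m = \frac{\gamma_m}{(\epsilon + \beta_m)^p} .
```

Here the gamma_m are the ideal (optimal) weights, the beta_m are smoothness indicators, epsilon guards against division by zero, and p (typically 2) controls how strongly oscillatory candidates are suppressed. In smooth flow all beta_m shrink together, so omega_m converges to gamma_m and the full-order polynomial is recovered; this is the "fast convergence to optimal weights" the abstract refers to.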
International Conference on Computational Science | 2017
Salil Mahajan; Abigail L. Gaddis; Katherine J. Evans; Matthew R. Norman
A strict throughput requirement has placed a cap on the degree to which we can depend on single, long, fine-spatial-grid simulations to explore global atmospheric climate behavior. Alternatively, running an ensemble of short simulations is computationally more efficient. We test the null hypothesis that the climate statistics of a full-complexity atmospheric model derived from an ensemble of independent short simulations are equivalent to those from an equilibrated long simulation. The climate of the short-simulation ensembles is statistically distinguishable from that of the long simulation in terms of the distribution of global annual means, largely due to the presence of low-frequency intrinsic atmospheric variability in the long simulation. We also find that the climate statistics of the simulation ensemble are sensitive to the choice of compiler optimizations. While some answer-changing optimization choices do not affect the climate state in terms of mean, variability, and extremes, aggressive optimizations can result in significantly different climate states.
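One common way to formalize "statistically distinguishable distributions of global annual means" is a two-sample test; the sketch below uses the Kolmogorov-Smirnov statistic, the maximum distance between the two empirical CDFs (my choice for illustration; the paper's exact tests are not given in this abstract).

```c
#include <stdlib.h>
#include <math.h>

/* Illustrative two-sample Kolmogorov-Smirnov statistic: a[] holds
   global annual means from the ensemble, b[] those from the long run. */
static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

double ks_statistic(double *a, size_t na, double *b, size_t nb)
{
    qsort(a, na, sizeof *a, cmp_double);
    qsort(b, nb, sizeof *b, cmp_double);
    double d = 0.0;
    size_t i = 0, j = 0;
    while (i < na && j < nb) {           /* walk both empirical CDFs */
        if (a[i] <= b[j]) i++; else j++;
        double diff = fabs((double)i / na - (double)j / nb);
        if (diff > d) d = diff;
    }
    return d;  /* large d: the two climates differ in distribution */
}
```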
Journal of Advances in Modeling Earth Systems | 2018
Matthew R. Norman; R. D. Nair
Modern computer architectures reward added computation if it reduces algorithmic dependence, reduces data movement, increases accuracy/robustness, and improves memory accesses. The driving motive for this study is to develop a numerical algorithm that respects these constraints while improving accuracy and robustness. This study introduces the ADER-DT (Arbitrary DERivatives in time and space, differential transform) time discretization to positive-definite, weighted essentially nonoscillatory (WENO)-limited, finite volume transport on the cubed sphere in lieu of semidiscrete integrators. The cost of the ADER-DT algorithm is significantly improved from previous implementations without affecting accuracy. A new function-based WENO implementation is also detailed for use with the ADER-DT time discretization. While ADER-DT costs about 1.5 times more than a fourth-order, five-stage strong stability preserving Runge-Kutta (SSPRK4) method, it is far more computationally dense (which is advantageous on accelerators such as graphics processing units), and it has a larger effective maximum stable time step. ADER-DT errors converge more quickly with grid refinement than SSPRK4, giving 6.5 times less error in the L∞ norm than SSPRK4 at the highest refinement level for smooth data. For nonsmooth data, ADER-DT resolves C0 discontinuities more sharply. For a complex flow field, ADER exhibits less phase error than SSPRK4. Improving both accuracy and robustness while better respecting modern computational efficiency requirements, we believe the method presented herein is competitive for efficiently transporting tracers over the sphere in applications targeting modern computing architectures.

Plain Language Summary: This study introduces and analyzes a numerical technique for transporting quantities on the sphere using wind trajectories that explicitly takes into account the needs of modern massively parallel computer architectures while simultaneously improving accuracy and robustness. On modern computers, moving data around is extremely expensive, while performing computations on data is extremely cheap. Therefore, algorithms that perform as many useful computations as possible on a given chunk of data before moving it are rewarded with better efficiency. While doing this, the algorithm must continue to respect traditional accuracy constraints, such as conserving mass, giving useful solutions without noise contamination, and resolving features as well as possible. The method presented here performs nearly all of its computations on a small chunk of data without having to move the data intermittently. Therefore, it performs very well on accelerator devices such as graphics processing units compared to existing methods, which interrupt the computation with many intermittent spurts of data movement. Test cases show that for smooth and nonsmooth data, as well as simple and complex flow fields, the proposed algorithm not only fits into modern computing constraints but also improves accuracy compared to current methods.
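The cost trade-off in this abstract reduces to simple arithmetic: time-to-solution scales as per-step cost divided by the stable step size. With the stated 1.5x per-step cost (the step-size ratio below is left symbolic because the abstract gives no number):

```latex
\frac{T_{\mathrm{ADER\text{-}DT}}}{T_{\mathrm{SSPRK4}}}
= \frac{C_{\mathrm{ADER\text{-}DT}} / \Delta t_{\mathrm{ADER\text{-}DT}}}
       {C_{\mathrm{SSPRK4}} / \Delta t_{\mathrm{SSPRK4}}}
= \frac{1.5}{\;\Delta t_{\mathrm{ADER\text{-}DT}} / \Delta t_{\mathrm{SSPRK4}}\;} .
```

So the 1.5x per-step cost is repaid in wall time whenever the maximum stable step exceeds 1.5x that of SSPRK4, before even counting the accuracy and computational-density advantages the abstract reports.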
Journal of Advances in Modeling Earth Systems | 2017
Joseph H. Kennedy; Andrew R. Bennett; Katherine J. Evans; Stephen Price; Matthew J. Hoffman; William H. Lipscomb; Jeremy G. Fyke; Lauren Vargo; Adrianna Boghozian; Matthew R. Norman; Patrick H. Worley
To address the pressing need to better understand the behavior and complex interaction of ice sheets within the global Earth system, significant development of continental-scale, dynamical ice sheet models is underway. Concurrent to the development of the Community Ice Sheet Model (CISM), the corresponding verification and validation (V&V) process is being coordinated through a new, extensible, Python-based software package, the Land Ice Verification and Validation toolkit (LIVVkit). Incorporated into the typical ice sheet model development cycle, it provides robust and automated numerical verification, software verification, performance validation, and physical validation analyses on a variety of platforms, from personal laptops to the largest supercomputers. LIVVkit operates on sets of regression tests and reference data sets, and provides comparisons for a suite of community-prioritized tests, including configuration and parameter variations, bit-for-bit evaluation, and plots of model variables to indicate where differences occur. LIVVkit also provides an easily extensible framework to incorporate and analyze results of new intercomparison projects, new observational data, and new computing platforms. LIVVkit is designed for quick adaptation to additional ice sheet models via abstraction of model-specific code, functions, and configurations into an ice sheet model description bundle outside the main LIVVkit structure. Ultimately, through shareable and accessible analysis output, LIVVkit is intended to help developers build confidence in their models and enhance the credibility of ice sheet models overall.
International Journal of High Performance Computing Applications | 2017
Katherine J. Evans; Richard K Archibald; David J. Gardner; Matthew R. Norman; Mark A. Taylor; Carol S. Woodward; Patrick H. Worley
Explicit Runge–Kutta methods and implicit multistep methods utilizing a Newton–Krylov nonlinear solver are evaluated for computational performance across a range of configurations of the shallow-water dynamical core of the spectral element community atmosphere model. These configurations are designed to explore the attributes of each method under different but relevant model usage scenarios, including varied spectral order within an element, static regional refinement, and scaling to the largest problem sizes. The analysis is performed within the shallow-water dynamical core option of a full climate model code base to enable a wealth of simulations for study, with the aim of informing solver development within the more complete hydrostatic dynamical core used for climate research. The limitations and benefits of explicit versus implicit methods, with different parameters and settings, are discussed in light of the trade-offs with Message Passing Interface (MPI) communication and memory and their inherent efficiency bottlenecks. Given the performance behavior across the configurations analyzed here, the recommendation for future work using the implicit solvers is conditional on scale separation and the stiffness of the problem. For the regionally refined configurations, the implicit method has about the same efficiency as the explicit method, without considering efficiency gains from a preconditioner. The potential for improvement using a preconditioner is greatest for higher spectral-order configurations, where more work is shifted to the linear solver. Initial simulations with OpenACC directives to utilize a Graphics Processing Unit (GPU) when performing function evaluations show local improvements and suggest that overall gains are possible with adjustments to data exchanges.
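For context, the Newton-Krylov machinery referenced here has a standard form (a generic formulation, not this paper's specific solver settings): each implicit step solves a nonlinear residual equation by Newton iteration, with each Newton update obtained approximately by a Krylov linear solver such as GMRES.

```latex
F(u^{n+1}) = 0, \qquad
J(u_k)\,\delta_k = -F(u_k), \qquad
u_{k+1} = u_k + \delta_k, \qquad
J = \frac{\partial F}{\partial u} .
```

A preconditioner accelerates the inner Krylov solves, which is why the abstract sees the greatest potential gains at higher spectral order, where more of the work shifts into the linear solver.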