Stephen M. Guzik
Colorado State University
Publication
Featured research published by Stephen M. Guzik.
52nd Aerospace Sciences Meeting | 2014
Xinfeng Gao; Stephen M. Guzik; Phillip Colella
This study focuses on a fourth-order boundary treatment for finite-volume schemes to solve the compressible Navier-Stokes equations on a Cartesian grid. A fourth-order finite-volume stencil is derived for the viscous stress tensor operator and the modified fourth-order stencil near the physical boundary is developed. Fourier error analysis and stability analysis are performed for the fourth-order elliptic operator. For time integration, we use the fourth-order Runge-Kutta method. The fourth-order scheme was applied to the transient Couette flow and the solution accuracy was verified.
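As a rough illustration of what a fourth-order finite-volume stencil looks like, the sketch below uses the textbook centered formula for a face gradient computed from cell averages in one dimension. It is not the viscous stress tensor stencil or the boundary treatment from the paper; the test function sin(x), the periodic grid, and all function names are assumptions made for this example. Halving the cell size should reduce the error by roughly a factor of 16.

```python
import math

def cell_average_sin(x_lo, x_hi):
    """Exact average of sin(x) over the cell [x_lo, x_hi]."""
    return (math.cos(x_lo) - math.cos(x_hi)) / (x_hi - x_lo)

def face_gradient_4th(um1, u0, u1, u2, h):
    """Centered fourth-order estimate of du/dx at the face between cells 0 and 1.

    um1, u0, u1, u2 are the cell averages in cells -1, 0, 1, 2 (hypothetical names).
    """
    return (15.0 * (u1 - u0) - (u2 - um1)) / (12.0 * h)

def max_face_gradient_error(n):
    """Largest face-gradient error for sin(x) on a periodic grid of n cells on [0, 2*pi]."""
    h = 2.0 * math.pi / n
    u = [cell_average_sin(j * h, (j + 1) * h) for j in range(n)]
    worst = 0.0
    for i in range(n):
        # Periodic wrap-around supplies the neighbours near the domain ends.
        approx = face_gradient_4th(u[(i - 1) % n], u[i], u[(i + 1) % n], u[(i + 2) % n], h)
        exact = math.cos((i + 1) * h)  # d/dx sin(x) at the face shared by cells i and i+1
        worst = max(worst, abs(approx - exact))
    return worst

if __name__ == "__main__":
    coarse = max_face_gradient_error(32)
    fine = max_face_gradient_error(64)
    print("error ratio when halving h:", coarse / fine)
```

A ratio close to 16 is the usual numerical signature of fourth-order accuracy; near a physical boundary the centered neighbours are not all available, which is why a modified stencil is needed there.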
IEEE International Conference on High Performance Computing Data and Analytics | 2014
Catherine Olschanowsky; Michelle Mills Strout; Stephen M. Guzik; John Loffeld; J. Hittinger
Structured-grid PDE solver frameworks parallelize over boxes, which are rectangular domains of cells or faces in a structured grid. In the Chombo framework, the box sizes are typically 16³ or 32³, but larger box sizes such as 128³ would result in less surface area and therefore less storage, copying, and/or ghost cells communication overhead. Unfortunately, current on-node parallelization schemes perform poorly for these larger box sizes. In this paper, we investigate 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128³ boxes result in close to ideal parallel scaling and come close to matching the performance of 16³ boxes on three different multicore systems for a benchmark that is a proxy for program idioms found in Computational Fluid Dynamics (CFD) codes.
Journal of Computational Physics | 2014
Stephen M. Guzik; Todd H. Weisgraber; Phillip Colella; Berni J. Alder
A lattice-Boltzmann model to solve the equivalent of the Navier-Stokes equations on adaptively refined grids is presented. A method for transferring information across interfaces between different grid resolutions was developed following established techniques for finite-volume representations. This new approach relies on a space-time interpolation and solving constrained least-squares problems to ensure conservation. The effectiveness of this method at maintaining the second order accuracy of lattice-Boltzmann is demonstrated through a series of benchmark simulations and detailed mesh refinement studies. These results exhibit smaller solution errors and improved convergence when compared with similar approaches relying only on spatial interpolation. Examples highlighting the mesh adaptivity of this method are also provided.
53rd AIAA Aerospace Sciences Meeting | 2015
Xinfeng Gao; Stephen M. Guzik
This study focuses on a fourth-order finite-volume scheme to solve the compressible Navier-Stokes equations on a Cartesian grid. Two approaches for handling boundary conditions are presented: one employs one-sided-difference methods and the other almost completely supports using centered-difference methods everywhere. Both approaches are genuinely fourth-order in space and time for the treatment of viscous terms. For time integration, a fourth-order Runge-Kutta method is used. The fourth-order scheme was applied to an unsteady boundary layer flat plate flow and the solution accuracy was qualitatively verified.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Christopher D. Krieger; Michelle Mills Strout; Catherine Olschanowsky; Andrew Stone; Stephen M. Guzik; Xinfeng Gao; Carlo Bertolli; Paul H. J. Kelly; Gihan R. Mudalige; Brian Van Straalen; Samuel Williams
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, OP2, used in full applications with unstructured grids, and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop chain enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
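To make the idea of scheduling across loops that share data more concrete, here is a minimal sketch of plain loop fusion on a one-dimensional periodic grid. It does not use the loop chaining abstraction, OP2, or Chombo; the flux formula, the coefficient c, and all function names are assumptions chosen for illustration. The first version runs two loops that communicate through a temporary flux array, and the second fuses them so each flux is consumed right after it is produced.

```python
def update_two_loops(u, c):
    """Loop 1 fills a temporary face-flux array; loop 2 reads it to update cells."""
    n = len(u)
    # Loop 1: flux at face f, shared by cells f-1 and f (periodic grid).
    flux = [0.5 * (u[(f - 1) % n] + u[f % n]) for f in range(n + 1)]
    # Loop 2: each cell reads the fluxes on its two faces.
    return [u[i] - c * (flux[i + 1] - flux[i]) for i in range(n)]

def update_fused(u, c):
    """Same arithmetic in one loop; only the previous face flux is kept."""
    n = len(u)
    out = [0.0] * n
    left = 0.5 * (u[-1] + u[0])  # flux on the left face of cell 0
    for i in range(n):
        right = 0.5 * (u[i] + u[(i + 1) % n])
        out[i] = u[i] - c * (right - left)
        left = right  # reuse for the next cell instead of storing all fluxes
    return out

if __name__ == "__main__":
    u = [float((3 * i) % 7) for i in range(12)]
    a = update_two_loops(u, 0.1)
    b = update_fused(u, 0.1)
    print("largest difference:", max(abs(x - y) for x, y in zip(a, b)))
```

The fused form removes the temporary array and improves locality, but the carried value `left` also serializes the loop as written. That is one small instance of the locality versus parallelism tradeoff a scheduler has to weigh.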
Computers & Mathematics With Applications | 2016
Stephen M. Guzik; Xinfeng Gao; Catherine Olschanowsky
This work focuses on the development of a high-performance fourth-order finite-volume method to solve the nonlinear partial differential equations governing the compressible Navier–Stokes equations on a Cartesian grid with adaptive mesh refinement. The novelty of the present study is to introduce the loop chaining concept to this complex fourth-order fluid dynamics algorithm for significant improvement in code performance on parallel machines. Specific operations involved in the algorithm include the finite-volume formulation of fourth-order spatial discretization stencils and optimal inter-loop parallelization strategies. Numerical fluxes of the Navier–Stokes equations comprise the hyperbolic (inviscid) and elliptic (viscous) components. The hyperbolic flux is evaluated using high-resolution Godunov’s method and the elliptic flux is based on fourth-order centered-difference methods everywhere in the computational domain. The use of centered-difference methods everywhere supports the idea of fusing modular codes to achieve high efficiency on modern computers. Temporal discretization is performed using the standard fourth-order Runge–Kutta method. The fourth-order accuracy of solution in space and time is verified with a transient Couette flow problem. The algorithm is applied to solve Sod’s shock tube and the transient flat-plate boundary layer flow. The numerical predictions are validated by comparing to the analytical solutions. The performance of the baseline code is compared to that of the fused scheme, which fuses modular codes via the loop chaining concept, and a significant improvement in execution time is observed.
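For readers unfamiliar with the standard fourth-order Runge–Kutta method mentioned above, the following sketch applies the classical four-stage scheme to a scalar test problem. The decay equation y' = -y, the step counts, and the function names are assumptions made for this example and are unrelated to the Navier–Stokes solver itself.

```python
import math

def rk4_step(f, t, y, dt):
    """Advance y by one step of the classical four-stage Runge-Kutta scheme."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def integrate(f, y0, t_end, n_steps):
    """Integrate from t = 0 to t_end with n_steps equal steps."""
    dt = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = rk4_step(f, t, y, dt)
        t += dt
    return y

def decay(t, y):
    """Right-hand side of the test problem y' = -y (exact solution exp(-t))."""
    return -y

def final_error(n_steps):
    """Error at t = 1 against the exact solution."""
    return abs(integrate(decay, 1.0, 1.0, n_steps) - math.exp(-1.0))

if __name__ == "__main__":
    print("error ratio when halving dt:", final_error(20) / final_error(40))
```

As with a spatial refinement study, an error ratio near 16 when the step is halved indicates fourth-order accuracy in time.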
50th AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition | 2012
Stephen M. Guzik; Peter McCorquodale; Phillip Colella
A fourth-order accurate finite-volume method is presented for solving time-dependent hyperbolic systems of conservation laws on mapped grids that are adaptively refined in space and time. Novel considerations for formulating the semi-discrete system of equations in computational space combined with detailed mechanisms for accommodating the adapting grids ensure that conservation is maintained and that the divergence of a constant vector field is always zero (freestream-preservation property). Advancement in time is achieved with a fourth-order Runge-Kutta method.
International Workshop on OpenMP | 2015
Pei-Hung Lin; Chunhua Liao; Daniel J. Quinlan; Stephen M. Guzik
The Department of Energy has a wide range of large-scale, parallel scientific applications running on cutting-edge high-performance computing systems to support its mission and tackle critical science challenges. A recent trend in these high-performance computing systems is to add commodity accelerators, such as Nvidia GPUs and Intel Xeon Phi coprocessors, into computer nodes so we can achieve increased performance without exceeding the limited power budget. However, it is well-known in the high-performance computing community that porting existing applications to accelerators is a difficult task given the numerous set of unique hardware features and the general complexity of software. In this paper, we share our experiences of using the OpenMP Accelerator Model to port two stencil applications to exploit Nvidia GPUs. Introduced as part of the OpenMP 4.0 specification, the OpenMP accelerator model provides a set of directives for users to specify semantics related to accelerators so that compilers and runtime systems can automatically handle repetitive and error-prone accelerator programming tasks, including code transformations, work scheduling, data management, reduction, and so on. Using a prototype compiler implementation based on the ROSE source-to-source compiler framework, we report the problems we encountered during the porting process, our solutions, and the obtained performance. Productivity is also evaluated. Our experience shows that the existing OpenMP Accelerator Model can effectively help programmers leverage accelerators. However, complex data types and non-canonical control structures can pose challenges for programmers to productively apply accelerator directives.
49th AIAA/ASME/SAE/ASEE Joint Propulsion Conference | 2013
Nathan Spotts; Stephen M. Guzik; Xinfeng Gao
Work presented at the 1st Propulsion Aerodynamics Workshop on the CFD study of compressible flow through convergent-conical nozzles is summarized. This work focused on assessing the accuracy of the CFD study in obtaining nozzle performance and flow structure, including nozzle thrust and discharge coefficients and the shock structure. The CFD studies were performed using Metacomp CFD++ software and compared with the available experimental data during the workshop. We confirmed that the discharge coefficient increases as the nozzle angle decreases and the choked nozzle pressure ratio is lower for a smaller nozzle angle. The discharge coefficient increases with increasing pressure ratio until the choked condition is reached. The thrust coefficient increases as the nozzle angle increases, and for a given nozzle angle, the thrust coefficient decreases as nozzle pressure ratio increases. Results and assessments are presented in this paper and at the 49th AIAA Joint Propulsion Conference.
Proceedings of the Third International Workshop on Accelerator Programming Using Directives | 2016
Ian J. Bertolacci; Michelle Mills Strout; Stephen M. Guzik; Jordan Riley; Catherine Olschanowsky
Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge to porting and optimizing existing computational science simulation codes to improve performance and accuracy. OpenMP provides many mechanisms for expressing parallelism, but it primarily remains the programmer's responsibility to group computations to improve data locality. The loop chain abstraction, where data access patterns are included with the specification of parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality tradeoff. In this paper, we present a loop chain pragma and an extension to the omp for to enable the specification of loop chains and high-level specifications of schedules on loop chains. We show example usage of the extensions, describe their implementation, and show preliminary performance results for some simple examples.