
Publication


Featured research published by Brian Van Straalen.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Optimization of geometric multigrid for emerging multi- and manycore processors

Samuel Williams; Dhiraj D. Kalamkar; Amik Singh; Anand M. Deshpande; Brian Van Straalen; Mikhail Smelyanskiy; Ann S. Almgren; Pradeep Dubey; John Shalf; Leonid Oliker

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems, including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based InfiniBand clusters, and the new Intel® Xeon Phi coprocessor (Knights Corner). Our work examines a variety of novel techniques, including communication aggregation, threaded wavefront-based DRAM communication avoidance, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to realize the potential of manycore processors in large-scale multigrid calculations.
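The V-cycle these optimizations target can be illustrated with a minimal, single-threaded sketch: a 1D Poisson problem with a weighted-Jacobi smoother, full-weighting restriction, and linear-interpolation prolongation. This is a hedged illustration of the method's recursive structure only, in Python with NumPy, not the authors' optimized implementation.

```python
import numpy as np

def residual(u, f, h):
    """r = f - A u for the 1D Poisson operator A = -d2/dx2, zero Dirichlet BCs."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps=3, w=2.0 / 3.0):
    """Weighted-Jacobi smoother (vectorized, so old values are used on the RHS)."""
    for _ in range(sweeps):
        u[1:-1] += w * 0.5 * (h**2 * f[1:-1] + u[:-2] + u[2:] - 2.0 * u[1:-1])
    return u

def vcycle(u, f, h):
    n = len(u) - 1
    if n == 2:                      # coarsest grid: one interior point, solve exactly
        u[1] = h**2 * f[1] / 2.0
        return u
    u = smooth(u, f, h)             # pre-smooth
    r = residual(u, f, h)
    rc = np.zeros(n // 2 + 1)       # restrict the residual by full weighting
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    ec = vcycle(np.zeros_like(rc), rc, 2.0 * h)   # recurse for the coarse correction
    e = np.zeros_like(u)            # prolong by linear interpolation
    e[::2] = ec
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    u += e
    return smooth(u, f, h)          # post-smooth

# solve -u'' = sin(pi x) on [0, 1]; exact solution is sin(pi x) / pi^2
n = 64
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.sin(np.pi * x)
u = np.zeros_like(x)
for _ in range(10):
    u = vcycle(u, f, h)
```

Each phase the paper instruments (smooth, residual, restriction, prolongation) appears here as one function or slice expression; the optimizations in the paper rework exactly these phases for multicore memory hierarchies.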


Journal of Physics: Conference Series | 2007

Performance and scaling of locally-structured grid methods for partial differential equations

Phillip Colella; John B. Bell; Noel Keen; Terry J. Ligocki; Michael J. Lijewski; Brian Van Straalen

In this paper, we discuss some of the issues in obtaining high performance for block-structured adaptive mesh refinement software for partial differential equations. We show examples in which AMR scales to thousands of processors. We also discuss a number of metrics for performance and scalability that can provide a basis for understanding the advantages and disadvantages of this approach.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis

Yu Jung Lo; Samuel Williams; Brian Van Straalen; Terry J. Ligocki; Matthew J. Cordery; Nicholas J. Wright; Mary W. Hall; Leonid Oliker

We present preliminary results of the Roofline Toolkit for multicore, manycore, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented microbenchmarks implemented with the Message Passing Interface (MPI), with OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism, and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory management mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.
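The Roofline bound itself is simple to state: attainable performance is the minimum of peak compute throughput and the product of arithmetic intensity and sustained memory bandwidth. A sketch in Python, with hypothetical machine numbers (the 1000 GFLOP/s and 100 GB/s figures are illustrative assumptions, not measurements from the toolkit):

```python
def roofline(ai, peak_gflops, bw_gbs):
    """Attainable GFLOP/s: the compute ceiling capped by the bandwidth slope.

    ai: arithmetic intensity in flops per byte moved from DRAM.
    """
    return min(peak_gflops, ai * bw_gbs)

# hypothetical machine: 1000 GFLOP/s peak, 100 GB/s sustained DRAM bandwidth
machine = (1000.0, 100.0)
for ai in (0.25, 1.0, 10.0, 100.0):
    print(f"AI = {ai:6.2f} flop/byte -> {roofline(ai, *machine):7.1f} GFLOP/s")
```

Kernels with AI below the "ridge point" (here 10 flop/byte) are bandwidth-bound; the toolkit's microbenchmarks measure the empirical ceilings that replace these nominal specifications.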


Archive | 2014

HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems

Mark Adams; Jed Brown; John Shalf; Brian Van Straalen; Erich Strohmaier; Samuel Williams

This document provides an overview of the benchmark, HPGMG, for ranking large-scale general-purpose computers for use on the Top500 list [8]. We provide a rationale for the need for a replacement for the current metric, HPL; some background on the Top500 list and the challenges of developing such a metric; a discussion of our design philosophy and methodology; and an overview of the specification of the benchmark. The primary documentation, with maintained details on the specification, can be found at hpgmg.org; the Wiki and the benchmark code itself can be found in the repository https://bitbucket.org/hpgmg/hpgmg.


Journal of Parallel and Distributed Computing | 2014

A survey of high level frameworks in block-structured adaptive mesh refinement packages

Anshu Dubey; Ann S. Almgren; John B. Bell; Martin Berzins; Steven R. Brandt; Greg L. Bryan; Phillip Colella; Daniel T. Graves; Michael J. Lijewski; Frank Löffler; Brian W. O'Shea; Brian Van Straalen; Klaus Weide

Over the last decade, block-structured adaptive mesh refinement (SAMR) has found increasing use in large, publicly available codes and frameworks. SAMR frameworks have evolved along different paths. Some have stayed focused on specific domain areas; others have pursued a more general functionality, providing the building blocks for a larger variety of applications. In this survey paper we examine a representative set of SAMR packages and SAMR-based codes that have been in existence for half a decade or more, have a reasonably sized and active user base outside of their home institutions, and are publicly available. The set consists of a mix of SAMR packages and application codes that cover a broad range of scientific domains. We look at their high-level frameworks, their design trade-offs, and their approach to dealing with the advent of radical changes in hardware architecture. The codes included in this survey are BoxLib, Cactus, Chombo, Enzo, FLASH, and Uintah. Highlights: a survey of mature, openly available, state-of-the-art structured AMR libraries and codes; a discussion of their frameworks, challenges, and design trade-offs; and the directions being pursued by these codes to prepare for future many-core and heterogeneous platforms.


International Parallel and Distributed Processing Symposium | 2009

Scalability challenges for massively parallel AMR applications

Brian Van Straalen; John Shalf; Terry J. Ligocki; Noel Keen; Woo-Sun Yang

PDE solvers using adaptive mesh refinement (AMR) on block-structured grids are some of the most challenging applications to adapt to massively parallel computing environments. We describe optimizations to the Chombo AMR framework that enable it to scale efficiently to thousands of processors on the Cray XT4. The optimization process also uncovered OS-related performance variations that were not explained by conventional OS interference benchmarks. Ultimately, the variability was traced back to complex interactions between the application, system software, and the memory hierarchy. Once identified, software modifications to control the variability improved performance by 20% and decreased the variation in computation time across processors by a factor of 3. These newly identified sources of variation will impact many applications and suggest that new benchmarks for OS services be developed.


International Parallel and Distributed Processing Symposium | 2015

Compiler-Directed Transformation for Higher-Order Stencils

Protonu Basu; Mary W. Hall; Samuel Williams; Brian Van Straalen; Leonid Oliker; Phillip Colella

As the cost of data movement increasingly dominates performance, developers of finite-volume and finite-difference solutions for partial differential equations (PDEs) are exploring novel higher-order stencils that increase numerical accuracy and computational intensity. This paper describes a new compiler reordering transformation applied to stencil operators that performs partial sums in buffers and reuses the partial sums in computing multiple results. This optimization has multiple effects on improving stencil performance that are particularly important to higher-order stencils: it exploits data reuse, reduces floating-point operations, and exposes efficient SIMD parallelism to backend compilers. We study the benefit of this optimization in the context of geometric multigrid (GMG), a widely used method to solve PDEs, using four different Jacobi smoothers built from 7-, 13-, 27-, and 125-point stencils. We quantify performance, speedup, and numerical accuracy, and use the Roofline model to qualify our results. Ultimately, we obtain over 4× speedup on the smoothers themselves and up to a 3× speedup on the multigrid solver. Finally, we demonstrate that high-order multigrid solvers have the potential to reduce total data movement and energy by several orders of magnitude.
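The partial-sum idea can be shown in one dimension: a (2r+1)-point sum stencil recomputed naively costs 2r additions per output point, whereas carrying a running window sum in a buffer costs one add and one subtract per point, reusing each partial sum across neighboring results. A small Python sketch as an analogy, not the compiler-generated code (integer data is used so both variants agree exactly):

```python
def stencil_naive(a, radius=2):
    """(2*radius+1)-point sum stencil: 2*radius additions per output point."""
    n = len(a)
    out = [0] * n
    for i in range(radius, n - radius):
        out[i] = sum(a[i - radius:i + radius + 1])
    return out

def stencil_partial_sum(a, radius=2):
    """Buffer a running window sum: one add and one subtract per output point."""
    n = len(a)
    out = [0] * n
    s = sum(a[:2 * radius + 1])     # partial sum for the first output point
    out[radius] = s
    for i in range(radius + 1, n - radius):
        s += a[i + radius] - a[i - radius - 1]   # slide the window, reusing s
        out[i] = s
    return out

data = [(7 * k) % 11 for k in range(20)]
```

For the 125-point stencils in the paper the savings are much larger, since each 3D partial sum is reused along whole pencils of output points.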


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Compiler generation and autotuning of communication-avoiding operators for geometric multigrid

Protonu Basu; Anand Venkat; Mary W. Hall; Samuel Williams; Brian Van Straalen; Leonid Oliker

This paper describes a compiler approach to introducing communication-avoiding optimizations in geometric multigrid (GMG), one of the most popular methods for solving partial differential equations. Communication-avoiding optimizations reduce vertical communication through the memory hierarchy and horizontal communication across processes or threads, usually at the expense of introducing redundant computation. We focus on applying these optimizations to the smooth operator, which successively reduces the error and accounts for the largest fraction of the GMG execution time. Our compiler technology applies both novel and known transformations to derive an implementation comparable to manually-tuned code. To make the approach portable, an underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically identify the best implementation for a particular architecture and at each computation phase. Results show that we are able to quadruple the performance of the smooth operation on the finest grids while attaining performance within 94% of manually-tuned code. Overall, we improve the multigrid solve time by 2.5× without sacrificing programmer productivity.
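The communication/computation trade-off the autotuner explores can be sketched with a toy cost model in Python. All numbers and the live-region formula below are illustrative assumptions, not measurements from the paper: with ghost depth d, one halo exchange covers d smoother sweeps, at the price of redundantly updating an enlarged region on the earlier sweeps of each block.

```python
from math import ceil

def exchanges(sweeps, depth):
    """Halo exchanges needed to run `sweeps` smooths with ghost depth `depth`."""
    return ceil(sweeps / depth)

def cells_computed(box, sweeps, depth):
    """Total cell updates per cubic box: the live region shrinks by one ghost
    layer per sweep within each block of `depth` sweeps (redundant work)."""
    total = 0
    for k in range(sweeps):
        live = box + 2 * (depth - 1 - (k % depth))
        total += live**3
    return total

box, sweeps = 64, 4
for depth in (1, 2, 4):
    print(f"depth {depth}: {exchanges(sweeps, depth)} exchanges, "
          f"{cells_computed(box, sweeps, depth)} cell updates")
```

An autotuner in this spirit would pick the depth minimizing modeled time once per-exchange latency and per-cell cost are plugged in; the paper's system searches this space (and threading schemes) empirically per architecture.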


International Parallel and Distributed Processing Symposium | 2014

s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid

Samuel Williams; Michael J. Lijewski; Ann S. Almgren; Brian Van Straalen; Erin Carson; Nicholas Knight; James Demmel

Geometric multigrid solvers within adaptive mesh refinement (AMR) applications often reach a point where further coarsening of the grid becomes impractical as individual subdomain sizes approach unity. At this point the most common solution is to use a bottom solver, such as BiCGStab, to reduce the residual by a fixed factor at the coarsest level. Each iteration of BiCGStab requires multiple global reductions (MPI collectives). As the number of BiCGStab iterations required for convergence grows with problem size, and the time for each collective operation increases with machine scale, bottom solves in large-scale applications can constitute a significant fraction of the overall multigrid solve time. In this paper, we implement, evaluate, and optimize a communication-avoiding s-step formulation of BiCGStab (CABiCGStab for short) as a high-performance, distributed-memory bottom solver for geometric multigrid solvers. This is the first time s-step Krylov subspace methods have been leveraged to improve multigrid bottom solver performance. We use a synthetic benchmark for detailed analysis and integrate the best implementation into BoxLib in order to evaluate the benefit of an s-step Krylov subspace method on the multigrid solves found in the applications LMC and Nyx on up to 32,768 cores on the Cray XE6 at NERSC. Overall, we see bottom solver improvements of up to 4.2× on synthetic problems and up to 2.7× in real applications. This results in as much as a 1.5× improvement in solver performance in real applications.
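The collective-counting argument can be made concrete with a toy model in Python. The four-reductions-per-iteration figure and the latency number are illustrative assumptions, not values from the paper: classical BiCGStab pays several latency-bound global reductions every iteration, while an s-step formulation pays one blocked reduction (a Gram-matrix computation) per s iterations.

```python
from math import ceil

def classical_collectives(iterations, reductions_per_iter=4):
    """Classical BiCGStab: several global dot products per iteration (assumed 4)."""
    return iterations * reductions_per_iter

def s_step_collectives(iterations, s):
    """s-step variant: one blocked reduction covers s iterations' worth of
    inner products, computed from a tall-skinny Gram matrix."""
    return ceil(iterations / s)

def solve_time(collectives, flop_time_us, latency_us=50.0):
    """Toy cost model: compute time plus a fixed latency per collective."""
    return flop_time_us + collectives * latency_us

iters = 100
print("classical:", solve_time(classical_collectives(iters), 1000.0), "us")
print("s-step (s=4):", solve_time(s_step_collectives(iters, 4), 1000.0), "us")
```

As machine scale drives the per-collective latency up and problem size drives the iteration count up, the gap between the two curves widens, which is the regime the paper targets at 32,768 cores.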


Journal of Physics: Conference Series | 2008

Performance of embedded boundary methods for CFD with complex geometry

David Trebotich; Brian Van Straalen; Dan Graves; P. Colella

In this paper, we discuss some of the issues in obtaining high performance for block-structured adaptive mesh refinement software for partial differential equations in complex geometry using embedded boundary/volume-of-fluid methods. We present the design of an adaptive embedded boundary multigrid algorithm for elliptic problems. We show examples in which this new elliptic solver scales to 1000 processors. We also apply this technology to more complex mathematical and physical algorithms for incompressible fluid dynamics and demonstrate similar scaling.

Collaboration


Dive into Brian Van Straalen's collaborations.

Top Co-Authors

Samuel Williams, Lawrence Berkeley National Laboratory
Daniel T. Graves, Lawrence Berkeley National Laboratory
Noel Keen, Lawrence Berkeley National Laboratory
Gunther H. Weber, Lawrence Berkeley National Laboratory
John Shalf, Lawrence Berkeley National Laboratory
David Trebotich, Lawrence Berkeley National Laboratory