Publication


Featured research published by Jonathan Carter.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Kaushik Datta; Mark Murphy; Vasily Volkov; Samuel Williams; Jonathan Carter; Leonid Oliker; David A. Patterson; John Shalf; Katherine A. Yelick

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications on scientific algorithm development.
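
The stencil kernels themselves are not reproduced in the abstract; as a minimal sketch of the class of computation being tuned, the following C routine sweeps a 7-point (nearest-neighbor) 3D stencil over a structured grid. The grid dimensions and coefficients are illustrative assumptions, not values from the paper.

    #include <stddef.h>

    /* Illustrative grid dimensions; real problem sizes come from the
       auto-tuner's search space, not from fixed constants. */
    #define NX 128
    #define NY 128
    #define NZ 128
    #define IDX(i, j, k) ((size_t)(i) + NX * ((size_t)(j) + NY * (size_t)(k)))

    /* One sweep of a 7-point nearest-neighbor stencil: every interior
       point is updated from itself and its six face neighbors, the
       access pattern at the heart of many structured-grid PDE solvers. */
    void stencil_sweep(const double *in, double *out, double alpha, double beta)
    {
        for (int k = 1; k < NZ - 1; k++)
            for (int j = 1; j < NY - 1; j++)
                for (int i = 1; i < NX - 1; i++)
                    out[IDX(i, j, k)] =
                        alpha * in[IDX(i, j, k)] +
                        beta * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                                in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                                in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    }

An auto-tuner of the kind described would generate many variants of this loop nest (cache-blocked, unrolled, SIMDized) and time each one to select the fastest per platform.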


International Parallel and Distributed Processing Symposium | 2008

Lattice Boltzmann simulation optimization on leading multicore platforms

Samuel Williams; Jonathan Carter; Leonid Oliker; John Shalf; Katherine A. Yelick

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single-core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14× improvement compared with the original code. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
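
The abstract does not show the code generator; as a minimal sketch of the search loop that sits on top of any such generator, the following self-contained C program times candidate kernel variants and keeps the fastest. The two stub variants stand in for generator-emitted LBMHD code and are purely hypothetical.

    #include <stdio.h>
    #include <time.h>

    typedef void (*kernel_fn)(void);

    /* Stub variants standing in for generated LBMHD kernels that would
       differ in unrolling, blocking, or SIMDization. */
    static volatile double sink;
    static void variant_a(void) {
        double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i * 0.5;
        sink = s;
    }
    static void variant_b(void) {
        double s = 0.0;
        for (int i = 0; i < 1000000; i += 2) s += i;
        sink = s;
    }

    /* Wall-clock seconds for `trials` invocations of one variant. */
    static double time_kernel(kernel_fn k, int trials) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < trials; i++) k();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        kernel_fn variants[] = { variant_a, variant_b };
        int nv = sizeof variants / sizeof variants[0], best = 0;
        double best_time = time_kernel(variants[0], 10);
        for (int v = 1; v < nv; v++) {   /* exhaustive search over variants */
            double t = time_kernel(variants[v], 10);
            if (t < best_time) { best_time = t; best = v; }
        }
        printf("fastest variant: %d (%.4f s / 10 trials)\n", best, best_time);
        return 0;
    }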


Conference on High Performance Computing (Supercomputing) | 2004

Scientific Computations on Modern Parallel Vector Systems

Leonid Oliker; Andrew Canning; Jonathan Carter; John Shalf; Stephane Ethier

Computational scientists have seen a frustrating trend of stagnating application performance despite dramatic increases in the claimed peak capability of high performance computing systems. This trend has been widely attributed to the use of superscalar-based commodity components whose architectural designs offer a balance between memory performance, network capability, and execution rate that is poorly matched to the requirements of large-scale numerical computations. Recently, two innovative parallel-vector architectures have become operational: the Japanese Earth Simulator (ES) and the Cray X1. In order to quantify what these modern vector capabilities entail for the scientists that rely on modeling and simulation, it is critical to evaluate this architectural paradigm in the context of demanding computational algorithms. Our evaluation study examines four diverse scientific applications with the potential to run at ultrascale, from the areas of plasma physics, material science, astrophysics, and magnetic fusion. We compare performance between the vector-based ES and X1 and leading superscalar-based platforms: the IBM Power3/4 and the SGI Altix. Our research team was the first international group to conduct a performance evaluation study at the Earth Simulator Center; remote ES access is not available. Results demonstrate that the vector systems achieve excellent performance on our application suite, the highest of any architecture tested to date. However, vectorization of a particle-in-cell code highlights the potential difficulty of expressing irregularly structured algorithms as data-parallel programs.
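
The closing point about irregularly structured algorithms can be made concrete in a few lines. In a particle-in-cell charge deposition (illustrative code, not the paper's), the indirect write means two particles may target the same grid cell, so the compiler must assume a loop-carried dependence and cannot safely vectorize the scatter:

    /* Illustrative PIC charge-deposition scatter: grid[cell[p]] may alias
       across iterations, blocking straightforward vectorization. */
    void deposit(int np, const int *cell, const double *charge, double *grid)
    {
        for (int p = 0; p < np; p++)
            grid[cell[p]] += charge[p];
    }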


Conference on High Performance Computing (Supercomputing) | 2005

Leading Computational Methods on Scalar and Vector HEC Platforms

Leonid Oliker; Jonathan Carter; Michael F. Wehner; Andrew Canning; Stephane Ethier; Arthur A. Mirin; David Parks; Patrick H. Worley; Shigemune Kitawaki; Yoshinori Tsuda

The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end computing (HEC) platforms, primarily because of their generality, scalability, and cost effectiveness. However, the growing gap between sustained and peak performance for full-scale scientific applications on conventional supercomputers has become a major concern in high performance computing, requiring significantly larger systems and application scalability than implied by peak performance in order to achieve desired performance. The latest generation of custom-built parallel vector systems has the potential to address this issue for numerical algorithms with sufficient regularity in their computational structure. In this work we explore applications drawn from four areas: atmospheric modeling (CAM), magnetic fusion (GTC), plasma physics (LBMHD3D), and material science (PARATEC). We compare the performance of three leading commodity-based superscalar platforms utilizing the IBM Power3, Intel Itanium2, and AMD Opteron processors with modern parallel vector systems: the Cray X1, Earth Simulator (ES), and NEC SX-8. Additionally, we examine the performance of CAM on the recently-released Cray X1E. Our research team was the first international group to conduct a performance evaluation study at the Earth Simulator Center; remote ES access is not available. Our work builds on our previous efforts [16, 17] and makes several significant contributions: the first reported vector performance results for CAM simulations utilizing a finite-volume dynamical core on a high-resolution atmospheric grid; a new data-decomposition scheme for GTC that (for the first time) enables a breakthrough of the Teraflop barrier; the introduction of a new three-dimensional Lattice Boltzmann magneto-hydrodynamic implementation used to study the onset evolution of plasma turbulence that achieves over 26 Tflop/s on 4800 ES processors; and the largest PARATEC cell-size atomistic simulation to date. Overall, results show that the vector architectures attain unprecedented aggregate performance across our application suite, demonstrating the tremendous potential of modern parallel vector systems.


International Conference on Parallel Processing | 2005

Integrated performance monitoring of a cosmology application on leading HEC platforms

J. Borrill; Jonathan Carter; Leonid Oliker; David Skinner; Rupak Biswas

The cosmic microwave background (CMB) is an exquisitely sensitive probe of the fundamental parameters of cosmology. Extracting this information is computationally intensive, requiring massively parallel computing and sophisticated numerical algorithms. In this work we present MADbench, a lightweight version of the MADCAP CMB power spectrum estimation code that retains the operational complexity and integrated system requirements. In addition, to quantify communication behavior across a variety of architectural platforms, we introduce the integrated performance monitoring (IPM) package: a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads. A performance characterization study is conducted on some of the world's most powerful supercomputers, including the superscalar Seaborg (IBM Power3+) and CC-NUMA Columbia (SGI Altix), as well as the vector-based Earth Simulator (NEC SX-6 enhanced) and Phoenix (Cray X1) systems. In-depth analysis shows that in order to bridge the gap between theoretical and sustained system performance, it is critical to gain a clear understanding of how the distinct parts of large-scale parallel applications interact with the individual subcomponents of HEC platforms.
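
Interception of message-passing overheads of the kind IPM performs rests on the MPI standard's profiling interface: every MPI_ entry point has a name-shifted PMPI_ twin, so a tool can redefine the former and forward to the latter. A minimal sketch of that mechanism follows; the accounting is deliberately simplistic and is not IPM's actual bookkeeping.

    #include <mpi.h>

    /* Name-shifted (PMPI) interposition: redefine MPI_Send, time the call,
       and forward to PMPI_Send. Linking this object ahead of the MPI
       library routes every application MPI_Send through this wrapper. */
    static double send_seconds = 0.0;
    static long send_calls = 0;

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_seconds += MPI_Wtime() - t0;
        send_calls++;
        return rc;
    }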


Conference on High Performance Computing (Supercomputing) | 2003

Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations

Leonid Oliker; Andrew Canning; Jonathan Carter; John Shalf; David Skinner; Stephane Ethier; Rupak Biswas; Jahed Djomehri; Rob F. Van der Wijngaart

The growing gap between sustained and peak performance for scientific applications is a well-known problem in high-end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly outperforms the cache-based architectures. However, certain applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
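
The microbenchmark suite is not listed in the abstract; a representative example of the low-level characteristics such suites probe is sustained memory bandwidth, measured here with a STREAM-style triad. The array size and scalar are illustrative choices, not the paper's parameters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 24)   /* illustrative size, large enough to defeat caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* triad: two loads, one store */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* Print a result element so the loop cannot be optimized away. */
        printf("a[1] = %g, bandwidth = %.2f GB/s\n",
               a[1], 3.0 * N * sizeof(double) / s / 1e9);
        free(a); free(b); free(c);
        return 0;
    }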


Journal of Parallel and Distributed Computing | 2009

Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms

Samuel Williams; Jonathan Carter; Leonid Oliker; John Shalf; Katherine A. Yelick

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the high-performance computing (HPC) literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15 times improvement compared with the original code at a given concurrency. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
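
The abstract attributes LBMHD's poor scalar performance to its data structures and access patterns; the streaming phase of a lattice Boltzmann method illustrates why. In this generic D2Q9 sketch (not LBMHD's actual layout, which is three-dimensional and carries additional magnetic-field distributions), each of nine distribution components shifts to a different neighbor, producing nine differently-strided memory streams per sweep:

    #define NX 256
    #define NY 256
    #define Q  9   /* D2Q9: nine velocity directions per lattice site */

    static const int cx[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
    static const int cy[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

    static double f[Q][NY][NX], fnew[Q][NY][NX];

    /* Streaming step: component q at (x, y) receives the value that sat at
       the upwind neighbor (x - cx[q], y - cy[q]) on the previous step. */
    void stream(void)
    {
        for (int q = 0; q < Q; q++)
            for (int y = 1; y < NY - 1; y++)
                for (int x = 1; x < NX - 1; x++)
                    fnew[q][y][x] = f[q][y - cy[q]][x - cx[q]];
    }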


International Parallel and Distributed Processing Symposium | 2007

Scientific Application Performance on Candidate PetaScale Platforms

Leonid Oliker; Andrew Canning; Jonathan Carter; C. Iancu; Michael J. Lijewski; Shoaib Kamil; John Shalf; Hongzhang Shan; Erich Strohmaier; Stephane Ethier; Tom Goodale

After a decade where HEC (high-end computing) capability was dominated by the rapid pace of improvements to CPU clock frequency, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the tradeoffs of these system designs, in the context of high-end numerical simulations, is a key step towards making effective petascale computing a reality. This work represents one of the most comprehensive performance evaluation studies to date on modern HEC systems, including the IBM Power5, AMD Opteron, IBM BG/L, and Cray X1E. A novel aspect of our study is the emphasis on full applications, with real input data at the scale desired by computational scientists in their unique domain. We examine six candidate ultra-scale applications, representing a broad range of algorithms and computational structures. Our work includes the highest concurrency experiments to date on five of our six applications, including 32K processor scalability for two of our codes, and describes several successful optimization strategies on BG/L, as well as improved X1E vectorization. Overall results indicate that our evaluated codes have the potential to effectively utilize petascale resources; however, several applications would require reengineering to incorporate the additional levels of parallelism necessary to achieve the vast concurrency of upcoming ultra-scale systems.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Samuel Williams; Leonid Oliker; Jonathan Carter; John Shalf

We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and reduces energy requirements by up to 3.4× using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.
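
One of the programming models compared is the MPI/OpenMP hybrid. A minimal self-contained skeleton of that decomposition follows, with MPI ranks owning subdomains and OpenMP threads sharing the loop within each rank; the per-rank array and its size are illustrative, not taken from the paper.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define LOCAL_N 1000000   /* illustrative per-rank subdomain size */

    int main(int argc, char **argv)
    {
        int provided, rank;
        static double u[LOCAL_N];
        double local = 0.0, global = 0.0;

        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OpenMP threads split this rank's local work. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < LOCAL_N; i++) {
            u[i] = (rank + i) * 1e-6;
            local += u[i];
        }

        /* MPI combines partial results across ranks. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }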


IEEE International Conference on High Performance Computing, Data, and Analytics | 2004

A performance evaluation of the Cray X1 for scientific applications

Leonid Oliker; Rupak Biswas; J. Borrill; Andrew Canning; Jonathan Carter; M. Jahed Djomehri; Hongzhang Shan; David Skinner

The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors to build high-end capability and capacity computers, primarily because of their generality, scalability, and cost effectiveness. However, the recent development of massively parallel vector systems is having a significant effect on the supercomputing landscape. In this paper, we compare the performance of the recently-released Cray X1 vector system with that of the cacheless NEC SX-6 vector machine, and the superscalar cache-based IBM Power3 and Power4 architectures for scientific applications. Overall results demonstrate that the X1 is quite promising, but performance improvements are expected as the hardware, systems software, and numerical libraries mature. Code reengineering to effectively utilize the complex architecture may also lead to significant efficiency enhancements.

Collaboration


Dive into Jonathan Carter's collaboration.

Top Co-Authors

Leonid Oliker
Lawrence Berkeley National Laboratory

John Shalf
Lawrence Berkeley National Laboratory

Andrew Canning
Lawrence Berkeley National Laboratory

Stephane Ethier
Princeton Plasma Physics Laboratory

David Skinner
Lawrence Berkeley National Laboratory

Katherine A. Yelick
Lawrence Berkeley National Laboratory

Linda Vahala
Old Dominion University

Samuel Williams
Lawrence Berkeley National Laboratory