Network


Latest external collaborations at the country level.

Hotspot


Research topics where Erich Strohmaier is active.

Publication


Featured research published by Erich Strohmaier.


International Parallel and Distributed Processing Symposium | 2008

Power efficiency in high performance computing

Shoaib Kamil; John Shalf; Erich Strohmaier

After 15 years of exponential improvement in microprocessor clock rates, the physical principles allowing for Dennard scaling, which enabled performance improvements without a commensurate increase in power consumption, have all but ended. Until now, most HPC systems have not focused on power efficiency. However, as the cost of power reaches parity with capital costs, it is increasingly important to compare systems with metrics based on the sustained performance per watt. Therefore we need to establish practical methods to measure the power consumption of such systems in situ in order to support such metrics. Our study provides power measurements for various computational loads on the largest-scale HPC systems ever involved in such an assessment. This study demonstrates clearly that, contrary to conventional wisdom, the power consumed while running the High Performance Linpack (HPL) benchmark is very close to the power consumed by any subset of a typical compute-intensive scientific workload. Therefore HPL, which in most cases cannot serve as a suitable workload for performance measurements, can be used for the purposes of power measurement. Furthermore, we show through measurements on a large-scale system that the power consumed by smaller subsets of the system can be projected straightforwardly and accurately to estimate the power consumption of the full system. This allows a less invasive approach for determining the power consumption of large-scale systems.
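
As a concrete illustration of the projection the abstract describes, the sketch below scales per-node power measured on an instrumented subset up to the full machine and converts sustained performance into a performance-per-watt figure. The function names and all numbers are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: project full-system power from an instrumented subset
# and compute sustained performance per watt. All numbers are hypothetical.

def project_system_power(subset_watts, subset_nodes, total_nodes):
    """Scale the average power of a measured subset to the whole machine,
    assuming nodes draw roughly equal power under the same workload."""
    per_node = subset_watts / subset_nodes
    return per_node * total_nodes

def gflops_per_watt(sustained_gflops, system_watts):
    """Sustained performance per watt, the metric argued for above."""
    return sustained_gflops / system_watts

# Example: 32 instrumented nodes draw 9.6 kW during HPL on a 1,024-node
# system that sustains 180 TFLOP/s (all values made up for illustration).
watts = project_system_power(subset_watts=9600.0, subset_nodes=32, total_nodes=1024)
print(f"Projected system power: {watts / 1e3:.1f} kW")
print(f"Efficiency: {gflops_per_watt(180e3, watts):.2f} GFLOP/s per watt")
```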


Conference on High Performance Computing (Supercomputing) | 2005

Quantifying Locality In The Memory Access Patterns of HPC Applications

Jonathan Weinberg; Michael O. McCracken; Erich Strohmaier; Allan Snavely

Several benchmarks for measuring the memory performance of HPC systems along dimensions of spatial and temporal memory locality have recently been proposed. However, little is understood about the relationships of these benchmarks to real applications and to each other. We propose a methodology for producing architecture-neutral characterizations of the spatial and temporal locality exhibited by the memory access patterns of applications. We demonstrate that the results track intuitive notions of locality on several synthetic and application benchmarks. We employ the methodology to analyze the memory performance components of the HPC Challenge Benchmarks, the Apex-MAP benchmark, and their relationships to each other and other benchmarks and applications. We show that this analysis can be used to both increase understanding of the benchmarks and enhance their usefulness by mapping them, along with applications, to a 2-D space along axes of spatial and temporal locality.
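
To make the idea of an architecture-neutral characterization concrete, here is a minimal sketch of one way to score a raw address trace along the two axes. The scoring rules are simplified stand-ins chosen only to track intuitive notions of locality, not the paper's exact definitions.

```python
# Minimal sketch: score an address trace (in units of words) for spatial
# and temporal locality. Both scores are illustrative simplifications.

def spatial_score(trace, line_words=8):
    """Fraction of accesses within one cache line of the previous access
    (higher = more spatial locality)."""
    hits = sum(1 for a, b in zip(trace, trace[1:]) if abs(b - a) < line_words)
    return hits / max(len(trace) - 1, 1)

def temporal_score(trace):
    """Mean reuse score 1/(1 + reuse distance in accesses), averaged over
    all reuses (higher = tighter reuse, more temporal locality)."""
    last_seen, scores = {}, []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            scores.append(1.0 / (1 + i - last_seen[addr]))
        last_seen[addr] = i
    return sum(scores) / len(scores) if scores else 0.0

stream = list(range(1000))       # unit stride: high spatial, no reuse
hot_loop = [0, 1, 2, 3] * 250    # tiny working set: high temporal
print(spatial_score(stream), temporal_score(stream))
print(spatial_score(hot_loop), temporal_score(hot_loop))
```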


Computing in Science and Engineering | 2005

High-performance computing: clusters, constellations, MPPs, and future directions

Jack J. Dongarra; Thomas L. Sterling; Horst D. Simon; Erich Strohmaier

Last year’s paper by Bell and Gray [1] examined past trends in high performance computing and asserted likely future directions based on market forces. While many of the insights drawn from this perspective have merit and suggest elements governing likely future directions for HPC, there are a number of points put forth that we feel require further discussion and, in certain cases, suggest alternative, more likely views. One area of concern relates to the nature and use of key terms to describe and distinguish among classes of high-end computing systems, in particular the authors’ use of “cluster” to relate to essentially all parallel computers derived through the integration of replicated components. The taxonomy implicit in their previous paper, while arguable and supported by some elements of our community, fails to provide the essential semantic discrimination critical to the effectiveness of descriptive terms as tools in managing the conceptual space of consideration. In this paper, we present a perspective that retains the descriptive richness while providing a unifying framework. A second area of discourse that calls for additional commentary is the likely future path of system evolution that will lead to effective and affordable Petaflops-scale computing, including the future role of computer centers as facilities for supporting high performance computing environments. This paper addresses these key issues.


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2004

Architecture independent performance characterization and benchmarking for scientific applications

Erich Strohmaier; Hongzhang Shan

A simple, tunable, synthetic benchmark with a performance directly related to applications would be of great benefit to the scientific computing community. We present a novel approach to developing such a benchmark. The initial focus of this project is on the data access performance of scientific applications. First, a hardware independent characterization of code performance in terms of address streams is developed. The parameters chosen to characterize a single address stream are related to regularity, size, and spatial and temporal locality. These parameters are then used to implement a synthetic benchmark program that mimics the performance of a corresponding code. To test the validity of our approach we performed experiments using five test kernels on six different platforms. The performance of most of our test kernels can be approximated by a single synthetic address stream. However, in some cases, overlapping two address streams is necessary to achieve a good approximation.
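
The sketch below illustrates the synthesis direction the abstract describes: generating a single address stream from a handful of parameters (working-set size, block length for spatial locality, and a reuse probability for temporal locality) and replaying it against an array. This parameterization is a hypothetical simplification of the paper's characterization, and all names are illustrative.

```python
import random
import time

# Minimal sketch of a tunable synthetic address stream. `size` is the
# working set in elements, `block` the run of consecutive addresses
# (spatial locality), and `reuse` the probability of staying in the
# previous block (temporal locality). Illustrative, not the paper's model.

def synthetic_stream(n, size, block, reuse, seed=0):
    rng = random.Random(seed)
    base, out = 0, []
    while len(out) < n:
        if rng.random() >= reuse:                 # jump to a new block
            base = rng.randrange(0, size - block)
        out.extend(range(base, base + block))     # stride-1 run within block
    return out[:n]

def time_stream(stream, size):
    data = list(range(size))
    t0 = time.perf_counter()
    s = 0
    for a in stream:
        s += data[a]                              # one load per address
    return time.perf_counter() - t0

addrs = synthetic_stream(n=1_000_000, size=1 << 20, block=64, reuse=0.5)
print(f"{time_stream(addrs, 1 << 20):.3f} s")
```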


Parallel Computing | 1999

The marketplace of high-performance computing

Erich Strohmaier; Jack J. Dongarra; Hans Werner Meuer; Horst D. Simon

In this paper we analyze the major trends and changes in the High-Performance Computing (HPC) marketplace since the beginning of the journal ‘Parallel Computing’. The initial success of vector computers in the 1970s was driven by raw performance. The introduction of this type of computer system started the era of ‘Supercomputing’. In the 1980s the availability of standard development environments and of application software packages became more important. Next to performance, these factors determined the success of MP vector systems, especially with industrial customers. MPPs became successful in the early 1990s due to their better price/performance ratios, which was made possible by the attack of the ‘killer micros’. In the lower and medium market segments the MPPs were replaced by microprocessor-based symmetrical multiprocessor (SMP) systems in the middle of the 1990s. Their success formed the basis for the use of new cluster concepts for very high-end systems. In the last few years, only the companies which have entered the emerging markets for massively parallel database servers and financial applications attract enough business volume to be able to support the hardware development for the numerical high-end computing market as well. Success in the traditional floating-point-intensive engineering applications seems to be no longer sufficient for survival in the market.


Archive | 2014

HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems

Mark Adams; Jed Brown; John Shalf; Brian Van Straalen; Erich Strohmaier; Samuel Williams

This document provides an overview of the benchmark, HPGMG, for ranking large-scale general-purpose computers for use on the Top500 list [8]. We provide a rationale for the need for a replacement for the current metric, HPL, some background on the Top500 list, and the challenges of developing such a metric; we discuss our design philosophy and methodology, and give an overview of the specification of the benchmark. The primary documentation with maintained details on the specification can be found at hpgmg.org, and the Wiki and the benchmark code itself can be found in the repository https://bitbucket.org/hpgmg/hpgmg.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

Kamesh Madduri; Samuel Williams; Stephane Ethier; Leonid Oliker; John Shalf; Erich Strohmaier; Katherine Yelick

We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes.
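
One strategy in the design space the paper explores is replicating the grid per thread and reducing afterwards, which trades memory footprint for synchronization-free deposition. Below is a schematic of that idea on a simplified 2D mesh with cloud-in-cell (bilinear) weights; GTC's toroidal geometry and Larmor-radius averaging are deliberately omitted, and all names are illustrative.

```python
import numpy as np

# Schematic of grid-replication charge deposition on a simplified 2D mesh.
# Each "thread" deposits into a private grid copy with bilinear weights;
# the copies are then reduced. Not GTC's actual kernel.

def deposit(grid, xs, ys, qs):
    for x, y, q in zip(xs, ys, qs):
        i, j = int(x), int(y)               # lower-left grid point
        fx, fy = x - i, y - j               # fractional offsets
        grid[i, j]         += q * (1 - fx) * (1 - fy)
        grid[i + 1, j]     += q * fx * (1 - fy)
        grid[i, j + 1]     += q * (1 - fx) * fy
        grid[i + 1, j + 1] += q * fx * fy

def charge_density(nx, ny, xs, ys, qs, nthreads=4):
    # One private grid per thread avoids write conflicts entirely...
    privates = [np.zeros((nx, ny)) for _ in range(nthreads)]
    for t, (x_c, y_c, q_c) in enumerate(zip(np.array_split(xs, nthreads),
                                            np.array_split(ys, nthreads),
                                            np.array_split(qs, nthreads))):
        deposit(privates[t], x_c, y_c, q_c)
    # ...at the cost of an O(nthreads * gridsize) reduction afterwards.
    return sum(privates)

rng = np.random.default_rng(0)
n = 10_000
rho = charge_density(64, 64, rng.uniform(0, 62, n), rng.uniform(0, 62, n),
                     np.ones(n))
print(rho.sum())  # bilinear weights sum to 1, so total charge is ~10000.0
```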


International Conference on Performance Engineering | 2014

A power-measurement methodology for large-scale, high-performance computing

Thomas R. W. Scogland; Craig P. Steffen; Torsten Wilde; Florent Parent; Susan Coghlan; Natalie J. Bates; Wu-chun Feng; Erich Strohmaier

Improvement in the energy efficiency of supercomputers can be accelerated by improving the quality and comparability of efficiency measurements. The ability to generate accurate measurements at extreme scale is just now emerging. The realization of system-level measurement capabilities can be accelerated with a commonly adopted, high-quality measurement methodology for use while running a workload, typically a benchmark. This paper describes a methodology that has been developed collaboratively through the Energy Efficient HPC Working Group to support architectural analysis and comparative measurements for rankings such as the Top500 and Green500. To support measurements with varying amounts of effort and equipment, we present three distinct levels of measurement that provide increasing levels of accuracy. Level 1 is similar to the Green500 run rules today: a single average power measurement extrapolated from a subset of a machine. Level 2 is more comprehensive but still widely achievable. Level 3 is the most rigorous of the three methodologies and is only possible at a few sites; however, it generates a high-quality result that exposes details the other methodologies may miss. In addition, we present case studies from the Leibniz Supercomputing Centre (LRZ), Argonne National Laboratory (ANL), and Calcul Québec Université Laval that explore the benefits and difficulties of gathering high-quality, system-level measurements on large-scale machines.
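
For readers implementing such a methodology, the sketch below shows the basic arithmetic behind an average-power number: trapezoidal integration of timestamped power samples over the measured phase of a run. The sample data and function name are illustrative; the published methodology specifies the actual sampling rates, measurement points, and coverage requirements at each level.

```python
# Minimal sketch: average power over a run from timestamped samples, via
# trapezoidal integration of the power curve. The readings are made up.

def average_power(samples):
    """samples: list of (time_seconds, watts), sorted by time."""
    energy = 0.0  # joules
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    duration = samples[-1][0] - samples[0][0]
    return energy / duration

readings = [(0.0, 905.0), (10.0, 962.0), (20.0, 948.0), (30.0, 951.0)]
print(f"{average_power(readings):.1f} W over {readings[-1][0]:.0f} s")
```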


Conference on High Performance Computing (Supercomputing) | 2005

Apex-Map: A Global Data Access Benchmark to Analyze HPC Systems and Parallel Programming Paradigms

Erich Strohmaier; Hongzhang Shan

The memory wall and global data movement have become the dominant performance bottlenecks for many scientific applications. New characterizations of data access streams, and related benchmarks to measure their performance, are therefore needed to compare HPC systems, software, and programming paradigms effectively. In this paper, we introduce a novel global data access benchmark, Apex-Map. It is a parameterized synthetic performance probe that integrates concepts of temporal and spatial locality into its design. We measured Apex-Map performance across a whole range of temporal and spatial localities on several advanced processors and parallel computing platforms and used the generated performance surfaces for performance comparisons and to study the characteristics of these different architectures. We demonstrate that the results of Apex-Map clearly reflect many specific characteristics of the systems used. We also show the utility of Apex-Map for analyzing the performance effects of three leading parallel programming models and demonstrate their relative merits.
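
A performance surface like the one described can be produced by sweeping a probe's two locality parameters and timing each point. The sketch below does this with a deliberately simplified probe: a plain reuse probability stands in for Apex-Map's power-law temporal-locality parameter, and a contiguous access length stands in for its spatial-locality parameter.

```python
import random
import time

# Minimal sketch of a two-parameter locality sweep producing a
# performance surface. `alpha` is a simple reuse probability standing in
# for Apex-Map's power-law temporal parameter; `L` is the contiguous
# access length standing in for its spatial parameter.

def probe(mem, n_total, L, alpha, rng):
    base, s = 0, 0
    for _ in range(n_total // L):          # constant total work per point
        if rng.random() >= alpha:          # low alpha: jump to a new spot
            base = rng.randrange(0, len(mem) - L)
        for k in range(L):                 # stride-1 run of length L
            s += mem[base + k]
    return s

mem = list(range(1 << 20))
rng = random.Random(1)
for alpha in (0.0, 0.5, 0.9):              # temporal axis
    for L in (1, 16, 256):                 # spatial axis
        t0 = time.perf_counter()
        probe(mem, 200_000, L, alpha, rng)
        dt = time.perf_counter() - t0
        print(f"alpha={alpha:.1f} L={L:3d}: {dt * 1e9 / 200_000:.1f} ns/access")
```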


Archive | 2007

Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture

John M. Levesque; Jeff Larkin; Martyn Foster; Joe Glenski; Garry Geissler; Stephen Whalen; Brian Waldecker; Jonathan Carter; David Skinner; Helen He; Harvey Wasserman; John Shalf; Hongzhang Shan; Erich Strohmaier

Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core, and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.
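
The bandwidth-contention effect the report analyzes can be demonstrated in miniature: run identical streaming kernels on a growing number of cores and watch aggregate bandwidth saturate. Below is a hedged sketch using a STREAM-triad-like kernel in separate processes; absolute numbers from Python are rough, so only the scaling trend is meaningful.

```python
import time
from multiprocessing import Pool

import numpy as np

# Sketch: a STREAM-triad-like kernel (a = b + s*c) run in N separate
# processes to expose shared memory-bandwidth contention: if aggregate
# GB/s stops scaling with N, the cores are contending for bandwidth.

N = 5_000_000                      # elements per worker (~40 MB per array)

def triad(_):
    b = np.ones(N)
    c = np.ones(N)
    t0 = time.perf_counter()
    a = b + 3.0 * c                # moves roughly 3 arrays * 8 bytes * N
    dt = time.perf_counter() - t0
    assert a[0] == 4.0             # keep the result live
    return 3 * 8 * N / dt / 1e9    # GB/s achieved by this worker

if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        with Pool(workers) as pool:
            rates = pool.map(triad, range(workers))
        print(f"{workers} workers: {sum(rates):5.1f} GB/s aggregate")
```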

Collaboration


Dive into Erich Strohmaier's collaboration.

Top Co-Authors

Horst D. Simon, Lawrence Berkeley National Laboratory
John Shalf, Lawrence Berkeley National Laboratory
David H. Bailey, Lawrence Berkeley National Laboratory
Khaled Z. Ibrahim, Lawrence Berkeley National Laboratory
Leonid Oliker, Lawrence Berkeley National Laboratory
Samuel Williams, Lawrence Berkeley National Laboratory