John D. McCalpin
IBM
Publication
Featured research published by John D. McCalpin.
IEEE International Conference on High Performance Computing, Data and Analytics | 2010
Martin Burtscher; Byoung-Do Kim; Jeffrey R. Diamond; John D. McCalpin; Lars Koesterke; James C. Browne
HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates performance optimization. The readily available performance evaluation tools require considerable effort to learn and utilize. Hence, most HPC application writers do not use them. As a remedy, we have developed PerfExpert, a tool that combines a simple user interface with a sophisticated analysis engine to detect probable core, socket, and node-level performance bottlenecks in each important procedure and loop of an application. For each bottleneck, PerfExpert provides a concise performance assessment and suggests steps that can be taken by the programmer to improve performance. These steps include compiler switches and optimization strategies with code examples. We have applied PerfExpert to several HPC production codes on the Ranger supercomputer. In all cases, it correctly identified the critical code sections and provided accurate assessments of their performance.
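As an illustration of the node-level advice described above, the sketch below shows a loop-interchange rewrite of the kind a tool such as PerfExpert typically recommends alongside compiler switches. The example is hypothetical and is not taken from the paper.

    /* Hypothetical sketch of a data-locality rewrite of the kind such a tool
     * suggests (not an excerpt from the paper). Interchanging the loops turns
     * a strided traversal of a row-major C array into a unit-stride one,
     * reducing cache and TLB misses in the inner loop. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N], b[N][N];

    /* Before: the inner loop walks down a column, touching a new cache line
     * (and, for large N, a new page) on every iteration. */
    void scale_strided(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 2.0 * b[i][j];
    }

    /* After: the inner loop walks contiguous memory, so hardware prefetchers
     * and full cache lines are used effectively. */
    void scale_unit_stride(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 2.0 * b[i][j];
    }

    int main(void) {
        scale_strided();
        scale_unit_stride();
        printf("%f\n", a[N - 1][N - 1]);
        return 0;
    }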
IBM Journal of Research and Development | 2005
Harry M. Mathis; Alex E. Mericas; John D. McCalpin; Richard J. Eickemeyer; Steven R. Kunkel
Coarse-grained multithreading, the switching of threads to avoid idle processor time during long-latency events, has been available on IBM systems since 1998. Simultaneous multithreading (SMT), first available on the POWER5™ processor, moves beyond simple thread switching to the maintenance of two thread streams that are issued as continuously as possible to ensure the maximum use of processor resources. Because SMT has the potential to increase processor efficiency and correspondingly increase the amount of work done in a given time span, the reader might suppose that SMT would exhibit a performance gain for all workloads. This is true for most workloads, but is not true in some exceptional cases. In SMT mode, the processor resources (register sets, caches, queues, translation buffers, and the system memory nest) must be shared by both threads, and conditions can occur that degrade or even eliminate the SMT performance improvement. The POWER4™ and POWER5 processors have very powerful performance monitor (PM) toolsets that can help the user determine what is occurring in workloads that may not be providing the expected SMT gains. In this paper, measured differences among workloads having large, medium, small, and even negative SMT performance gains are presented, along with an approach to investigating workloads to determine the source of SMT performance gain limits.
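A minimal worked example of how an SMT gain figure of the kind discussed above is computed from measured throughputs follows; the numbers are invented for illustration and are not measurements from the paper.

    /* Hypothetical numbers, for illustration only: an SMT gain is the relative
     * change in aggregate throughput when a core runs two hardware threads
     * instead of one. Contention for shared resources can make it negative. */
    #include <stdio.h>

    int main(void) {
        double st_rate  = 100.0;   /* assumed work units/s, single-threaded mode */
        double smt_rate = 125.0;   /* assumed work units/s, both SMT threads combined */

        double smt_gain = (smt_rate - st_rate) / st_rate;
        printf("SMT gain: %+.0f%%\n", smt_gain * 100.0);   /* prints +25% here */
        return 0;
    }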
International Symposium on Performance Analysis of Systems and Software | 2011
Jeffrey R. Diamond; Martin Burtscher; John D. McCalpin; Byoung-Do Kim; Stephen W. Keckler; James C. Browne
The computation nodes of modern supercomputers commonly consist of multiple multicore processors. Maximizing the performance of such systems requires measurement, analysis, and optimization techniques that specifically target multicore environments. This paper first examines traditional unicore metrics and demonstrates how they can be misleading in a multicore system. Second, it examines and characterizes performance bottlenecks specific to multicore-based systems. Third, it describes performance measurement challenges that arise in multicore systems and outlines methods for extracting sound measurements that lead to performance optimization opportunities. The measurement and analysis process is based on a case study of the HOMME atmospheric modeling benchmark code from NCAR running on supercomputers built upon AMD Barcelona and Intel Nehalem quad-core processors. Applying the multicore bottleneck analysis to HOMME led to multicore-aware source-code optimizations that increased performance by up to 35%. While the case studies were carried out on multichip nodes of supercomputers using an HPC application as the target for optimization, the pitfalls identified and the insights obtained should apply to any system that is composed of multicore processors.
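One bottleneck of the multicore-specific kind characterized in the paper is contention on shared cache lines. The sketch below is a generic false-sharing demonstration, not the HOMME case study, showing how padding per-thread data onto separate cache lines removes coherence traffic that per-core unicore metrics would not attribute correctly.

    /* Generic false-sharing sketch (not from the HOMME case study). Two OpenMP
     * threads increment adjacent counters: when both counters share one cache
     * line, every update forces coherence traffic between the cores; padding
     * each counter onto its own line removes it. Compile with -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define ITERS 100000000L

    struct padded { volatile long v; char pad[CACHE_LINE - sizeof(long)]; };

    volatile long shared_line[2];     /* both counters in one cache line */
    struct padded padded_ctr[2];      /* one counter per cache line      */

    int main(void) {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) shared_line[t]++;
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) padded_ctr[t].v++;
        }
        double t2 = omp_get_wtime();

        printf("shared line: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
        return 0;
    }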
Application-Specific Systems, Architectures and Processors | 2013
Ardavan Pedram; John D. McCalpin; Andreas Gerstlauer
This paper considers the modifications required to transform a highly efficient, specialized linear algebra core into an efficient engine for computing Fast Fourier Transforms (FFTs). We review the minimal changes required to support Radix-4 FFT computations and propose extensions to the micro-architecture of the baseline linear algebra core. Along the way, we study the critical differences between the two classes of algorithms. Special attention is paid to the configuration of the on-chip memory system to support high utilization. We examine design trade-offs between efficiency, specialization, and flexibility, and their effects on both the core and the memory hierarchy for a unified design as compared to dedicated accelerators for each application. The final design is a flexible architecture that can perform both classes of applications. Results show that the proposed hybrid FFT/linear algebra core can achieve 26.6 GFLOPS with a power efficiency of 40 GFLOPS/W, making it up to 100× and 40× more energy efficient than cutting-edge CPUs and GPUs, respectively.
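For readers unfamiliar with the workload, a radix-4 butterfly (a 4-point DFT followed by twiddle-factor multiplies) is the basic operation the extended core must support beyond the fused multiply-adds of matrix computations. The sketch below is a plain C illustration of that operation, not code from the paper or from the core's microarchitecture.

    /* A minimal sketch, not from the paper: one decimation-in-frequency
     * radix-4 butterfly, i.e. a 4-point DFT with twiddle factors w1, w2, w3
     * applied to the non-trivial outputs. */
    #include <complex.h>
    #include <stdio.h>

    static void radix4_butterfly(double complex x[4],
                                 double complex w1,
                                 double complex w2,
                                 double complex w3) {
        double complex a = x[0] + x[2];
        double complex b = x[0] - x[2];
        double complex c = x[1] + x[3];
        double complex d = x[1] - x[3];

        x[0] = a + c;                 /* X[0] = x0 + x1 + x2 + x3                 */
        x[1] = (b - I * d) * w1;      /* X[1] = x0 - j*x1 - x2 + j*x3, times w1   */
        x[2] = (a - c) * w2;          /* X[2] = x0 - x1 + x2 - x3,     times w2   */
        x[3] = (b + I * d) * w3;      /* X[3] = x0 + j*x1 - x2 - j*x3, times w3   */
    }

    int main(void) {
        double complex x[4] = { 1.0, 2.0, 3.0, 4.0 };
        radix4_butterfly(x, 1.0, 1.0, 1.0);   /* unit twiddles: plain 4-point DFT */
        for (int k = 0; k < 4; k++)
            printf("X[%d] = %5.1f %+5.1fi\n", k, creal(x[k]), cimag(x[k]));
        return 0;
    }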
Signal Processing Systems | 2014
Ardavan Pedram; John D. McCalpin; Andreas Gerstlauer
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. Our technology evaluation shows that, when configured with 12 MiB of on-chip SRAM, the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.
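The quoted efficiencies imply modest power and area budgets. The back-of-the-envelope check below derives them from the numbers in the abstract; the resulting watt and square-millimeter values are simple back-calculations, not figures stated in the paper.

    /* Back-of-the-envelope check: derive the implied power and area budgets
     * from the efficiencies quoted in the abstract. The derived values are
     * back-calculations, not figures stated in the paper. */
    #include <stdio.h>

    int main(void) {
        double gflops         = 388.0;   /* sustained double-precision performance */
        double gflops_per_w   = 30.0;    /* quoted power efficiency */
        double gflops_per_mm2 = 2.66;    /* quoted area efficiency  */
        int    cores          = 16;

        printf("implied power:  %.1f W\n",      gflops / gflops_per_w);    /* ~12.9 W      */
        printf("implied area:   %.1f mm^2\n",   gflops / gflops_per_mm2);  /* ~145.9 mm^2  */
        printf("per-core rate:  %.2f GFLOPS\n", gflops / cores);           /* 24.25 GFLOPS */
        return 0;
    }

The per-core figure is broadly consistent with the roughly 26.6 GFLOPS single-core design described in the preceding entry.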
Archive | 1995
John D. McCalpin
Archive | 2003
Ravi Kumar Arimilli; John D. McCalpin; Francis Patrick O'Connell; William J. Starke
Archive | 2003
John D. McCalpin; Derek Edward Williams; Kenneth L. Wright; Balaram Sinharoy
Archive | 2004
Roch Georges Archambault; Robert James Blainey; Yaoqing Gao; John D. McCalpin; Francis Patrick O'Connell; Pascal Vezolle; Steven Wayne White
Archive | 2006
John D. McCalpin; William J. Starke; Jeffrey A. Stuecheli; Derek Edward Williams