
Publication


Featured research published by Daniel Terpstra.


International Conference on Parallel Processing | 2012

Measuring Energy and Power with PAPI

Vincent M. Weaver; Matt Johnson; Kiran Kasichayanula; James Ralph; Piotr Luszczek; Daniel Terpstra; Shirley Moore

Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure.
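
As a rough illustration of the interface described above, the following minimal C sketch reads a package-energy value through the standard PAPI event-set calls. It assumes a PAPI 5 build with the rapl component enabled; the event name rapl:::PACKAGE_ENERGY:PACKAGE0 is only an example of what such a build typically exposes, and the actual names and units should be confirmed with papi_native_avail.

/* Sketch: reading an energy value through the standard PAPI event-set API,
 * assuming a PAPI build with the "rapl" component enabled. The event name
 * below is an example; availability depends on the CPU and PAPI version. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long energy_nj = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return EXIT_FAILURE;
    }
    if (PAPI_create_eventset(&evset) != PAPI_OK ||
        PAPI_add_named_event(evset, "rapl:::PACKAGE_ENERGY:PACKAGE0") != PAPI_OK) {
        fprintf(stderr, "RAPL energy event not available on this system\n");
        return EXIT_FAILURE;
    }

    PAPI_start(evset);
    sleep(1);                         /* region of interest goes here */
    PAPI_stop(evset, &energy_nj);

    /* RAPL package energy is typically reported in nanojoules. */
    printf("package energy: %lld nJ (~%.3f J)\n", energy_nj, energy_nj / 1e9);
    return EXIT_SUCCESS;
}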


Parallel Tools Workshop | 2010

Collecting Performance Data with PAPI-C

Daniel Terpstra; Heike Jagode; Haihang You; Jack J. Dongarra

Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface.
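
The component model described here can be explored programmatically. The short C sketch below, written against the PAPI-C component API (PAPI_num_components and PAPI_get_component_info), lists the components compiled into a given PAPI build; the struct fields printed come from PAPI_component_info_t and may vary slightly across PAPI versions.

/* Sketch: enumerate the components compiled into a PAPI-C build.
 * Assumes a component-aware PAPI (4.x or later); minimal error handling. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return EXIT_FAILURE;
    }

    int ncomp = PAPI_num_components();
    for (int cid = 0; cid < ncomp; cid++) {
        const PAPI_component_info_t *info = PAPI_get_component_info(cid);
        if (info == NULL)
            continue;
        /* Each component exposes its own pool of native events
         * (CPU core counters, GPU counters, network counters, ...). */
        printf("component %d: %s (%d native events)%s\n",
               cid, info->name, info->num_native_events,
               info->disabled ? " [disabled]" : "");
    }
    return EXIT_SUCCESS;
}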


International Parallel and Distributed Processing Symposium | 2003

Experiences and lessons learned with a portable interface to hardware performance counters

Jack J. Dongarra; Kevin S. London; Shirley Moore; Philip Mucci; Daniel Terpstra; Haihang You; Min Zhou

The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
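
For readers unfamiliar with the interface, a minimal C sketch of the portable preset events the paper refers to is shown below; it counts retired instructions and cycles around a trivial loop using the low-level PAPI calls. Error handling is kept to a minimum for brevity, and the presets map to different native events on each CPU.

/* Sketch: counting two PAPI presets around a measured region. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];
    volatile double x = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return EXIT_FAILURE;

    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);   /* retired instructions (preset) */
    PAPI_add_event(evset, PAPI_TOT_CYC);   /* total cycles (preset) */

    PAPI_start(evset);
    for (int i = 0; i < 1000000; i++)      /* code being measured */
        x += i * 0.5;
    PAPI_stop(evset, counts);

    printf("instructions: %lld  cycles: %lld  IPC: %.2f\n",
           counts[0], counts[1], (double)counts[0] / counts[1]);
    return EXIT_SUCCESS;
}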


International Symposium on Performance Analysis of Systems and Software | 2013

Non-determinism and overcount on modern hardware performance counter implementations

Vincent M. Weaver; Daniel Terpstra; Shirley Moore

Ideal hardware performance counters provide exact deterministic results. Real-world performance monitoring unit (PMU) implementations do not always live up to this ideal. Events that should be exact and deterministic (such as retired instructions) show run-to-run variation and overcount on x86_64 machines, even when run in strictly controlled environments. These effects are non-intuitive to casual users and cause difficulties when strict determinism is desirable, such as when implementing deterministic replay or deterministic threading libraries. We investigate eleven different x86_64 CPU implementations and discover the sources of divergence from expected count totals. Of all the counter events investigated, we find only a few that exhibit enough determinism to be used without adjustment in deterministic execution environments. We also briefly investigate ARM, IA64, POWER and SPARC systems and find that on these platforms the counter events have more determinism. We explore various methods of working around the limitations of the x86_64 events, but in many cases this is not possible and would require architectural redesign of the underlying PMU.
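
A simple way to observe the effect discussed here is to measure the same deterministic workload repeatedly and compare the retired-instruction counts. The C sketch below does this with PAPI_TOT_INS; it is an illustrative experiment in the spirit of the paper's methodology, not the authors' actual test harness, and the observed spread depends on the CPU and measurement environment.

/* Sketch: run-to-run variation of the retired-instruction count for a fixed
 * workload. On many x86_64 PMUs the counts differ slightly between runs
 * even though the code is identical. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static long long measure_once(int evset)
{
    long long count = 0;
    volatile long long sum = 0;

    PAPI_start(evset);
    for (long long i = 0; i < 10000000; i++)   /* fixed, deterministic work */
        sum += i;
    PAPI_stop(evset, &count);
    return count;
}

int main(void)
{
    int evset = PAPI_NULL;
    long long lo = 0, hi = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return EXIT_FAILURE;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);

    for (int run = 0; run < 10; run++) {
        long long c = measure_once(evset);
        if (run == 0 || c < lo) lo = c;
        if (run == 0 || c > hi) hi = c;
    }
    printf("retired instructions over 10 runs: min=%lld max=%lld spread=%lld\n",
           lo, hi, hi - lo);
    return EXIT_SUCCESS;
}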


International Conference on Parallel Processing | 2011

Evaluation of the HPC challenge benchmarks in virtualized environments

Piotr Luszczek; Eric Meek; Shirley Moore; Daniel Terpstra; Vincent M. Weaver; Jack J. Dongarra

This paper evaluates the performance of the HPC Challenge benchmarks in several virtual environments, including VMware, KVM and VirtualBox. The HPC Challenge benchmarks consist of a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance LINPACK (HPL) benchmark used in the TOP500 list. The tests include four local (matrix-matrix multiply, STREAM, RandomAccess and FFT) and four global (High Performance Linpack (HPL), parallel matrix transpose (PTRANS), RandomAccess and FFT) kernel benchmarks. The purpose of our experiments is to evaluate the overheads of the different virtual environments and investigate how different aspects of the system are affected by virtualization. We ran the benchmarks on an 8-core system with Core i7 processors using Open MPI. We did runs on the bare hardware and in each of the virtual environments for a range of problem sizes. As expected, the HPL results had some overhead in all the virtual environments, with the overhead becoming less significant with larger problem sizes. The RandomAccess results show drastically different behavior and we attempt to explain it with pertinent experiments. We show the cause of variability of performance results as well as major causes of measurement error.
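
Of the kernels listed, RandomAccess stresses the memory system in the way that proved most sensitive to virtualization. The C sketch below is a simplified, single-threaded version of its update loop; it is not the official HPCC implementation, and the table size and update count are arbitrary illustrative choices.

/* Sketch: a simplified RandomAccess (GUPS) style update loop. It only
 * illustrates the scattered 64-bit XOR updates into a large table that
 * defeat cache and TLB locality. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define LOG2_TABLE_SIZE 24                       /* 2^24 words = 128 MiB table */
#define TABLE_SIZE (1ULL << LOG2_TABLE_SIZE)
#define NUM_UPDATES (4 * TABLE_SIZE)

/* HPCC-style pseudo-random stream: shift left, XOR with a polynomial
 * whenever the high bit was set. */
static uint64_t next_random(uint64_t x)
{
    return (x << 1) ^ ((int64_t)x < 0 ? 0x0000000000000007ULL : 0ULL);
}

int main(void)
{
    uint64_t *table = malloc(TABLE_SIZE * sizeof *table);
    if (!table) return EXIT_FAILURE;

    for (uint64_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    uint64_t ran = 1;
    for (uint64_t i = 0; i < NUM_UPDATES; i++) {
        ran = next_random(ran);
        table[ran & (TABLE_SIZE - 1)] ^= ran;    /* scattered read-modify-write */
    }

    printf("table[0] = %llu\n", (unsigned long long)table[0]);
    free(table);
    return EXIT_SUCCESS;
}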


International Symposium on Performance Analysis of Systems and Software | 2013

PAPI 5: Measuring power, energy, and the cloud

Vincent M. Weaver; Daniel Terpstra; Heike McCraw; Matt Johnson; Kiran Kasichayanula; James Ralph; John S. Nelson; Philip Mucci; Tushar Mohan; Shirley Moore

The PAPI library [1] was originally developed to provide portable access to the hardware performance counters found on a diverse collection of modern microprocessors. Rather than learning and writing to a new performance infrastructure each time code is moved to a new machine, measurement code can be written to the PAPI API which abstracts away the underlying interface. Over time, other system components besides the processor have gained performance interfaces (for example, GPUs and network interfaces). PAPI was redesigned to have a component architecture to allow modular access to these new sources of performance data [2]. In addition to incremental changes in processor support, the recent PAPI 5 release adds support for two emerging concerns in the high-performance landscape: energy consumption and cloud computing. As processor densities climb, the thermal properties and energy usage of high performance systems are becoming increasingly important. We have extended the PAPI interface to simultaneously monitor processor metrics, thermal sensors, and power meters to provide clues for correlating algorithmic activity with thermal response and energy consumption. We have also extended PAPI to provide support for running inside of Virtual Machines (VMs). This ongoing work will enable developers to use PAPI to engage in performance analysis in a virtualized cloud environment.
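
One practical detail worth illustrating: a PAPI event set is bound to a single component, so correlating processor metrics with energy readings means running one event set per component side by side. The C sketch below does this for a CPU preset and a RAPL package-energy event; the RAPL event name is an assumption that only holds on machines where the rapl component is built and supported.

/* Sketch: correlating CPU activity with energy, using one event set per
 * component (CPU presets and RAPL cannot share an event set). */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int cpu_set = PAPI_NULL, rapl_set = PAPI_NULL;
    long long cycles = 0, energy_nj = 0;
    volatile double x = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return EXIT_FAILURE;

    PAPI_create_eventset(&cpu_set);
    PAPI_add_event(cpu_set, PAPI_TOT_CYC);

    PAPI_create_eventset(&rapl_set);
    if (PAPI_add_named_event(rapl_set, "rapl:::PACKAGE_ENERGY:PACKAGE0") != PAPI_OK) {
        fprintf(stderr, "no RAPL energy event available on this system\n");
        return EXIT_FAILURE;
    }

    PAPI_start(cpu_set);
    PAPI_start(rapl_set);
    for (long i = 0; i < 50000000; i++)        /* algorithmic activity */
        x += i * 0.5;
    PAPI_stop(rapl_set, &energy_nj);
    PAPI_stop(cpu_set, &cycles);

    printf("cycles: %lld  energy: %.3f J\n", cycles, energy_nj / 1e9);
    return EXIT_SUCCESS;
}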


2003 User Group Conference Proceedings | 2003

PAPI deployment, evaluation, and extensions

Shirley Moore; Daniel Terpstra; Kevin S. London; Philip Mucci; Patricia J. Teller; Leonardo Salayandia; Alonso Bayona; Manuel Nieto

PAPI is a cross-platform interface to the hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to processor functions. Monitoring these events has a variety of uses in application development, including performance modeling and optimization, debugging, and benchmarking. In addition to routines for accessing the counters, PAPI specifies a common set of performance metrics considered most relevant to analyzing and tuning application performance. These metrics include cycle and instruction counts, cache and memory access statistics, and functional unit and pipeline status, as well as relevant SMP cache coherence events. PAPI is becoming a de facto industry standard and has been incorporated into several third-party research and commercial performance analysis tools. As in any physical system, the act of measuring perturbs the phenomenon being measured. Discrepancies in hardware counts and counter-related profiling data can result from other causes as well. A PET-sponsored project is deploying PAPI and related tools on DoD HPC Center platforms and evaluating and interpreting performance counter data on those platforms.
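
The set of preset metrics actually available differs from platform to platform. The C sketch below enumerates the presets that can be counted on the current machine, similar in spirit to what the papi_avail utility reports, using PAPI_enum_event, PAPI_query_event and PAPI_event_code_to_name.

/* Sketch: list the PAPI preset events available on this machine. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return EXIT_FAILURE;

    int code = PAPI_PRESET_MASK;                /* first event in the preset space */
    char name[PAPI_MAX_STR_LEN];
    int available = 0;

    do {
        /* PAPI_query_event succeeds only if the preset can be counted here. */
        if (PAPI_query_event(code) == PAPI_OK &&
            PAPI_event_code_to_name(code, name) == PAPI_OK) {
            printf("%s\n", name);
            available++;
        }
    } while (PAPI_enum_event(&code, PAPI_ENUM_EVENTS) == PAPI_OK);

    printf("%d presets available on this platform\n", available);
    return EXIT_SUCCESS;
}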


International Conference on Parallel Processing | 2012

PAPI-V: Performance Monitoring for Virtual Machines

Matt Johnson; Heike McCraw; Shirley Moore; Philip Mucci; John S. Nelson; Daniel Terpstra; Vincent M. Weaver; Tushar Mohan

This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments.
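
As a small example of the timing side of these extensions, the C sketch below uses the long-standing PAPI timing routines PAPI_get_real_usec and PAPI_get_virt_usec around a workload; in a virtualized guest, a growing gap between real and virtual time is one coarse indicator of time the guest did not actually run.

/* Sketch: comparing real (wall-clock) and virtual (process) time with PAPI. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    volatile double x = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return EXIT_FAILURE;

    long long real0 = PAPI_get_real_usec();
    long long virt0 = PAPI_get_virt_usec();

    for (long i = 0; i < 50000000; i++)     /* workload under measurement */
        x += i * 0.5;

    long long real_us = PAPI_get_real_usec() - real0;
    long long virt_us = PAPI_get_virt_usec() - virt0;

    printf("real: %lld us  virtual: %lld us\n", real_us, virt_us);
    return EXIT_SUCCESS;
}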


International Supercomputing Conference | 2013

Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q

Heike McCraw; Daniel Terpstra; Jack J. Dongarra; Kris Davis; Roy G. Musselman

The Blue Gene/Q (BG/Q) system is the third generation in the IBM Blue Gene line of massively parallel, energy-efficient supercomputers, and it increases not only in size but also in complexity compared to its Blue Gene predecessors. Consequently, gaining insight into the intricate ways in which software and hardware interact requires richer and more capable performance analysis methods in order to improve the efficiency and scalability of applications that utilize this advanced system.


International Symposium on Performance Analysis of Systems and Software | 2014

MIAMI: A framework for application performance diagnosis

Gabriel Marin; Jack J. Dongarra; Daniel Terpstra

A typical application tuning cycle repeats the following three steps in a loop: performance measurement, analysis of results, and code refactoring. While performance measurement is well covered by existing tools, analysis of results to understand the main sources of inefficiency and to identify opportunities for optimization is generally left to the user. Today's state-of-the-art performance analysis tools use instrumentation or hardware counter sampling to measure the performance of interactions between code and the target architecture during execution. Such measurements are useful to identify hotspots in applications, places where execution time is spent or where cache misses are incurred. However, explanatory understanding of tuning opportunities requires a more detailed, mechanistic modeling approach. This paper presents MIAMI (Machine Independent Application Models for performance Insight), a set of tools for automatic performance diagnosis. MIAMI uses application characterization and models of target architectures to reason about an application's performance. MIAMI uses a modeling approach based on first-order principles to identify performance bottlenecks, pinpoint optimization opportunities, and compute bounds on the potential for improvement.
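
MIAMI itself is not sketched here, but the kind of first-order bound such mechanistic models compute can be illustrated with a simple roofline-style estimate: execution time is at least the larger of the compute-limited and the bandwidth-limited time. The C sketch below uses hypothetical machine parameters and is only a stand-in for the much richer modeling MIAMI performs.

/* Sketch: a roofline-style lower bound on execution time, illustrating a
 * first-order performance model (not MIAMI's actual model). */
#include <stdio.h>

/* Lower bound on execution time (seconds) from first-order principles. */
static double time_lower_bound(double flops, double bytes,
                               double peak_flops_per_s, double peak_bytes_per_s)
{
    double compute_time = flops / peak_flops_per_s;
    double memory_time  = bytes / peak_bytes_per_s;
    return compute_time > memory_time ? compute_time : memory_time;
}

int main(void)
{
    /* Example: a kernel doing 2e9 flops over 8e9 bytes of traffic on a
     * hypothetical machine with 100 GFLOP/s peak and 25 GB/s of bandwidth. */
    double bound = time_lower_bound(2e9, 8e9, 100e9, 25e9);
    printf("lower bound on kernel time: %.3f s (memory bound)\n", bound);
    return 0;
}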

Collaboration


Dive into Daniel Terpstra's collaborations.

Top Co-Authors

Shirley Moore (University of Texas at El Paso)

Philip Mucci (University of Tennessee)

Haihang You (University of Tennessee)

Heike McCraw (University of Tennessee)

James Ralph (University of Tennessee)