Philip Mucci
University of Tennessee
Publications
Featured research published by Philip Mucci.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2000
Shirley Browne; Jack J. Dongarra; Nathan Garner; George T. S. Ho; Philip Mucci
The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals and states related to the processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis, including hand tuning, compiler optimization, debugging, benchmarking, monitoring, and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development toward alleviating commonly occurring bottlenecks in high performance computing.
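As a rough illustration of the interface described in this abstract, the sketch below uses the PAPI low-level calls to count total cycles and completed instructions around a placeholder loop. The loop and the choice of preset events are illustrative, not from the paper, and preset availability depends on the processor.

```c
/* Minimal sketch (assumption: not code from the paper) of the PAPI
 * low-level API: count total cycles and completed instructions around a
 * region of code. Preset availability depends on the processor. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&eventset) != PAPI_OK)
        exit(1);

    /* Two of the preset events defined by the PAPI specification. */
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles           */
    PAPI_add_event(eventset, PAPI_TOT_INS);   /* completed instructions */

    PAPI_start(eventset);
    volatile double x = 0.0;                  /* placeholder workload */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(eventset, counts);

    printf("cycles: %lld  instructions: %lld  CPI: %.2f\n",
           counts[0], counts[1], (double) counts[0] / (double) counts[1]);
    return 0;
}
```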
International Parallel and Distributed Processing Symposium | 2003
Jack J. Dongarra; Kevin S. London; Shirley Moore; Philip Mucci; Daniel Terpstra; Haihang You; Min Zhou
The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
International Conference on Computational Science | 2004
Jack J. Dongarra; Shirley Moore; Philip Mucci; Keith Seymour; Haihang You
We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions.
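The sketch below is in the spirit of such a counter-based microbenchmark (assumption: it is not the authors' code). It sweeps the stride of a traversal over a buffer much larger than the L1 data cache while counting L1 data-cache misses; the stride at which the total miss count begins to fall approximates the cache line size.

```c
/* Counter-based cache-line-size microbenchmark sketch. Total misses stay
 * roughly constant (about one per cache line touched) while the stride is
 * below the line size, then begin to fall once the stride exceeds it,
 * because fewer distinct lines are touched. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <papi.h>

#define BUF_BYTES (64UL * 1024 * 1024)   /* far larger than any L1 cache */

int main(void)
{
    char *buf = malloc(BUF_BYTES);
    int eventset = PAPI_NULL;
    long long misses;

    if (buf == NULL)
        return 1;
    memset(buf, 1, BUF_BYTES);

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM);   /* L1 data-cache misses */

    for (size_t stride = 4; stride <= 512; stride *= 2) {
        volatile char sink = 0;
        PAPI_start(eventset);
        for (size_t i = 0; i < BUF_BYTES; i += stride)
            sink += buf[i];
        PAPI_stop(eventset, &misses);
        printf("stride %4zu bytes: %lld L1D misses\n", stride, misses);
    }
    free(buf);
    return 0;
}
```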
International Symposium on Performance Analysis of Systems and Software | 2013
Vincent M. Weaver; Daniel Terpstra; Heike McCraw; Matt Johnson; Kiran Kasichayanula; James Ralph; John S. Nelson; Philip Mucci; Tushar Mohan; Shirley Moore
The PAPI library [1] was originally developed to provide portable access to the hardware performance counters found on a diverse collection of modern microprocessors. Rather than learning and writing to a new performance infrastructure each time code is moved to a new machine, measurement code can be written to the PAPI API which abstracts away the underlying interface. Over time, other system components besides the processor have gained performance interfaces (for example, GPUs and network interfaces). PAPI was redesigned to have a component architecture to allow modular access to these new sources of performance data [2]. In addition to incremental changes in processor support, the recent PAPI 5 release adds support for two emerging concerns in the high-performance landscape: energy consumption and cloud computing. As processor densities climb, the thermal properties and energy usage of high performance systems are becoming increasingly important. We have extended the PAPI interface to simultaneously monitor processor metrics, thermal sensors, and power meters to provide clues for correlating algorithmic activity with thermal response and energy consumption. We have also extended PAPI to provide support for running inside of Virtual Machines (VMs). This ongoing work will enable developers to use PAPI to engage in performance analysis in a virtualized cloud environment.
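A minimal sketch of reading an energy counter through a PAPI component is shown below. The component event name rapl:::PACKAGE_ENERGY:PACKAGE0 is an assumption: component event names vary with the PAPI build and the hardware, and the rapl component usually requires elevated privileges.

```c
/* Sketch of reading energy through a PAPI component, in the spirit of the
 * PAPI 5 features described above. The rapl event name is an assumption
 * and depends on the PAPI build and hardware. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long energy;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    if (PAPI_add_named_event(eventset, "rapl:::PACKAGE_ENERGY:PACKAGE0") != PAPI_OK) {
        fprintf(stderr, "rapl component event not available on this system\n");
        return 1;
    }

    PAPI_start(eventset);
    volatile double x = 0.0;                 /* phase whose energy is of interest */
    for (long i = 0; i < 50000000; i++)
        x += i;
    PAPI_stop(eventset, &energy);

    /* RAPL package energy is typically reported in nanojoules. */
    printf("package energy: %.3f J\n", (double) energy / 1e9);
    return 0;
}
```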
2003 User Group Conference Proceedings | 2003
Shirley Moore; Daniel Terpstra; Kevin S. London; Philip Mucci; Patricia J. Teller; Leonardo Salayandia; Alonso Bayona; Manuel Nieto
PAPI is a cross-platform interface to the hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to processor functions. Monitoring these events has a variety of uses in application development, including performance modeling and optimization, debugging, and benchmarking. In addition to routines for accessing the counters, PAPI specifies a common set of performance metrics considered most relevant to analyzing and tuning application performance. These metrics include cycle and instruction counts, cache and memory access statistics, and functional unit and pipeline status, as well as relevant SMP cache coherence events. PAPI is becoming a de facto industry standard and has been incorporated into several third-party research and commercial performance analysis tools. As in any physical system, the act of measuring perturbs the phenomenon being measured. Discrepancies in hardware counts and counter-related profiling data can result from other causes as well. A PET-sponsored project is deploying PAPI and related tools on DoD HPC Center platforms and evaluating and interpreting performance counter data on those platforms.
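The sketch below checks which of a few of the preset metrics mentioned in the abstract are implemented on the current platform; the particular presets listed are illustrative, since the mapping onto native counters differs between processors.

```c
/* Query a handful of PAPI preset metrics for availability. The list of
 * presets below is an illustrative selection, not exhaustive. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int presets[] = { PAPI_TOT_CYC,   /* total cycles            */
                      PAPI_TOT_INS,   /* completed instructions  */
                      PAPI_L1_DCM,    /* L1 data-cache misses    */
                      PAPI_L2_TCM,    /* L2 total cache misses   */
                      PAPI_TLB_DM };  /* data TLB misses         */
    char name[PAPI_MAX_STR_LEN];

    PAPI_library_init(PAPI_VER_CURRENT);
    for (unsigned i = 0; i < sizeof(presets) / sizeof(presets[0]); i++) {
        PAPI_event_code_to_name(presets[i], name);
        printf("%-12s %s\n", name,
               PAPI_query_event(presets[i]) == PAPI_OK ? "available"
                                                       : "not available");
    }
    return 0;
}
```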
International Conference on Parallel Processing | 2012
Matt Johnson; Heike McCraw; Shirley Moore; Philip Mucci; John S. Nelson; Daniel Terpstra; Vincent M. Weaver; Tushar Mohan
This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments.
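As an illustration of the timing side of the interface, the generic PAPI sketch below (assumption: not code from the paper) compares wall-clock and process-virtual time around a workload; in a contended or virtualized environment the two can diverge because the guest or process may be descheduled.

```c
/* Generic PAPI timing sketch: wall-clock vs. process-virtual time.
 * PAPI-V extends these routines to behave sensibly inside virtual machines. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    PAPI_library_init(PAPI_VER_CURRENT);

    long long real0 = PAPI_get_real_usec();   /* wall-clock microseconds      */
    long long virt0 = PAPI_get_virt_usec();   /* process-virtual microseconds */

    volatile double x = 0.0;                  /* workload of interest */
    for (long i = 0; i < 10000000; i++)
        x += i;

    long long real1 = PAPI_get_real_usec();
    long long virt1 = PAPI_get_virt_usec();

    printf("real: %lld us   virtual: %lld us\n", real1 - real0, virt1 - virt0);
    return 0;
}
```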
European Conference on Parallel Processing | 2005
Philip Mucci; Daniel Ahlin; Johan Danielsson; Per Ekman; Lars Malinowski
We present PerfMiner, a system for the transparent collection, storage, and presentation of thread-level hardware performance data across an entire cluster. Every sub-process/thread spawned by the user through the batch system is measured with near-zero overhead and no dilation of run-time. Performance metrics are collected at the thread level using a tool built on top of the Performance Application Programming Interface (PAPI). As the hardware counters are virtualized by the OS, the resulting counts are largely unaffected by other kernel or user processes. PerfMiner correlates this performance data with metadata from the batch system and places it in a database. Through a command-line and web interface, the user can query the database for information on everything from overall workload characterization and system utilization to the performance of a single thread in a specific application. This is in contrast to other monitoring systems that report aggregate system-wide metrics sampled over a period of time. In this paper, we describe our implementation of PerfMiner and present results from its test deployment across three different clusters at the Center for Parallel Computers at the Royal Institute of Technology in Stockholm, Sweden.
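The sketch below illustrates the kind of per-thread counter collection that PerfMiner automates; it is generic PAPI/pthreads code under that assumption, not PerfMiner itself.

```c
/* Per-thread counter collection sketch. Each thread creates its own event
 * set, so the OS-virtualized counters are attributed to the thread that
 * started them. Link with -lpapi -lpthread. */
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <papi.h>

static void *worker(void *arg)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    PAPI_register_thread();
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);
    PAPI_add_event(eventset, PAPI_TOT_INS);

    PAPI_start(eventset);
    volatile double x = 0.0;                 /* per-thread workload */
    for (long i = 0; i < 5000000; i++)
        x += i;
    PAPI_stop(eventset, counts);

    printf("thread %ld: %lld cycles, %lld instructions\n",
           (long) (intptr_t) arg, counts[0], counts[1]);
    PAPI_unregister_thread();
    return NULL;
}

int main(void)
{
    pthread_t tid[4];

    PAPI_library_init(PAPI_VER_CURRENT);
    /* Tell PAPI how to identify threads before any thread uses counters. */
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);

    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, (void *) (intptr_t) i);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```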
European Conference on Parallel Processing | 2009
Karl Fürlinger; Daniel Terpstra; Haihang You; Philip Mucci; Shirley Moore
An interesting and as yet under-represented aspect of program development and optimization is data structures. Instead of analyzing data with respect to code regions, the objective here is to see how performance metrics are related to data structures. With the advanced performance monitoring unit of Intel's Itanium processor series, such an analysis becomes possible. This paper describes how the hardware features of the Itanium 2 processor are exploited by the perfmon and PAPI performance monitoring APIs and how PAPI's support for address range restrictions has been integrated into an existing profiling tool to achieve the goal of data-structure-oriented profiling in the context of OpenMP applications.
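A hedged sketch of address-range-restricted counting follows. The PAPI_DATA_ADDRESS option and the addr fields of PAPI_option_t are assumptions based on PAPI's documented Itanium support; on other platforms the PAPI_set_opt call is expected to fail.

```c
/* Sketch of data-structure-oriented counting via address range restriction,
 * as described above for the Itanium 2. PAPI_DATA_ADDRESS and the addr
 * fields of PAPI_option_t are assumptions from the PAPI documentation for
 * Itanium builds; elsewhere PAPI_set_opt is expected to return an error. */
#include <stdio.h>
#include <sys/types.h>
#include <papi.h>

#define N (1 << 20)

static double a[N];                          /* the data structure of interest */

int main(void)
{
    int eventset = PAPI_NULL;
    long long misses;
    PAPI_option_t opt;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM);

    /* Restrict data-address counting to the range occupied by a[]. */
    opt.addr.eventset = eventset;
    opt.addr.start = (caddr_t) &a[0];
    opt.addr.end   = (caddr_t) &a[N];
    if (PAPI_set_opt(PAPI_DATA_ADDRESS, &opt) != PAPI_OK)
        fprintf(stderr, "address range restriction not supported on this platform\n");

    PAPI_start(eventset);
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;
    PAPI_stop(eventset, &misses);

    printf("L1D misses attributed to a[]: %lld\n", misses);
    return 0;
}
```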
Conference on High Performance Computing (Supercomputing) | 2006
Philip Mucci; Shirley Moore
The PAPI cross-platform interface to the hardware performance counters available on most microprocessors has become widely adopted and used for application performance analysis. PAPI is now incorporated into a number of end-user performance analysis tools, including both research and vendor tools. In years past, the PAPI Users BOF has provided an excellent opportunity for interaction between PAPI developers, performance tool developers, and other users of PAPI. Several suggestions from past BOFs have been implemented in subsequent versions of PAPI. The purpose of this BOF will be to present the features of the latest release of PAPI and to get feedback from tool developers and PAPI end-users on these features and on future directions for PAPI. Support for emerging architectures and interfaces, such as the Cell multiprocessor, and the perfmon2 effort to implement a standard Linux kernel interface to access the hardware counters, will also be discussed.
Parallel Computational Fluid Dynamics 1999: Towards Teraflops, Optimization and Novel Formulations | 2000
Punyam Satya-narayana; Ravikanth V. R. Avancha; Philip Mucci
Large eddy simulation (LES) is currently one of the popular approaches for the numerical simulation of turbulent flows. The LES code considered here, written in FORTRAN, has been well optimized for use on vector-processor machines such as the CRAY C90/T90. The increasing popularity and availability of relatively cost-effective machines using the RISC-based NUMA architecture, such as the Origin2000, make it necessary to migrate codes from the CRAY C90/T90. It is well known that CFD codes are among the toughest classes of problems to port and optimize for RISC-based NUMA architectures. This chapter discusses strategies adopted towards the creation of a shared memory version of the code and its optimization. Large eddy simulations of the turbulent flow in a channel are then performed on an Origin2000 system, and the corresponding results are compared with those from simulations on a CRAY T90 to check for their accuracy. Scaling studies from the parallel version are also presented, demonstrating the importance of cache optimization on NUMA machines.