Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Philip C. Roth is active.

Publication


Featured research published by Philip C. Roth.


Architectural Support for Programming Languages and Operating Systems | 2010

The Scalable Heterogeneous Computing (SHOC) benchmark suite

Anthony Danalis; Gabriel Marin; Collin McCurdy; Jeremy S. Meredith; Philip C. Roth; Kyle Spafford; Vinod Tipparaju; Jeffrey S. Vetter

Scalable heterogeneous computing systems, which are composed of a mix of compute devices such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance, including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
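
As a rough illustration of what SHOC's lowest, microbenchmark level measures, the sketch below times a sustained memory copy and reports bandwidth. It is written as plain host C++ for brevity rather than OpenCL or CUDA, and none of the names come from SHOC's actual code; the device-side versions follow the same time-a-large-transfer pattern.

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Illustrative of SHOC's lowest level: measure sustained copy bandwidth
// by timing repeated large copies and dividing bytes moved by seconds.
int main() {
    const size_t bytes = 64ull << 20;  // 64 MiB per copy
    const int reps = 20;
    std::vector<char> src(bytes, 1), dst(bytes, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // dst[0] is printed so the compiler cannot elide the copies.
    std::printf("sustained copy bandwidth: %.2f GB/s (checksum %d)\n",
                (double)bytes * reps / secs / 1e9, (int)dst[0]);
}
```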


Conference on High Performance Computing (Supercomputing) | 2003

MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools

Philip C. Roth; Dorian C. Arnold; Barton P. Miller

We present MRNet, a software-based multicast/reduction network for building scalable performance and system administration tools. MRNet supports multiple simultaneous, asynchronous collective communication operations. MRNet is flexible, allowing tool builders to tailor its process network topology to suit their tool's requirements and the underlying system's capabilities. MRNet is extensible, allowing tool builders to incorporate custom data reductions to augment its collection of built-in reductions. We evaluated MRNet in a simple test tool and also integrated it into an existing, real-world performance tool with up to 512 tool back-ends. In the real-world tool, we used MRNet not only for multicast and simple data reductions but also with custom histogram and clock skew detection reductions. In our experiments, the MRNet-based tools showed significantly better performance than the tools without MRNet for average message latency and throughput, overall tool start-up latency, and performance data processing throughput.
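
The scalability argument rests on the tree shape: each internal process combines its children's data, so the front-end handles only O(fanout) messages and the reduction finishes in O(log N) levels. The toy sketch below models that structure in plain C++; it is not MRNet's API, just an illustration of a software reduction tree.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Toy model of a software reduction tree: each internal node combines the
// values of at most k children, so the root handles O(k) messages and the
// reduction completes in O(log_k N) levels instead of O(N) work at the root.
double reduce_tree(const std::vector<double>& leaves,
                   size_t lo, size_t hi, size_t k) {
    if (hi - lo <= k)  // few enough leaves: combine them directly
        return std::accumulate(leaves.begin() + lo, leaves.begin() + hi, 0.0);
    size_t chunk = (hi - lo + k - 1) / k;  // split the range among k children
    double sum = 0.0;
    for (size_t c = lo; c < hi; c += chunk)
        sum += reduce_tree(leaves, c, std::min(c + chunk, hi), k);
    return sum;
}

int main() {
    std::vector<double> backends(512, 1.0);  // e.g., 512 tool back-ends
    std::printf("reduced sum: %.1f\n",
                reduce_tree(backends, 0, backends.size(), 16));
}
```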


Computing in Science and Engineering | 2011

Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community

Jeffrey S. Vetter; Richard Glassbrook; Jack J. Dongarra; Karsten Schwan; Bruce Loftis; Stephen McNally; Jeremy S. Meredith; James H. Rogers; Philip C. Roth; Kyle Spafford; Sudhakar Yalamanchili

The Keeneland project's goal is to develop and deploy an innovative, GPU-based high-performance computing system for the NSF computational science community.


IEEE International Symposium on Workload Characterization | 2006

Characterization of Scientific Workloads on Systems with Multi-Core Processors

Sadaf R. Alam; Richard Frederick Barrett; Jeffery A. Kuehn; Philip C. Roth; Jeffrey S. Vetter

Multi-core processors are planned for virtually all next-generation HPC systems. In a preliminary evaluation of AMD Opteron Dual-Core processor systems, we investigated the scaling behavior of a set of micro-benchmarks, kernels, and applications. In addition, we evaluated a number of processor affinity techniques for managing memory placement on these multi-core systems. We discovered that an appropriate selection of MPI task and memory placement schemes can result in over 25% performance improvement for key scientific calculations. We collected detailed performance data for several large-scale scientific applications. Analyses of the application performance results confirmed our micro-benchmark and scaling results.
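
On Linux, one common mechanism behind placement experiments like these is pinning a process (for example, an MPI rank) to a specific core, so that memory it touches first is allocated near that core. A minimal sketch, assuming a Linux host with glibc; this is illustrative, not the paper's code:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // cpu_set_t and sched_setaffinity are glibc extensions
#endif
#include <sched.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int core = 0;  // in an MPI code this might be derived from the rank
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Pin this process to one core; memory it subsequently touches first is
    // typically allocated on that core's local NUMA node (first-touch policy).
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pinned pid %d to core %d\n", (int)getpid(), core);
}
```

Varying placement schemes of this kind, task-to-core and memory-to-node, is the sort of tuning the paper's affinity experiments explore.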


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Early evaluation of IBM BlueGene/P

Sadaf R. Alam; Richard Frederick Barrett; Michael H Bast; Mark R. Fahey; Jeffery A. Kuehn; Collin McCurdy; James H. Rogers; Philip C. Roth; Ramanan Sankaran; Jeffrey S. Vetter; Patrick H. Worley; Weikuan Yu

BlueGene/P (BG/P) is the second-generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double-precision, dual-pipe floating-point unit per core. These chips are connected with multiple interconnection networks, including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications and compared to other major large-scale supercomputers in use today. Our investigation confirms that BG/P has good scalability, with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating-point operation for certain kernels, yet it has less of a power advantage when considering science-driven metrics for mission applications.
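
For reference, the chip's peak floating-point rate follows directly from the figures above, since a dual-pipe fused multiply-add unit delivers 4 flops per cycle per core:

```latex
\text{peak per chip} = 4~\text{cores} \times 0.85~\text{GHz} \times 4~\tfrac{\text{flops}}{\text{cycle}} = 13.6~\text{GFLOP/s}
```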


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

On-line automated performance diagnosis on thousands of processes

Philip C. Roth; Barton P. Miller

Performance analysis tools are critical for the effective use of large parallel computing resources, but existing tools have failed to address three problems that limit their scalability: (1) management and processing of the volume of performance data generated when monitoring a large number of application processes, (2) communication between a large number of tool components, and (3) presentation of performance data and analysis results for applications with a large number of processes. In this paper, we present a novel approach for finding performance problems in applications with a large number of processes that leverages our multicast and data aggregation infrastructure to address these three performance tool scalability barriers. First, we show how to design a scalable, distributed performance diagnosis facility. We demonstrate this design with an on-line, automated strategy for finding performance bottlenecks. Our strategy uses distributed, independent bottleneck search agents located in the tool agent processes that monitor running application processes. Second, we present a technique for constructing compact displays of the results of our bottleneck detection strategy. This technique, called the Sub-Graph Folding Algorithm, presents bottleneck search results using dynamic graphs that record the refinement of a bottleneck search. The complexity of the results graph is controlled by combining sub-graphs showing similar local application behavior into a composite sub-graph. Using an approach that combines these two synergistic parts, we performed bottleneck searches on programs with up to 1024 processes with no sign of tool resource saturation. With 1024 application processes, our visualization technique reduced a search results graph containing over 30,000 nodes to a single composite 44-node sub-graph showing the same qualitative performance information as the original graph.
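
In spirit, the folding step collapses processes whose local search results look alike into one composite entry, so display size tracks the number of distinct behaviors rather than the process count. A toy version of that grouping step follows (the signature strings are invented for illustration; this is not the tool's code):

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Toy analogue of sub-graph folding: processes with the same local
// bottleneck signature are collapsed into one composite entry, so the
// displayed graph grows with the number of distinct behaviors, not
// with the number of processes.
int main() {
    // one signature per process, e.g., the refined bottleneck path found there
    std::vector<std::string> sig = {
        "CPU>main>solve", "CPU>main>solve", "IO>main>checkpoint",
        "CPU>main>solve", "IO>main>checkpoint"};

    std::map<std::string, std::vector<int>> folded;
    for (int rank = 0; rank < (int)sig.size(); ++rank)
        folded[sig[rank]].push_back(rank);

    for (const auto& [s, ranks] : folded)
        std::printf("%s: %zu processes\n", s.c_str(), ranks.size());
}
```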


IEEE International Conference on High Performance Computing, Data, and Analytics | 1997

Real-Time Statistical Clustering for Event Trace Reduction

Oleg Y. Nickolayev; Philip C. Roth; Daniel A. Reed

Event tracing provides the detailed data needed to understand the dynamics of interactions among application resource demands and system responses. However, capturing the large volume of dynamic performance data inherent in detailed tracing can perturb program execution and stress secondary storage systems. Moreover, it can overwhelm a user or performance analyst with potentially irrelevant data. Using the Pablo performance environment's support for real-time data analysis, we show that dynamic statistical data clustering can dramatically reduce the volume of captured performance data by identifying and recording event traces only from representative processors. In turn, this makes possible low-overhead, interactive visualization and performance tuning.
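
One way to picture the approach: treat each processor's recent metrics as a vector, cluster the vectors, and record full event traces only for one representative per cluster. A minimal nearest-to-centroid sketch, with invented example metrics (illustrative only, not the Pablo implementation):

```cpp
#include <cstdio>
#include <vector>

// Illustrative core of representative selection: given a cluster centroid
// over per-processor metric vectors, trace only the processor closest to
// the centroid instead of all N processors in the cluster.
int nearest_to(const std::vector<double>& centroid,
               const std::vector<std::vector<double>>& metrics) {
    int best = 0;
    double bestd = 1e300;
    for (int p = 0; p < (int)metrics.size(); ++p) {
        double d = 0.0;  // squared Euclidean distance to the centroid
        for (size_t i = 0; i < centroid.size(); ++i) {
            double diff = metrics[p][i] - centroid[i];
            d += diff * diff;
        }
        if (d < bestd) { bestd = d; best = p; }
    }
    return best;
}

int main() {
    // per-processor metrics, e.g., {message rate, I/O rate}
    std::vector<std::vector<double>> metrics = {
        {1.0, 0.1}, {1.1, 0.1}, {0.2, 2.0}, {0.9, 0.2}};
    std::vector<double> centroid = {1.0, 0.13};  // one cluster's centroid
    std::printf("trace representative: processor %d\n",
                nearest_to(centroid, metrics));
}
```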


Petascale Data Storage Workshop | 2009

Scalable I/O tracing and analysis

Karthik Vijayakumar; Frank Mueller; Xiaosong Ma; Philip C. Roth

As supercomputer performance approached and then surpassed the petaflop level, I/O performance has become a major performance bottleneck for many scientific applications. Several tools exist to collect I/O traces to assist in the analysis of I/O performance problems. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information. We propose a multi-level trace generator tool, ScalaIOTrace, that collects traces at several levels in the HPC I/O stack. ScalaIOTrace features aggressive trace compression that generates trace files of near-constant size for regular I/O patterns and orders of magnitude smaller files for less regular ones. This enables the collection of I/O and communication traces of applications running on thousands of processors. Our contributions also include automated trace analysis, which collects selected statistical information about I/O calls by parsing the compressed trace on the fly, and time-accurate replay of communication events with MPI-IO calls. We evaluated our approach with the Parallel Ocean Program (POP) climate simulation and the FLASH parallel I/O benchmark. POP uses NetCDF as its I/O library, while FLASH I/O uses the parallel HDF5 I/O library, which internally maps onto MPI-IO. We collected MPI-IO and low-level POSIX I/O traces to study application I/O behavior. Our results show constant-size trace files of only 145 KB, irrespective of the number of nodes, for the FLASH I/O benchmark, which exhibits regular I/O and communication patterns. For POP, we observe up to two orders of magnitude reduction in trace file sizes compared to flat traces. The statistical information gathered reveals insight into the number of I/O and communication calls issued in POP and FLASH I/O. Such concise traces are unprecedented for isolated I/O and combined I/O plus communication tracing.
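
The near-constant trace size for regular patterns comes from encoding repetitions rather than instances. The toy run-length encoder below (not ScalaIOTrace's actual format, which compresses more aggressively across call sites and ranks) shows why a regular loop of identical writes costs one entry instead of a million:

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Toy version of pattern compression for regular I/O traces: identical
// consecutive records collapse to (record, count), so a loop that issues
// the same write a million times costs one entry, not a million.
int main() {
    std::vector<std::string> events;
    for (int i = 0; i < 1000000; ++i)
        events.push_back("MPI_File_write_all(fh, buf, 4MB)");
    events.push_back("MPI_File_close(fh)");

    std::vector<std::pair<std::string, long>> rle;
    for (const auto& e : events) {
        if (!rle.empty() && rle.back().first == e) ++rle.back().second;
        else rle.emplace_back(e, 1);
    }
    std::printf("%zu events -> %zu compressed entries\n",
                events.size(), rle.size());
}
```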


Computing Frontiers | 2012

The tradeoffs of fused memory hierarchies in heterogeneous computing architectures

Kyle Spafford; Jeremy S. Meredith; Seyong Lee; Dong Li; Philip C. Roth; Jeffrey S. Vetter

With the rise of general-purpose computing on graphics processing units (GPGPU), the influence of consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high-performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete, GPUs.
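
The PCIe bottleneck referred to above is straightforward to quantify: a PCIe 2.0 x16 link of that era carries roughly 8 GB/s per direction, an order of magnitude below the on-board memory bandwidth of a contemporary discrete GPU, so staging data across the bus can dominate data-intensive kernels:

```latex
16~\text{lanes} \times 5~\tfrac{\text{Gbit}}{\text{s}} \times \tfrac{8}{10}~(\text{8b/10b encoding}) = 64~\tfrac{\text{Gbit}}{\text{s}} = 8~\tfrac{\text{GB}}{\text{s}}
```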


International Parallel and Distributed Processing Symposium | 2011

LACIO: A New Collective I/O Strategy for Parallel I/O Systems

Yong Chen; Xian-He Sun; Rajeev Thakur; Philip C. Roth; William Gropp

Parallel applications benefit considerably from the rapid advance of processor architectures and the massive computational capability now available, but their performance suffers from the large latency of I/O accesses. Poor I/O performance has been identified as a critical cause of the low sustained performance of parallel systems. Collective I/O is widely considered a critical solution: it exploits the correlation among I/O accesses from multiple processes of a parallel application and optimizes I/O performance. However, the conventional collective I/O strategy bases its optimization decisions on the logical file layout, to avoid multiple file system calls, and does not take the physical data layout into consideration. Yet it is the physical data layout that determines the actual I/O access locality and concurrency. In this study, we propose a new collective I/O strategy that is aware of the underlying physical data layout. We confirm that the new Layout-Aware Collective I/O (LACIO) effectively improves the performance of current parallel I/O systems with the help of noncontiguous file system calls. It holds promise for improving the I/O performance of parallel systems.
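
A minimal sketch of the layout-aware grouping, assuming a simple round-robin stripe layout (an assumption chosen for illustration; LACIO's actual layout model and implementation differ): map each logical offset to its physical server and batch requests per server, turning scattered logical accesses into per-server contiguous ones.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Illustrative layout-aware grouping: under an assumed round-robin stripe
// layout, offset -> server is (offset / stripe_size) % num_servers.
// Grouping requests by server before issuing them turns scattered logical
// accesses into per-server batches, the locality a layout-aware strategy
// can exploit.
int main() {
    const long stripe = 1 << 20;  // 1 MiB stripes (assumed layout)
    const int servers = 4;

    std::vector<long> offsets = {0, 1 << 20, 4 << 20, 5 << 20, 8 << 20};
    std::map<int, std::vector<long>> by_server;
    for (long off : offsets)
        by_server[(off / stripe) % servers].push_back(off);

    for (const auto& [srv, offs] : by_server) {
        std::printf("server %d:", srv);
        for (long o : offs) std::printf(" %ld", o);
        std::printf("\n");
    }
}
```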

Collaboration


Dive into Philip C. Roth's collaborations.

Top Co-Authors

Jeffrey S. Vetter, Oak Ridge National Laboratory
Yong Chen, Texas Tech University
Jeremy S. Meredith, Oak Ridge National Laboratory
Barton P. Miller, University of Wisconsin-Madison
Sadaf R. Alam, Oak Ridge National Laboratory
Patrick H. Worley, Oak Ridge National Laboratory
Kyle Spafford, Oak Ridge National Laboratory
Gabriel Marin, Oak Ridge National Laboratory
Mark R. Fahey, Oak Ridge National Laboratory