Publication
Featured research published by David J. Klepacki.
IBM Systems Journal | 1995
Ramesh C. Agarwal; Bowen Alpern; Larry Carter; Fred G. Gustavson; David J. Klepacki; Rick Lawrence; Mohammad Zubair
Recently, researchers at NASA Ames have defined a set of computational benchmarks designed to measure the performance of parallel supercomputers. In this paper, we describe the parallel implementation of the five kernel benchmarks from this suite on the IBM SP2™, a scalable, distributed memory parallel computer. High-performance implementations of these kernels have been obtained by mapping the computation of these kernels to the underlying architecture of the SP2 machine. Performance results for the SP2 are compared with publicly available results for other high-performance computers.
International Parallel and Distributed Processing Symposium | 2008
I-Hsin Chung; Guojing Cong; David J. Klepacki; Simone Sbaraglia; Seetharami R. Seelam; Hui-Fang Wen
In this paper, we present the architecture design and implementation of a framework for automated performance bottleneck detection. The framework analyzes the time-spent distribution in the application and discovers performance bottlenecks by using given bottleneck definitions. The user can query the application execution performance to identify performance problems. The design of the framework is flexible and extensible, so it can be tailored to the actual application execution environment and performance tuning requirements. To demonstrate the usefulness of the framework, we apply it to a practical DARPA application and show how it helps to identify performance bottlenecks. The framework helps to automate the performance tuning process and improve users' productivity.
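The "bottleneck definitions" idea above can be pictured as predicates evaluated against a time-spent distribution. The following is a minimal, hypothetical sketch; the category names and threshold values are illustrative and not taken from the paper.

```python
# Hypothetical sketch of rule-based bottleneck detection over a
# time-spent distribution. Thresholds and categories are made up
# for illustration.

# Fraction of total execution time attributed to each activity,
# e.g. as collected by a profiler.
time_spent = {
    "computation": 0.40,
    "mpi_wait": 0.35,
    "io": 0.20,
    "other": 0.05,
}

# Each "bottleneck definition" is a predicate over the distribution,
# so new definitions can be added without changing the detector.
bottleneck_definitions = {
    "communication-bound": lambda t: t.get("mpi_wait", 0.0) > 0.25,
    "io-bound": lambda t: t.get("io", 0.0) > 0.30,
}

def detect_bottlenecks(distribution, definitions):
    """Return the names of all bottleneck definitions that match."""
    return [name for name, pred in definitions.items() if pred(distribution)]

print(detect_bottlenecks(time_spent, bottleneck_definitions))
# With the sample data above, only "communication-bound" fires.
```

Keeping the definitions as data rather than code is what makes such a framework extensible in the sense the abstract describes.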
Quantitative Evaluation of Systems | 2007
Hui-Fang Wen; Simone Sbaraglia; Seetharami R. Seelam; I-Hsin Chung; Guojing Cong; David J. Klepacki
Our productivity-centered performance tuning framework for HPC applications comprises three main components: (1) a versatile graphical user interface for visualizing and analyzing source code, performance metrics, and performance data, (2) a unique source code and binary instrumentation engine, and (3) an array of data collection facilities to gather performance data across various dimensions including CPU, message passing, threads, memory, and I/O. We believe that the ability to decipher performance impacts at the source level, and the ability to probe the application with different tools at the same time at varying granularities while hiding the complications of binary instrumentation, leads to higher productivity of scientists in understanding and tuning the performance of the associated computing systems and applications.
IEEE Transactions on Parallel and Distributed Systems | 2012
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
High productivity is critical in harnessing the power of high-performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in programming language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper, we propose a systematic approach toward automated performance analysis and tuning that we expect to improve the productivity of performance debugging significantly. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.
European Conference on Parallel Processing | 2009
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
High productivity to the end user is critical in harnessing the power of high performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper we propose a holistic approach towards automated performance analysis and tuning that we expect to greatly improve the productivity of performance debugging. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.
International Parallel and Distributed Processing Symposium | 2009
Guojing Cong; Seetharami R. Seelam; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki
As part of the DARPA-sponsored High Productivity Computing Systems (HPCS) program, IBM is building petaflop supercomputers that will be fast, power-efficient, and easy to program. In addition to high performance, high productivity for the end user is another prominent goal. The challenge is to develop technologies that bridge the productivity gap - the gap between the hardware complexity and the software limitations. In addition to language, compiler, and runtime research, powerful and user-friendly performance tools are critical in debugging performance problems and tuning for maximum performance. Traditional tools have either focused on specific performance aspects (e.g., communication problems) or provided limited diagnostic capabilities, and using them alone usually does not accurately pinpoint performance problems. Even fewer tools attempt to provide solutions for the problems detected. In our study, we develop an open framework that unifies tools, compiler analysis, and expert knowledge to automatically analyze and tune the performance of an application. Preliminary results demonstrate the efficiency of our approach.
High Performance Computing and Communications | 2008
Seetharami R. Seelam; I-Hsin Chung; Guojing Cong; Hui-Fang Wen; David J. Klepacki
It is critical to understand the workload characteristics and resource usage patterns of available applications to guide the design and development of hardware and software stacks for future machines. In this paper, we analyze the workload performance characteristics of three large-scale DARPA HPCS benchmarks, HYCOM, POP, and LBMHD, executing on IBM Power5+ processor machines. Our analysis focuses on CPU/memory performance using a cycles per instruction (CPI) model and on multiprocess communication performance using MPI traces. For each benchmark, we provide a high-level performance analysis followed by hot-spot analysis of the codes for selected input parameters. We then present a detailed workload performance characterization using the CPI model with data from a unique set of performance counters available on the Power5+ processor system. For communication, we describe the sources of load imbalance in the applications and identify potential impediments to scalability at large processor counts. We identify several sources of performance problems that are potential bottlenecks and discuss methods to ameliorate them.
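The CPI model mentioned above boils down to dividing cycle counts by completed instructions and decomposing the total into completion and stall components. Here is an illustrative sketch; the counter names and values are invented for the example and do not correspond to actual Power5+ counter mnemonics.

```python
# Illustrative CPI (cycles per instruction) computation from hardware
# counter totals. Counter names and values are hypothetical.

counters = {
    "cycles": 8_000_000_000,
    "instructions_completed": 4_000_000_000,
    "stall_cycles_memory": 2_400_000_000,  # stalls attributed to memory
    "stall_cycles_other": 800_000_000,     # all remaining stall cycles
}

# Total CPI: cycles spent per completed instruction.
cpi = counters["cycles"] / counters["instructions_completed"]

# Simple decomposition: normalize each stall category per instruction,
# and treat the remainder as the base (completion) component.
stall_memory_cpi = counters["stall_cycles_memory"] / counters["instructions_completed"]
stall_other_cpi = counters["stall_cycles_other"] / counters["instructions_completed"]
base_cpi = cpi - stall_memory_cpi - stall_other_cpi

print(f"CPI = {cpi:.2f} "
      f"(base {base_cpi:.2f}, memory stalls {stall_memory_cpi:.2f}, "
      f"other stalls {stall_other_cpi:.2f})")
```

A breakdown like this is what lets an analyst say, for instance, that an application is memory-bound rather than simply slow.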
Geophysics | 2003
David J. Klepacki
The current approach to handling very large compute-intensive and data-intensive applications is to deploy the cumulative effect of a large number of processors in parallel, either via a networked cluster of computers or via a more tightly coupled massively parallel system. As simulation models evolve in complexity together with the accelerated proliferation of modeling and imaging data, this approach to computing quickly becomes limited by issues of power consumption, cooling, and physical size, not to mention cost and performance. The challenge is to construct a dense air-cooled machine with high performance and low cost.
International Parallel and Distributed Processing Symposium | 2012
Guojing Cong; Hui-Fang Wen; I-Hsin Chung; David J. Klepacki; Hiroki Murata; Yasushi Negishi
Deploying an application onto a target platform for high performance oftentimes demands manual tuning by experts. As machine architectures grow increasingly complex, tuning becomes even more challenging and calls for systematic approaches. In our earlier work we presented a prototype that efficiently combines expert knowledge, static analysis, and runtime observation for bottleneck detection, and employs refactoring and compiler feedback for mitigation. In this study, we develop a software tool that facilitates fast searching for bottlenecks and effective mitigation of problems across the major dimensions of computing (e.g., computation, communication, and I/O). The impact of our approach is demonstrated by the tuning of the LBMHD code and a Poisson solver code, representing traditional scientific codes, and a graph analysis code in UPC, representing emerging programming paradigms. In the experiments, our framework detects intricate bottlenecks of memory access, I/O, and communication with a single run of the application. Moreover, the automated solution implementation yields significant overall performance improvement on the target platforms. The improvement for LBMHD is up to 45%, and the speedup for the UPC code is up to 5. These results suggest that our approach is a concrete step towards systematic tuning of high performance computing applications.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
To fully utilize the power of current high performance computing systems, high productivity for the end user is critical. It is a challenge to map an application to the target architecture efficiently. Tuning an application for high performance remains a daunting task, and frequently involves manual changes to the program. Recently, refactoring techniques have been proposed to rewrite or reorganize programs for various software engineering purposes. In our research we explore combining performance analysis with refactoring techniques for automated tuning, which we expect to greatly improve the productivity of application deployment. We seek to build a system that can apply the appropriate refactoring according to the bottleneck discovered. We demonstrate the effectiveness of this approach through the tuning of several scientific applications and kernels.
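The core idea running through these abstracts, mapping a discovered bottleneck to a stored tuning strategy, can be pictured as a lookup into an extensible database. The following is a minimal, hypothetical sketch; the bottleneck names and suggested refactorings are invented for illustration, and a real system would apply source rewrites rather than return text.

```python
# Hypothetical sketch of a "bottleneck -> tuning strategy" database.
# Entries here are just named suggestions; the point is that strategies
# are stored as data, so new ones can be added and reused later.

refactoring_db = {
    "poor-spatial-locality": "interchange loops to traverse arrays in storage order",
    "redundant-io": "buffer small writes and flush once per timestep",
    "load-imbalance": "switch static work distribution to dynamic scheduling",
}

def suggest_refactoring(bottleneck):
    """Look up a stored tuning strategy for a diagnosed bottleneck.
    Unknown bottlenecks fall back to manual analysis."""
    return refactoring_db.get(bottleneck, "no stored strategy; manual analysis needed")

print(suggest_refactoring("load-imbalance"))
```

Storing diagnosis-and-mitigation pairs this way is what allows a tuning strategy, once developed by an expert, to be reused automatically on future applications.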