Hui-Fang Wen | Researchain

Publication

Featured research published by Hui-Fang Wen.

Conference on High Performance Computing (Supercomputing) | 2006

MPI performance analysis tools on Blue Gene/L

I-Hsin Chung; Robert Walkup; Hui-Fang Wen; Hao Yu

Applications on today's massively parallel supercomputers rely on performance analysis tools to guide them toward scalable performance on thousands of processors. However, conventional tools for parallel performance analysis have serious problems due to the large data volume that may be required. In this paper, we discuss the scalability issue for MPI performance analysis on Blue Gene/L, the world's fastest supercomputing platform. We present an experimental study of existing MPI performance tools that were ported to BG/L from other platforms. These tools can be classified into two categories: profiling tools that collect timing summaries, and tracing tools that collect a sequence of time-stamped events. Profiling tools produce small data volumes and can scale well, but tracing tools tend to scale poorly. The experimental study discusses the advantages and disadvantages of the tools in the two categories and should be helpful in the design of future performance tools.
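
The difference in data volume between the two categories can be illustrated with a minimal Python sketch. It is not taken from the paper: the `Profiler` and `Tracer` classes, the `simulate` helper, and the synthetic call timings are all hypothetical. The profiler keeps one running summary per call name, while the tracer stores every time-stamped event.

```python
from collections import defaultdict


class Profiler:
    """Hypothetical profiler: one running summary per call name."""

    def __init__(self):
        # name -> [call count, total seconds]
        self.summary = defaultdict(lambda: [0, 0.0])

    def record(self, name, start, end):
        entry = self.summary[name]
        entry[0] += 1
        entry[1] += end - start


class Tracer:
    """Hypothetical tracer: stores every time-stamped event."""

    def __init__(self):
        self.events = []

    def record(self, name, start, end):
        self.events.append((name, start, end))


def simulate(num_calls):
    """Feed the same synthetic call stream to both tools."""
    profiler, tracer = Profiler(), Tracer()
    for i in range(num_calls):
        name = "MPI_Send" if i % 2 == 0 else "MPI_Recv"
        start = float(i)
        end = start + 0.5  # made-up duration, for illustration only
        profiler.record(name, start, end)
        tracer.record(name, start, end)
    return profiler, tracer


profiler, tracer = simulate(1000)
# Profile size depends on the number of distinct call names;
# trace size depends on the number of calls.
print(len(profiler.summary), len(tracer.events))
```

In this toy model the profile stays at two entries however long the run is, whereas the trace grows by one record per call, which is the scaling contrast the abstract describes.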

International Parallel and Distributed Processing Symposium | 2008

Early experiences in application level I/O tracing on Blue Gene systems

Seetharami R. Seelam; I-Hsin Chung; Ding-Yong Hong; Hui-Fang Wen; Hao Yu

On today's massively parallel processing (MPP) supercomputers, it is increasingly important to understand the I/O performance of an application, both to guide scalable application development and to tune its performance. These two critical steps are often enabled by performance analysis tools that obtain performance data on thousands of processors in an MPP system. To this end, we present the design, implementation, and early experiences of an application level I/O tracing library and the corresponding tool for analyzing and optimizing I/O performance on Blue Gene (BG) MPP systems. This effort was part of the IBM UPC Toolkit for BG systems. To our knowledge, this is the first comprehensive application-level I/O monitoring, playback, and optimizing tool available on BG systems. The preliminary experiments on the popular NPB BTIO benchmark show that the tool is very useful in facilitating detailed I/O performance analysis.

International Parallel and Distributed Processing Symposium | 2008

A framework for automated performance bottleneck detection

I-Hsin Chung; Guojing Cong; David J. Klepacki; Simone Sbaraglia; Seetharami R. Seelam; Hui-Fang Wen

In this paper, we present the architecture design and implementation of a framework for automated performance bottleneck detection. The framework analyzes the time-spent distribution in the application and discovers the performance bottlenecks by using given bottleneck definitions. The user can query the application execution performance to identify performance problems. The design of the framework is flexible and extensible, so it can be tailored to the actual application execution environment and performance tuning requirements. To demonstrate the usefulness of the framework, we apply it to a practical DARPA application and show how it helps to identify performance bottlenecks. The framework helps to automate the performance tuning process and improve the user's productivity.
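
The idea of matching a time-spent distribution against given bottleneck definitions can be sketched in a few lines of Python. This is only an illustration under assumed names: `time_distribution`, `BOTTLENECK_RULES`, `detect`, the region labels, and the thresholds are hypothetical and do not reflect the framework's actual interface.

```python
def time_distribution(samples):
    """Convert per-region seconds into fractions of total time."""
    total = sum(samples.values())
    if total <= 0:
        return {region: 0.0 for region in samples}
    return {region: seconds / total for region, seconds in samples.items()}


# Hypothetical bottleneck definitions: each rule is a predicate over
# the time-spent distribution. The thresholds are arbitrary examples.
BOTTLENECK_RULES = {
    "communication_bound": lambda dist: dist.get("mpi", 0.0) > 0.4,
    "io_bound": lambda dist: dist.get("io", 0.0) > 0.3,
}


def detect(samples, rules=BOTTLENECK_RULES):
    """Return the names of all rules that match the distribution."""
    dist = time_distribution(samples)
    return sorted(name for name, rule in rules.items() if rule(dist))


# Made-up measurements in seconds, for illustration only.
samples = {"compute": 30.0, "mpi": 60.0, "io": 10.0}
print(detect(samples))
```

Keeping the definitions in a plain dictionary is one way to make the rule set extensible: a new bottleneck definition is a new entry, with no change to the detection loop.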

Quantitative Evaluation of Systems | 2007

A Productivity Centered Tools Framework for Application Performance Tuning

Hui-Fang Wen; Simone Sbaraglia; Seetharami R. Seelam; I-Hsin Chung; Guojing Cong; David J. Klepacki

Our productivity centered performance tuning framework for HPC applications comprises three main components: (1) a versatile source code, performance metrics, and performance data visualization and analysis graphical user interface, (2) a unique source code and binary instrumentation engine, and (3) an array of data collection facilities to gather performance data across various dimensions including CPU, message passing, threads, memory and I/O. We believe that the ability to decipher performance impacts at the source level and the ability to probe the application with different tools at the same time at varying granularities, while hiding the complications of binary instrumentation, lead to higher productivity of scientists in understanding and tuning the performance of associated computing systems and applications.

International Parallel and Distributed Processing Symposium | 2010

Masking I/O latency using application level I/O caching and prefetching on Blue Gene systems

Seetharami R. Seelam; I-Hsin Chung; John H. Bauer; Hui-Fang Wen

We present an application-level I/O caching, prefetching, and asynchronous access system to hide the access latency experienced by HPC applications. Our user-controllable caching and prefetching system maintains a file-IO cache in the user space of the application, analyzes the I/O access patterns, prefetches requests, and performs write-back of dirty data to storage asynchronously. As a result, each time the application needs data, it does not have to pay the full I/O latency penalty of going to storage to fetch it. We have implemented this caching and asynchronous access system on the Blue Gene (BG/L and BG/P) systems. We present experimental results with the NAS BT, MADbench, and WRF benchmarks. The results on the BG/P system demonstrate that our method hides access latency, improves application I/O access time by as much as 100%, and improves WRF execution time by over 10%.
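
The combination of a user-space cache, access-pattern analysis, prefetching, and deferred write-back can be sketched as follows. This is a simplified, synchronous Python model, not the system described in the paper: the `BlockCache` class, its block-number interface, and the "previous block plus one" sequential-access test are assumptions made for illustration, and a dictionary stands in for the storage system.

```python
class BlockCache:
    """Hypothetical user-space block cache with sequential prefetch."""

    def __init__(self, storage, prefetch_depth=1):
        self.storage = storage  # dict: block number -> bytes (stand-in for storage)
        self.cache = {}
        self.dirty = set()
        self.prefetch_depth = prefetch_depth
        self.last_block = None
        self.hits = 0
        self.misses = 0

    def _load(self, block):
        # Copy a block from storage into the cache if it exists there.
        if block not in self.cache and block in self.storage:
            self.cache[block] = self.storage[block]

    def read(self, block):
        if block in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self._load(block)
        # Pattern analysis (assumed heuristic): reading last_block + 1
        # counts as sequential access and triggers read-ahead.
        if self.last_block is not None and block == self.last_block + 1:
            for ahead in range(1, self.prefetch_depth + 1):
                self._load(block + ahead)
        self.last_block = block
        return self.cache.get(block)

    def write(self, block, data):
        # Writes land in the cache and are only marked dirty.
        self.cache[block] = data
        self.dirty.add(block)

    def flush(self):
        # Write-back of dirty blocks; a real system would do this asynchronously.
        for block in sorted(self.dirty):
            self.storage[block] = self.cache[block]
        self.dirty.clear()


storage = {i: bytes([i]) for i in range(8)}
cache = BlockCache(storage, prefetch_depth=1)
for block in range(8):
    cache.read(block)
print(cache.hits, cache.misses)
```

In this sequential scan only the first two reads go to storage on demand; once the pattern is recognized, later blocks are already cached when requested. The model has no eviction and no concurrency, both of which a real implementation would need.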

International Parallel and Distributed Processing Symposium | 2009

Application level I/O caching on Blue Gene/P systems

Seetharami R. Seelam; I-Hsin Chung; John H. Bauer; Hao Yu; Hui-Fang Wen

In this paper, we present an application level aggressive I/O caching and prefetching system to hide the I/O access latency experienced by out-of-core applications. Without application level prefetching and caching capability, users of I/O intensive applications need to rewrite them with asynchronous I/O calls or restructure their code with MPI-IO calls to use large scale system resources efficiently. Our proposed user-controllable aggressive caching and prefetching system maintains a file-IO cache in the user space of the application, analyzes the I/O access patterns, prefetches requests, and performs write-back of dirty data to storage asynchronously. As a result, each time the application needs data, it does not have to pay the full I/O latency penalty of going to storage to fetch it. We have implemented this aggressive caching and asynchronous prefetching on the Blue Gene/P (BGP) system. The preliminary experiment evaluates the caching performance using the WRF benchmark. The results on the BGP system demonstrate that our method improves application I/O throughput.

IEEE International Conference on High Performance Computing Data and Analytics | 2012

Application data prefetching on the IBM Blue Gene/Q supercomputer

I-Hsin Chung; Changhoan Kim; Hui-Fang Wen; Guojing Cong

Memory access latency is often a crucial performance limitation for high performance computing. Prefetching is one of the strategies used by system designers to bridge the processor-memory gap. This paper describes a new list prefetching feature introduced in the IBM Blue Gene/Q supercomputer. The list prefetcher records the L1 cache miss addresses and prefetches them in the next iteration. The evaluation shows that this list prefetching mechanism reduces data fetching time when L1 cache misses happen and improves performance for high performance computing applications with repeating nonuniform memory access patterns. Its performance is comparable to that of a classic stream prefetcher when properly configured.
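
The record-then-replay idea behind list prefetching can be illustrated with a small Python simulation. It is a toy model, not a description of the Blue Gene/Q hardware: the `LRUCache` class, the `run_iteration` function, the cache capacity, the prefetch depth, and the address pattern are all hypothetical, and every prefetched line is assumed to arrive before it is needed.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache model holding up to `capacity` addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def contains(self, addr):
        return addr in self.lines

    def touch(self, addr):
        # Insert addr or mark it most recently used, evicting the oldest if full.
        if addr in self.lines:
            self.lines.move_to_end(addr)
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)
            self.lines[addr] = True


def run_iteration(cache, pattern, miss_list=None, depth=2):
    """Replay `pattern`; return (demand misses, recorded miss addresses)."""
    misses, recorded, cursor = 0, [], 0
    for addr in pattern:
        if miss_list is not None:
            # Prefetch the next `depth` addresses from the recorded list.
            for upcoming in miss_list[cursor:cursor + depth]:
                cache.touch(upcoming)
        if not cache.contains(addr):
            misses += 1
            recorded.append(addr)
        cache.touch(addr)
        # Advance through the list when the access matches its next entry.
        if miss_list is not None and cursor < len(miss_list) and miss_list[cursor] == addr:
            cursor += 1
    return misses, recorded


# Irregular address pattern, larger than the cache, repeated every iteration.
pattern = [7, 42, 3, 19, 88, 5, 61, 30]

baseline = LRUCache(capacity=4)
run_iteration(baseline, pattern)
baseline_misses, _ = run_iteration(baseline, pattern)

cache = LRUCache(capacity=4)
first_misses, miss_list = run_iteration(cache, pattern)
prefetched_misses, _ = run_iteration(cache, pattern, miss_list=miss_list)

print(first_misses, baseline_misses, prefetched_misses)
```

Because the pattern has no fixed stride, a stride-based predictor would have nothing to lock onto in this model, whereas replaying the recorded miss list removes the demand misses in the second iteration.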

IEEE Transactions on Parallel and Distributed Systems | 2012

A Systematic Approach toward Automated Performance Analysis and Tuning

Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama

High productivity is critical in harnessing the power of high-performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in programming language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper, we propose a systematic approach toward automated performance analysis and tuning that we expect to improve the productivity of performance debugging significantly. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.

European Conference on Parallel Processing | 2009

A Holistic Approach towards Automated Performance Analysis and Tuning

Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama

High productivity to the end user is critical in harnessing the power of high performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper we propose a holistic approach towards automated performance analysis and tuning that we expect to greatly improve the productivity of performance debugging. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.

International Parallel and Distributed Processing Symposium | 2009

Towards a framework for automated performance tuning

Guojing Cong; Seetharami R. Seelam; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki

As part of the DARPA-sponsored High Productivity Computing Systems (HPCS) program, IBM is building petaflop supercomputers that will be fast, power-efficient, and easy to program. In addition to high performance, high productivity to the end user is another prominent goal. The challenge is to develop technologies that bridge the productivity gap: the gap between the hardware complexity and the software limitations. In addition to language, compiler, and runtime research, powerful and user-friendly performance tools are critical in debugging performance problems and tuning for maximum performance. Traditional tools have either focused on specific performance aspects (e.g., communication problems) or provided limited diagnostic capabilities, and using them alone usually does not accurately pinpoint performance problems. Even fewer tools attempt to provide solutions for the problems detected. In our study, we develop an open framework that unifies tools, compiler analysis, and expert knowledge to automatically analyze and tune the performance of an application. Preliminary results demonstrate the efficiency of our approach.
