Publications


Featured research published by I-Hsin Chung.


Conference on High Performance Computing (Supercomputing) | 2006

Topology mapping for Blue Gene/L supercomputer

Hao Yu; I-Hsin Chung; José E. Moreira

Mapping virtual processes onto physical processors is one of the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied extensively. Although many techniques have been proposed for embeddings of two-dimensional grids, hypercubes, etc., there have been few efforts on embeddings of three-dimensional grids and tori. Motivated by the need for better task-mapping support on the Blue Gene/L supercomputer, in this paper we present embedding and integration techniques for three-dimensional grids and tori. The topology mapping library based on these techniques generates high-quality embeddings of two- and three-dimensional grids and tori. In addition, the library is used in the BG/L MPI library for scalable support of MPI topology functions. With extensive empirical studies on large-scale systems against popular benchmarks and real applications, we demonstrate that the library can significantly improve the communication performance and scalability of applications.
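The core idea of the paper — embedding a virtual process grid into a physical torus so that grid neighbors stay physically close — can be illustrated with a toy sketch. This is not the BG/L library; the folding heuristic and all function names here are invented for illustration. It embeds a 2D process grid into a 3D torus by folding one grid axis across two torus dimensions and measures the average hop count between grid neighbors.

```python
# Toy sketch (not the BG/L topology mapping library): embed a 2D
# process grid into a 3D torus by folding, then measure the average
# nearest-neighbor hop count of the resulting placement.

def torus_hops(a, b, dims):
    """Hop count between coordinates a and b on a torus of size dims."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def fold_2d_into_3d(px, py, dims):
    """Map each (i, j) of a px-by-py grid to 3D torus coordinates by
    folding the j axis across the torus Y and Z dimensions."""
    X, Y, Z = dims
    return {(i, j): (i % X, j % Y, (j // Y) % Z)
            for i in range(px) for j in range(py)}

def avg_neighbor_hops(px, py, mapping, dims):
    """Average torus distance over all nearest-neighbor grid links."""
    total, links = 0, 0
    for i in range(px):
        for j in range(py):
            for di, dj in ((1, 0), (0, 1)):   # right and down neighbors
                ni, nj = i + di, j + dj
                if ni < px and nj < py:
                    total += torus_hops(mapping[(i, j)], mapping[(ni, nj)], dims)
                    links += 1
    return total / links
```

On a 4x16 grid folded into a 4x4x4 torus, almost every grid neighbor lands one hop away; only the fold points cost two hops, which is the kind of locality a good embedding preserves.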


Conference on High Performance Computing (Supercomputing) | 2006

MPI performance analysis tools on Blue Gene/L

I-Hsin Chung; Robert Walkup; Hui-Fang Wen; Hao Yu

Applications on today's massively parallel supercomputers rely on performance analysis tools to guide them toward scalable performance on thousands of processors. However, conventional tools for parallel performance analysis have serious problems due to the large data volume that may be required. In this paper, we discuss the scalability issue for MPI performance analysis on Blue Gene/L, the world's fastest supercomputing platform. We present an experimental study of existing MPI performance tools that were ported to BG/L from other platforms. These tools can be classified into two categories: profiling tools that collect timing summaries, and tracing tools that collect a sequence of time-stamped events. Profiling tools produce small data volumes and can scale well, but tracing tools tend to scale poorly. The experimental study discusses the advantages and disadvantages of the tools in the two categories and will be helpful for future performance tool design.
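The distinction the paper draws between the two tool categories comes down to data volume. A minimal sketch, with invented event data rather than real MPI instrumentation: a "profiler" keeps only per-operation aggregates, while a "tracer" records every time-stamped event, so its output grows linearly with event count.

```python
# Illustrative sketch (not the ported BG/L tools): the same stream of
# MPI-like events summarized by a "profiler" versus kept whole by a
# "tracer". Each event is (operation, timestamp, duration).

from collections import defaultdict

def profile(events):
    """Profiling: per-operation call count and total time only."""
    summary = defaultdict(lambda: [0, 0.0])
    for op, t, dur in events:
        summary[op][0] += 1          # call count
        summary[op][1] += dur        # accumulated time
    return dict(summary)

def trace(events):
    """Tracing: keep every time-stamped event verbatim."""
    return list(events)

# Simulated stream: 10,000 sends of 1 microsecond each.
events = [("MPI_Send", i * 2e-6, 1e-6) for i in range(10000)]
prof = profile(events)
trc = trace(events)
```

The profile stays at one record per operation no matter how many events occur, while the trace holds all 10,000 — which is why, scaled to thousands of processors, tracing tools run into the data-volume wall the paper describes.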


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Parallel deep neural network training for big data on Blue Gene/Q

I-Hsin Chung; Tara N. Sainath; Bhuvana Ramabhadran; Michael Picheny; John A. Gunnels; Vernon Austel; Upendra Chaudhari; Brian Kingsbury

Deep Neural Networks (DNNs) have recently been shown to significantly outperform existing machine learning techniques in several pattern recognition tasks. DNNs are the state-of-the-art models used in image recognition, object detection, classification and tracking, and speech and language processing applications. The biggest drawback to DNNs has been the enormous cost in computation and time taken to train the parameters of the networks, often a tenfold increase relative to conventional technologies. Such training time costs can be mitigated by the application of parallel computing algorithms and architectures. However, these algorithms often run into difficulties because of inter-processor communication bottlenecks. In this paper, we describe how to enable parallel deep neural network training on the IBM Blue Gene/Q (BG/Q) computer system. Specifically, we explore DNN training using the data-parallel Hessian-free 2nd-order optimization algorithm. Such an algorithm is particularly well-suited to parallelization across a large set of loosely coupled processors. BG/Q, with its excellent inter-processor communication characteristics, is an ideal match for this type of algorithm. The paper discusses how issues regarding the programming model and data-dependent imbalances are addressed. Results on large-scale speech tasks show that the performance on BG/Q scales linearly up to 4,096 processes with no loss in accuracy. This allows us to train neural networks using billions of training examples in a few hours.
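The data-parallel pattern underlying the approach can be sketched in a few lines. This toy replaces the paper's Hessian-free 2nd-order optimizer with plain gradient descent on a scalar least-squares loss, purely to show the reduction structure: each "worker" computes a gradient over its data shard, and averaging the shard gradients reproduces the full-data gradient.

```python
# Toy sketch of data parallelism (not IBM's BG/Q implementation, and
# gradient descent stands in for the Hessian-free optimizer): workers
# compute local gradients on shards; a reduction averages them.

def grad_shard(w, shard):
    """d/dw of mean((w*x - y)^2) over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_grad(w, data, n_workers):
    """Split data across workers, compute local grads, then average.
    With equal shard sizes this equals the full-data gradient."""
    size = len(data) // n_workers
    shards = [data[i * size:(i + 1) * size] for i in range(n_workers)]
    return sum(grad_shard(w, s) for s in shards) / n_workers

data = [(x, 3.0 * x) for x in range(1, 9)]    # y = 3x, so optimum w = 3
w = 0.0
for _ in range(200):                           # plain gradient descent
    w -= 0.01 * data_parallel_grad(w, data, n_workers=4)
```

On a real machine the averaging step is an all-reduce over the interconnect, which is why the paper emphasizes BG/Q's inter-processor communication characteristics.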


Parallel Processing Letters | 2011

Hierarchical Mapping for HPC Applications

I-Hsin Chung; Che-Rung Lee; Jiazheng Zhou; Yeh-Ching Chung

As high performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the critical performance issues. Existing algorithms were usually designed to map applications with regular communication patterns. Their mapping criteria usually overlook the size of communicated messages, which is the primary factor in communication time. In addition, most of their time complexities are too high to process large-scale problems. In this paper, we present a hierarchical mapping algorithm (HMA), which is capable of mapping applications with irregular communication patterns. It first partitions tasks according to their run-time communication information. Tasks that communicate with each other more frequently are regarded as strongly connected. Based on their connectivity strength, the tasks are partitioned into super nodes using algorithms from spectral graph theory. The hierarchical partitioning reduces the mapping algorithm's complexity to achieve scalability. Finally, the run-time communication information is used again in fine tuning to explore better mappings. In our experiments, we show how the mapping algorithm helps reduce the point-to-point communication time for PDGEMM, a ScaLAPACK matrix multiplication kernel, by up to 20%, and for AMG2006, a tier-1 application of the Sequoia benchmark, by up to 7%.
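The first HMA step — grouping heavily-communicating tasks into super nodes — can be illustrated with a toy stand-in. The paper uses spectral graph partitioning; this sketch substitutes a much simpler union-find over edges whose communication volume exceeds a threshold, just to show what "super node" means. All names and the threshold rule are invented here.

```python
# Toy sketch of HMA's grouping idea (union-find stands in for the
# paper's spectral partitioning): tasks joined into a super node when
# their measured communication volume exceeds a threshold.

def super_nodes(n_tasks, comm, threshold):
    """comm: {(i, j): bytes exchanged between tasks i and j}.
    Returns the list of super nodes (groups of task ids)."""
    parent = list(range(n_tasks))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), vol in comm.items():
        if vol > threshold:            # strongly connected: merge
            parent[find(i)] = find(j)

    groups = {}
    for t in range(n_tasks):
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical run-time communication volumes for six tasks.
comm = {(0, 1): 900, (1, 2): 850, (3, 4): 700, (2, 3): 10, (4, 5): 5}
```

With a threshold of 100, tasks 0-1-2 and 3-4 collapse into super nodes while task 5 stays alone; mapping then proceeds on this much smaller graph, which is where the complexity reduction comes from.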


International Parallel and Distributed Processing Symposium | 2008

Early experiences in application level I/O tracing on Blue Gene systems

Seetharami R. Seelam; I-Hsin Chung; Ding-Yong Hong; Hui-Fang Wen; Hao Yu

On today's massively parallel processing (MPP) supercomputers, it is increasingly important to understand an application's I/O performance, both to guide scalable application development and to tune its performance. These two critical steps are often enabled by performance analysis tools that obtain performance data on thousands of processors in an MPP system. To this end, we present the design, implementation, and early experiences of an application-level I/O tracing library and the corresponding tool for analyzing and optimizing I/O performance on Blue Gene (BG) MPP systems. This effort was part of the IBM HPC Toolkit for BG systems. To our knowledge, this is the first comprehensive application-level I/O monitoring, playback, and optimization tool available on BG systems. Preliminary experiments on the popular NPB BTIO benchmark show that the tool is very useful for facilitating detailed I/O performance analysis.
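The essence of application-level I/O tracing is interposing on the application's file operations and recording a time-stamped event for each. A minimal sketch, assuming nothing about the actual library's interface: a wrapper class that logs every read and write with its size and duration.

```python
# Illustrative application-level I/O tracing sketch (not IBM's
# library): wrap a file object so every read/write becomes a
# time-stamped event that can later be analyzed or played back.

import io
import time

class TracedFile:
    def __init__(self, fileobj, log):
        self._f = fileobj
        self._log = log                # shared event list

    def read(self, n=-1):
        t0 = time.perf_counter()
        data = self._f.read(n)
        self._log.append(("read", len(data), time.perf_counter() - t0))
        return data

    def write(self, data):
        t0 = time.perf_counter()
        n = self._f.write(data)
        self._log.append(("write", n, time.perf_counter() - t0))
        return n

log = []
f = TracedFile(io.BytesIO(), log)      # in-memory stand-in for a file
f.write(b"x" * 4096)
f.write(b"y" * 4096)
```

The recorded event stream is exactly what enables the playback and optimization steps the paper mentions: it captures the application's I/O pattern without modifying the application itself.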


International Parallel and Distributed Processing Symposium | 2008

A framework for automated performance bottleneck detection

I-Hsin Chung; Guojing Cong; David J. Klepacki; Simone Sbaraglia; Seetharami R. Seelam; Hui-Fang Wen

In this paper, we present the architecture design and implementation of a framework for automated performance bottleneck detection. The framework analyzes the distribution of time spent in the application and discovers performance bottlenecks using given bottleneck definitions. The user can query the application's execution performance to identify performance problems. The design of the framework is flexible and extensible, so it can be tailored to the actual application execution environment and performance tuning requirements. To demonstrate the usefulness of the framework, we apply it to a practical DARPA application and show how it helps identify performance bottlenecks. The framework helps automate the performance tuning process and improves the user's productivity.
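The "given bottleneck definitions" idea can be made concrete with a small sketch. Everything here is hypothetical, not the framework's actual API: definitions are predicates over the normalized time-spent distribution, and the detector reports every definition that matches.

```python
# Hypothetical sketch of the framework's core idea: check a measured
# time-spent distribution against user-supplied bottleneck definitions
# (predicates) and report the ones that match.

def detect_bottlenecks(time_spent, definitions):
    """time_spent: {category: seconds}. definitions: {name: predicate
    over the normalized distribution}. Returns matching names."""
    total = sum(time_spent.values())
    dist = {k: v / total for k, v in time_spent.items()}
    return [name for name, rule in definitions.items() if rule(dist)]

# Example definitions (invented thresholds for illustration).
definitions = {
    "communication-bound": lambda d: d.get("mpi", 0) > 0.4,
    "io-bound":            lambda d: d.get("io", 0) > 0.3,
}
time_spent = {"compute": 30.0, "mpi": 55.0, "io": 15.0}
```

Because the definitions are plain predicates supplied by the user, new bottleneck types can be added without touching the detector, which mirrors the flexibility and extensibility the paper claims for its design.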


ACM Transactions on Modeling and Computer Simulation | 2013

Optimizing Pairwise Box Intersection Checking on GPUs for Large-Scale Simulations

Shih-Hsiang Lo; Che-Rung Lee; I-Hsin Chung; Yeh-Ching Chung

Box intersection checking is a common task used in many large-scale simulations. Traditional methods cannot provide fast box intersection checking on large-scale datasets. This article presents a parallel algorithm to perform Pairwise Box Intersection checking on Graphics processing units (PBIG). The PBIG algorithm consists of three phases: planning, mapping, and checking. The planning phase partitions the space into small cells, whose sizes are determined to optimize performance. The mapping phase maps the boxes into the cells. The checking phase examines box intersections within the same cell. Several performance optimizations, including load balancing, output data compression/encoding, and pipelined execution, are presented for the PBIG algorithm. The experimental results show that the PBIG algorithm can process large-scale datasets and outperforms three well-performing algorithms.
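The three-phase structure can be shown with a simplified CPU sketch. This is not the GPU implementation: it uses 2D axis-aligned boxes, a fixed cell size instead of the planning phase's tuned sizes, and omits the load balancing, compression, and pipelining optimizations — only the bin-then-check structure survives.

```python
# Simplified CPU sketch of the PBIG structure: bin boxes into uniform
# cells (mapping), then test pairs only within each cell (checking).
# Boxes are (x0, y0, x1, y1) axis-aligned rectangles.

from itertools import combinations

def cells_of(box, cell):
    """All grid cells a box overlaps."""
    x0, y0, x1, y1 = box
    for cx in range(int(x0 // cell), int(x1 // cell) + 1):
        for cy in range(int(y0 // cell), int(y1 // cell) + 1):
            yield (cx, cy)

def overlap(a, b):
    """Axis-aligned box intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def pbig(boxes, cell=10.0):
    grid = {}                          # mapping phase: box ids per cell
    for i, box in enumerate(boxes):
        for c in cells_of(box, cell):
            grid.setdefault(c, []).append(i)
    hits = set()                       # checking phase: pairs per cell
    for ids in grid.values():
        for i, j in combinations(ids, 2):
            if overlap(boxes[i], boxes[j]):
                hits.add((i, j))
    return sorted(hits)

boxes = [(0, 0, 5, 5), (3, 3, 8, 8), (20, 20, 25, 25), (24, 24, 30, 30)]
```

Binning turns the O(n²) all-pairs test into per-cell tests; on a GPU each cell's pair list is an independent unit of work, which is what makes the checking phase parallelize well.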


Quantitative Evaluation of Systems | 2007

A Productivity Centered Tools Framework for Application Performance Tuning

Hui-Fang Wen; Simone Sbaraglia; Seetharami R. Seelam; I-Hsin Chung; Guojing Cong; David J. Klepacki

Our productivity-centered performance tuning framework for HPC applications comprises three main components: (1) a versatile graphical user interface for source code, performance metrics, and performance data visualization and analysis; (2) a unique source code and binary instrumentation engine; and (3) an array of data collection facilities to gather performance data across various dimensions, including CPU, message passing, threads, memory, and I/O. We believe that the ability to decipher performance impacts at the source level, and the ability to probe the application with different tools at the same time at varying granularities while hiding the complications of binary instrumentation, leads to higher productivity for scientists in understanding and tuning the performance of the associated computing systems and applications.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Hierarchical Mapping for HPC Applications

I-Hsin Chung; Che-Rung Lee; Jiazheng Zhou; Yeh-Ching Chung

As high performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the critical performance issues. Existing algorithms were usually designed to map applications with regular communication patterns. Their mapping criteria usually overlook the size of communicated messages, which is the primary factor in communication time. In addition, most of their time complexities are too high to process large-scale problems. In this paper, we present a hierarchical mapping algorithm (HMA), which is capable of mapping applications with irregular communication patterns. It first partitions tasks according to their run-time communication information. Tasks that communicate with each other more frequently are regarded as strongly connected. Based on their connectivity strength, the tasks are partitioned into super nodes using algorithms from spectral graph theory. The hierarchical partitioning reduces the mapping algorithm's complexity to achieve scalability. Finally, the run-time communication information is used again in fine tuning to explore better mappings. In our experiments, we show how the mapping algorithm helps reduce the point-to-point communication time for PDGEMM, a ScaLAPACK matrix multiplication kernel, by up to 20%, and for AMG2006, a tier-1 application of the Sequoia benchmark, by up to 7%.


International Parallel and Distributed Processing Symposium | 2010

Masking I/O latency using application level I/O caching and prefetching on Blue Gene systems

Seetharami R. Seelam; I-Hsin Chung; John H. Bauer; Hui-Fang Wen

We present an application-level I/O caching and prefetching system that asynchronously hides the access latency experienced by HPC applications. Our user-controllable caching and prefetching system maintains a file-I/O cache in the application's user space, analyzes I/O access patterns, prefetches requests, and performs write-back of dirty data to storage asynchronously. As a result, the application does not pay the full I/O latency penalty of going to storage each time it needs data. We have implemented this caching and asynchronous access system on Blue Gene (BG/L and BG/P) systems. We present experimental results with the NAS BT, MADbench, and WRF benchmarks. The results on the BG/P system demonstrate that our method hides access latency, improves application I/O access time by as much as 100%, and improves WRF execution time by over 10%.
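The cache-plus-prefetch mechanism can be sketched with a toy read cache. This is not the BG library: the pattern detector here is the simplest possible (two consecutive block reads trigger a one-block-ahead prefetch), the prefetch is synchronous rather than asynchronous, and write-back is omitted. It shows only how sequential detection turns later demand reads into cache hits.

```python
# Toy sketch of the caching/prefetching idea (not IBM's BG system):
# block reads go through a user-space cache; on a detected sequential
# pattern the next block is fetched ahead, so the following read hits.
# In the real system the prefetch would be asynchronous, overlapping
# the storage access with application computation.

class PrefetchCache:
    def __init__(self, storage, block=4096):
        self.storage = storage         # callable: block index -> bytes
        self.block = block
        self.cache = {}
        self.last = None               # previously read block index
        self.fetches = 0               # trips to "storage"

    def _fetch(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.storage(idx)
            self.fetches += 1

    def read_block(self, idx):
        self._fetch(idx)
        data = self.cache[idx]
        if self.last is not None and idx == self.last + 1:
            self._fetch(idx + 1)       # sequential pattern: read ahead
        self.last = idx
        return data

def storage(idx):
    """Stand-in for the file system: deterministic block contents."""
    return bytes([idx % 256]) * 4096

c = PrefetchCache(storage)
for i in range(8):                     # sequential scan of 8 blocks
    c.read_block(i)
```

During the scan, blocks 2 through 7 are already resident when the application asks for them; in the asynchronous version those prefetches overlap computation, which is the latency-hiding effect the paper measures.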
