
Publications


Featured research published by Hung-Ching Chang.


IEEE Transactions on Parallel and Distributed Systems | 2010

PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications

Rong Ge; Xizhou Feng; Shuaiwen Song; Hung-Ching Chang; Dong Li; Kirk W. Cameron

Energy efficiency is a major concern in modern high-performance computing system design. In the past few years, there has been mounting evidence that power usage limits system scale and computing density, and thus, ultimately, system performance. However, despite the impact of power and energy on the computer systems community, few studies provide insight into where and how power is consumed on high-performance systems and applications. In previous work, we designed a framework called PowerPack that was the first tool to isolate the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and to correlate these measurements to application functions. In this work, we extend our framework to support systems with multicore, multiprocessor-based nodes, and then provide in-depth analyses of the energy consumption of parallel applications on clusters of these systems. These analyses include the impacts of chip multiprocessing on power and energy efficiency and its interaction with application execution. In addition, we use PowerPack to study the power dynamics and energy efficiency of dynamic voltage and frequency scaling (DVFS) techniques on clusters. Our experiments reveal conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance.
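As a back-of-the-envelope illustration of the abstract's DVFS point (this is not PowerPack itself, and all numbers are invented): energy is power multiplied by time, so lowering the CPU frequency only saves energy when runtime does not stretch proportionally, which is typically the case for memory-bound phases.

```python
# Hypothetical sketch of why DVFS scheduling can save energy.
# All power/time figures below are invented for illustration.
def energy_joules(power_watts, time_s):
    return power_watts * time_s

# CPU-bound phase: halving frequency nearly doubles runtime,
# so the lower power draw is offset and energy gets worse.
cpu_high = energy_joules(95.0, 10.0)   # e.g. 2.4 GHz
cpu_low  = energy_joules(55.0, 19.5)   # e.g. 1.2 GHz

# Memory-bound phase: runtime grows only slightly at the lower
# frequency, so the power savings dominate.
mem_high = energy_joules(80.0, 10.0)
mem_low  = energy_joules(50.0, 11.0)

print(cpu_high, cpu_low)   # 950.0 1072.5 -> scaling down wastes energy
print(mem_high, mem_low)   # 800.0 550.0  -> scaling down saves energy
```

This is the intuition behind "intelligent DVFS scheduling": scale down only during phases whose runtime is insensitive to frequency.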


Computer Science - Research and Development | 2012

Power-aware predictive models of hybrid (MPI/OpenMP) scientific applications on multicore systems

Charles W. Lively; Xingfu Wu; Valerie E. Taylor; Shirley Moore; Hung-Ching Chang; Chun-Yi Su; Kirk W. Cameron

Predictive models enable a better understanding of the performance characteristics of applications on multicore systems. Previous work has utilized performance counters in a system-centered approach to model power consumption for the system, CPU, and memory components. Often, these approaches use the same group of counters across different applications. In contrast, we develop application-centric models (based upon performance counters) for the runtime and power consumption of the system, CPU, and memory components. Our work analyzes four hybrid (MPI/OpenMP) applications: the NAS Parallel Multizone Benchmarks (BT-MZ, SP-MZ, LU-MZ) and a Gyrokinetic Toroidal Code, GTC. Our models show that cache utilization (L1/L2), branch instructions, TLB data misses, and system resource stalls affect the performance of each application and performance component differently. We show that the L2 total cache hits counter affects performance across all applications. The models are validated for the system and component power measurements with an error rate of less than 3%.
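A minimal sketch of the counter-based modeling idea, using synthetic data and invented coefficients (the paper's actual counter selection and model form are not reproduced here): component power is fit as a linear function of performance-counter rates via least squares.

```python
import numpy as np

# Hypothetical illustration: fit power as a linear function of
# performance-counter rates. Counter names and weights are invented.
rng = np.random.default_rng(0)
n = 200
X = rng.random((n, 3))             # columns: L2 hits, branch instr., TLB misses (normalized rates)
true_w = np.array([12.0, 5.0, 3.0])
power = 40.0 + X @ true_w          # 40 W baseline plus counter-dependent terms (noiseless)

A = np.column_stack([np.ones(n), X])           # add an intercept column
w, *_ = np.linalg.lstsq(A, power, rcond=None)  # least-squares fit
err = np.abs(A @ w - power).max() / power.mean()
print(w)   # recovers approximately [40, 12, 5, 3]
```

An application-centric approach, as the abstract describes, would fit a separate model (possibly with different counters) per application rather than reusing one counter set everywhere.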


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems

Charles W. Lively; Xingfu Wu; Valerie E. Taylor; Shirley Moore; Hung-Ching Chang; Kirk W. Cameron

Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy consumption and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, the experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP implementations of the NAS (NASA Advanced Supercomputing) BT (Block Tridiagonal) benchmark (strong scaling), a Lattice Boltzmann application (strong scaling), and a Gyrokinetic Toroidal Code, GTC (weak scaling), as well as central processing unit (CPU) frequency scaling. Experiments were conducted on a system instrumented to obtain power information; this system consists of eight nodes with four cores per node. The results indicate that, on 16 or fewer cores, whether the MPI-only or the hybrid implementation is best depends on the application. For the case of 32 cores, the results were consistent: the hybrid implementation resulted in less execution time and energy. With CPU frequency scaling, the best case for energy saving was not the best case for execution time.
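The abstract's final observation, that the frequency minimizing energy need not be the one minimizing execution time, can be made concrete with hypothetical measurements (all figures below are invented, not the paper's data):

```python
# Invented per-frequency measurements: time stretches as frequency
# drops, but average power drops faster, so total energy can improve.
runs = {
    # freq_GHz: (time_s, avg_power_W)
    2.4: (100.0, 95.0),
    2.0: (108.0, 78.0),
    1.6: (125.0, 62.0),
}
energy = {f: t * p for f, (t, p) in runs.items()}   # E = P * t

best_time = min(runs, key=lambda f: runs[f][0])
best_energy = min(energy, key=energy.get)
print(best_time, best_energy)   # 2.4 1.6: fastest and lowest-energy differ
```

Which point to pick then depends on the objective, e.g. minimizing energy under a time bound, or a combined metric such as energy-delay product.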


International Parallel and Distributed Processing Symposium | 2014

The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications

Bo Li; Hung-Ching Chang; Shuaiwen Leon Song; Chun-Yi Su; Timmy Meyer; John Mooring; Kirk W. Cameron

Accelerators are used in about 13% of the current Top500 list. Supercomputers leveraging accelerators grew by a factor of 2.2 in 2012 and are expected to completely dominate the Top500 by 2015. Though most of these deployments use NVIDIA GPGPU accelerators, Intel's Xeon Phi architecture will likely grow in popularity in the coming years. Unfortunately, there are few studies analyzing the performance and energy efficiency of systems leveraging the Intel Xeon Phi. We extend our systemic measurement methodology to isolate system power by component, including accelerators. We use this methodology to present a detailed study of the performance-energy tradeoffs of the Xeon Phi architecture. We demonstrate the portability of our approach by comparing our Xeon Phi results to the Intel multicore Sandy Bridge host processor and the NVIDIA Tesla GPU for a wide range of HPC applications. Our results help explain limitations in the power-performance scalability of HPC applications on the current Intel Xeon Phi architecture.


International Parallel and Distributed Processing Symposium | 2015

LUC: Limiting the Unintended Consequences of Power Scaling on Parallel Transaction-Oriented Workloads

Hung-Ching Chang; Bo Li; Godmar Back; Ali Raza Butt; Kirk W. Cameron

Following an exhaustive set of experiments, we identify slowdowns in I/O performance that occur when processor power and frequency are increased. Our initial analyses indicate slowdowns are more likely to occur, and are more acute, when the number of parallel I/O threads increases and the variability between runs is high. We use a microbenchmark-driven methodology to simplify isolation of the root causes of I/O performance loss. We classify the observed performance loss into two categories: file synchronization and file write delays. We introduce LUC, a runtime system to Limit the Unintended Consequences of power scaling and dynamically improve I/O performance. We demonstrate the effectiveness of the LUC system running on two platforms for two critical parallel transaction-oriented workloads: a mail server (varmail) and online transaction processing (OLTP).
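The paper's microbenchmark-driven methodology isolates delay classes such as file synchronization; a loose sketch of that style of microbenchmark (sizes, counts, and structure are my own, not the paper's) times a batch of small synchronized writes:

```python
import os
import tempfile
import time

# Hypothetical microbenchmark: time many small write+fsync operations,
# the file-synchronization pattern in which frequency-dependent
# slowdowns were reported. Parameters are arbitrary.
def timed_sync_writes(path, n_ops=50, chunk=4096):
    buf = b"x" * chunk
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_ops):
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())   # force data to stable storage
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    elapsed = timed_sync_writes(os.path.join(d, "bench.dat"))
    print(f"{elapsed:.4f} s for 50 fsync'd 4 KiB writes")
```

Repeating such a run at different CPU frequency settings, with many concurrent threads, is one way to expose the kind of I/O slowdown the abstract describes.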


Parallel Processing Letters | 2014

Extending PowerPack for Profiling and Analysis of High Performance Accelerator-Based Systems

Bo Li; Hung-Ching Chang; Shuaiwen Song; Chun-Yi Su; Timmy Meyer; John Mooring; Kirk W. Cameron

Accelerators offer a substantial efficiency increase for high-performance systems, providing speedups for computational applications that can leverage hardware support for highly parallel codes. However, the power use of some accelerators exceeds 200 watts at idle, which means their use at exascale would come with a significant increase in power at a time when we face a power ceiling of about 20 megawatts. Despite the growing domination of accelerator-based systems in the Top500 and Green500 lists of the fastest and most efficient supercomputers, there are few detailed studies comparing the power and energy use of common accelerators. In this work, we conduct detailed experimental studies of the power usage and distribution of Xeon-Phi-based systems in comparison to the NVIDIA Tesla and an Intel Sandy Bridge multicore host processor. In contrast to previous work, we focus on separating individual component power and correlating power use to code behavior. Our results help explain the causes of power-performance scalability for a set of HPC applications.
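PowerPack isolates per-component power with external instrumentation; as a rough software-only analogue (not the paper's method), Linux exposes cumulative package energy through the RAPL powercap interface. The sysfs path and its availability are platform assumptions, so the sketch degrades gracefully when the counter is absent:

```python
import pathlib

# Assumed RAPL powercap path on Intel Linux systems; reading it may
# require privileges, and it does not exist on other platforms.
RAPL = pathlib.Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def package_energy_uj():
    """Return cumulative package energy in microjoules, or None if RAPL is unavailable."""
    try:
        return int(RAPL.read_text())
    except (OSError, ValueError):
        return None

e = package_energy_uj()
print("RAPL unavailable" if e is None else f"{e} uJ since counter reset")
```

Sampling this counter before and after a code region gives package-level energy only; per-device breakdowns (disk, NIC, memory) of the kind the abstract describes still require external measurement.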


Computer Science - Research and Development | 2014

E-AMOM: an energy-aware modeling and optimization methodology for scientific applications

Charles W. Lively; Valerie E. Taylor; Xingfu Wu; Hung-Ching Chang; Chun-Yi Su; Kirk W. Cameron; Shirley Moore; Daniel Terpstra


International Parallel and Distributed Processing Symposium | 2008

System-level, thermal-aware, fully-loaded process scheduling

Dong Li; Hung-Ching Chang; Hari K. Pyla; Kirk W. Cameron


International Conference on Energy Aware Computing | 2011

Energy-aware computing for android platforms

Hung-Ching Chang; Abhishek Agrawal; Kirk W. Cameron


Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing | 2013

MuMMI: Multiple Metrics Modeling Infrastructure

Xingfu Wu; Charles W. Lively; Valerie E. Taylor; Hung-Ching Chang; Chun-Yi Su; Kirk W. Cameron; Shirley Moore; Daniel Terpstra; Vincent M. Weaver

Collaboration


Top Co-Authors

Shirley Moore
University of Texas at El Paso

Dong Li
Oak Ridge National Laboratory