Chia Heng Tu | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Chia Heng Tu is active.

Explore More

Publication

Featured researches published by Chia Heng Tu.

ieee international conference on high performance computing data and analytics | 2009

Machine learning-based prefetch optimization for data center applications

Shih-Wei Liao; Tzu Han Hung; Donald Nguyen; Chin-Yen Chou; Chia Heng Tu; Hucheng Zhou

Performance tuning for data centers is essential and complicated. It is important since a data center comprises thousands of machines and thus a single-digit performance improvement can significantly reduce cost and power consumption. Unfortunately, it is extremely difficult as data centers are dynamic environments where applications are frequently released and servers are continually upgraded. In this paper, we study the effectiveness of different processor prefetch configurations, which can greatly influence the performance of memory system and the overall data center. We observe a wide performance gap when comparing the worst and best configurations, from 1.4% to 75.1%, for 11 important data center applications. We then develop a tuning framework which attempts to predict the optimal configuration based on hardware performance counters. The framework achieves performance within 1% of the best performance of any single configuration for the same set of applications.

embedded and real-time computing systems and applications | 2009

Zero-Buffer Inter-core Process Communication Protocol for Heterogeneous Multi-core Platforms

Yu Hsien Lin; Chia Heng Tu; Chi-Sheng Shih; Shih-Hao Hung

Executing functional components in pipeline on heterogeneous multi-core platforms can greatly improve the parallelism but require great amount of data communication among processes and threads. Our studies showed that existing inter-process/thread communication protocols consist of many unnecessary memory copies and prolong the execution of the applications on heterogeneous multi-core platforms. NTU ICPC uses polling-base mail notification to unnecessary context switches, and designs a memory subsystem to manage the input and output data between the senders and receivers. The protocol was implemented and evaluated on heterogeneous multi-core platform for several use scenario including H.264 encoding process. The evaluation results show that the communication overhead on sender side is independent of the data size and that on receiver side is greatly shortened, compared to several inter-process Communication (IPC) protocols including mailbox, message queue, and shared memory. When encoding H.264 video clips, the encoding frame rates increase for more than 30%. Our experiments also showed that the communication overhead accounts 40% to 50% of total execution time in average for H.264 video decoding applications. In this paper, we present the design and implementation of zero-buffer inter-core process communication protocol, named NTU ICPC, to shorten communication overhead for pipeline executed applications on heterogeneous multi-core platforms.

ACM Transactions on Design Automation of Electronic Systems | 2014

Performance and power profiling for emulated Android systems

Chia Heng Tu; Hui Hsin Hsu; Jen Hao Chen; Chun Han Chen; Shih-Hao Hung

Simulation is a common approach for assisting system design and optimization. For system-wide optimization, energy and computational resources are often the two most critical issues. Monitoring the energy state of each hardware component and measuring the time spent in each state is needed for accurate energy and performance prediction. For software optimization, it is important to profile the energy and the time consumed by each software construct in a realistic operating environment with a proper workload. However, the conventional approaches of simulation often fail to produce satisfying data. First, building a cycle-accurate simulation environment for a complex system, such as an Android smartphone, is difficult and can take a long time. Second, a slow simulation can significantly alter the behavior of multithreaded, I/O-intensive applications and can affect the accuracy of profiles. Third, existing software-based profilers generally do not work on simulators, which makes it difficult for performance analysis of complicated software, for example, Java applications executed by the Dalvik VM in an Android system. To address these aforementioned problems, we proposed and prototyped a framework, called virtual performance analyzer (VPA). VPA takes advantage of an existing emulator or virtual machine monitor to reduce the complexity of building a simulator. VPA allows the user to selectively and incrementally integrate timing models and power models into the emulator with our carefully designed performance/power monitors, tracing facility, and profiling tools to evaluate and analyze the emulated system. The emulated system can perform at different levels of speed to help verify if the profile data are impacted by the emulation speed. Finally, VPA supports existing software-based profiles and enables non-intrusive tracing/profiling by minimizing the probe effect. Our experimental results show that the VPA framework allows users to quickly establish a performance/power evaluation environment and gather useful information to support system design and software optimization for Android smartphones.

computer and information technology | 2010

A Virtual Timing Device for Program Performance Analysis

Wen Chang Hsu; Shili Hao Hung; Chia Heng Tu

Functional virtual platforms have been popularly used to support system development without needing the actual hardware. While the emulation process is fast enough to model the behaviors of complex systems, performance assessment cannot be done accurately due to the lack of timing models for the simulated systems. To tackle the problem, we proposed a virtual timing device (VTD) for a functional virtual platform to advance simulated clock time based on the hardware/software events observed during the emulation process. As a case study, we implemented the VTD in QEMU, an open-source virtual platform, with a variety of timing algorithms offering trade-offs between the accuracy and speed of timing estimation. With a fast, but less accurate timing algorithm, quick performance analysis can be done on QEMU at approximately 67 million instruction per second and reported execution time for the MiBench with an average of 15.7% error. Highly accurate performance profiles can be obtained by elaborating the timing model, e.g. with the addition of cache simulation, at the cost of simulation speed.

asia and south pacific design automation conference | 2012

System-wide profiling and optimization with virtual machines

Shih-Hao Hung; Tei-Wei Kuo; Chi-Sheng Shih; Chia Heng Tu

Simulation is a common approach for assisting system design and optimization. For system-wide optimization, energy and computational resources are often the two most critical limitations. Modeling energy-states of each hardware component and time spent in each state is needed for accurate energy and performance prediction. Tracking software execution in a realistic operating environment with properly modeled input/output is key to accurate prediction. However, the conventional approaches can have difficulties in practice. First, for a complex system such as an Android smartphone, building a cycle-accurate simulation environment is no easy task. Secondly, for I/O-intensive applications, a slow simulation would significantly alter the application behavior and change its performance profile. Thirdly, conventional software profiling tools generally do not work on simulators, which makes it difficult for performance analysis of complicated software, e.g., Java applications executed by the Dalvik virtual machine. Recently, virtual machine technologies are widely used to emulate a variety of computer systems. While virtual machines do not model the hardware components in the emulated system, we can ease the effort of building a simulation environment by leveraging the infrastructure of virtual machines and adding performance and power models. Moreover, multiple sets of the performance and energy models can be selectively used to verify if the speed of the simulated system impacts the software behavior. Finally, performance monitoring facilities can be integrated to work with profiling tools. We believe this approach should help overcome the aforementioned difficulties. We have prototyped a framework and our case studies showed that the information provided by our tools are useful for software optimization and system design for Android smartphones.

embedded and real-time computing systems and applications | 2010

Designing and Implementing a Portable, Efficient Inter-core Communication Scheme for Embedded Multicore Platforms

Shih-Hao Hung; Wen Long Yang; Chia Heng Tu

In the recent years, multicore processor designs have become increasingly popular for embedded applications, but diversified inter-core communication mechanisms have led to the difficulties in software development, integration and migration. A unified, portable, and efficient inter-core communication mechanism would have helped reduce these difficulties significantly, but such a solution did not exist today. We proposed a scheme called MSG, which provides users with a set of essential message-passing programming interfaces adopted from MPI and MCAPI, including blocking and non-blocking point-to-point communications, one-sided communications, and collective operations. We experimented and evaluated our design methodology with the case study on the IBM CELL, a popular heterogeneous multicore platform. On the CELL platform, our MSG library fitted in the 256KB local memory on each individual processor core and outperformed two existing communication libraries, DaCS and CML. With a systematic approach, we showed how optimization could be done on the CELL platform to improve the performance of the MSG library. Hopefully, our experiences help the design and development of communication libraries for existing and future multicore platforms and embedded applications.

embedded and real-time computing systems and applications | 2008

New Tracing and Performance Analysis Techniques for Embedded Applications

Shih-Hao Hung; Shu Jheng Huang; Chia Heng Tu

Performance evaluation is key to many computer applications. Many techniques and profiling tools are available for measuring performance, but most of them depend on the hardware and the software on which they run. For a new platform, or a platform which is not popular, programmers usually suffer from few analysis tools, which has been a serious problem for application development on many embedded systems. Thus, a performance analysis tool with the software mechanism is quite important for developing embedded applications. This paper describes a software mechanism for analyzing program performance on a wide range of platforms via code instrumentation at the source level. We implement this mechanism in a pure software profiling toolkit, called Module tracer, which works with a public-domain tool, CIL, to carry out code instrumentation for C programs. The toolkit aids programmers in understanding the behavior of applications by generating and analyzing traces and identify potential performance problems.

Journal of Systems Architecture | 2011

A portable, efficient inter-core communication scheme for embedded multicore platforms

Shih-Hao Hung; Chia Heng Tu; Wen Long Yang

Multicore processor designs have become increasingly popular for embedded applications in recent years, but diversified inter-core communication mechanisms have led to the difficulties in software development, integration and migration. A unified, portable, and efficient inter-core communication mechanism would have helped reduce these difficulties significantly, but such a solution did not exist today. We proposed a scheme called MSG, which provides users with a set of essential message-passing programming interfaces adopted from MPI and MCAPI, including blocking and non-blocking point-to-point communications, one-sided communications, and collective operations. We experimented and evaluated our design methodology with case studies on two heterogeneous multicore platforms: IBM CELL and ITRI PAC DUO. On the CELL platform, our MSG library fitted in the 256KB local memory on each individual processor core and outperformed two existing communication libraries, DaCS and CML. On the second case study, we were able to port MSG onto the PAC DUO platform within two weeks upon receiving the platform. With a systematic approach, we showed how optimizations could be done to improve the performance of the MSG libraries. Hopefully, our experiences help the design and development of communication libraries for existing and future multicore platforms.

acm symposium on applied computing | 2015

Migratom.js: a JavaScript migration framework for distributed web computing and mobile devices

Tai Lun Tseng; Shih-Hao Hung; Chia Heng Tu

The emerging HTML5 technologies aim to enhance web apps with increased capabilities on mobile devices, as device-to-device computing becomes important in the future. To enable new application scenarios by making HTML5 execution environment dynamic and efficient, we propose a JavaScript framework Migratom.js, which manages task offloading and code migration with the flow-based programming paradigm. Migratom.js accelerates mobile web apps by offloading compute-intensive tasks to superior computing resources and enables the development of distributed HTML5 applications. This paper describes the design and implementation of Migratom.js and conducts case studies to evaluate the proposed framework. The results show that our framework is suitable for augmenting existing and emerging mobile applications.

asia and south pacific design automation conference | 2010

Trace-based performance analysis framework for heterogeneous multicore systems

Shih-Hao Hung; Chia Heng Tu; Thean Siew Soon

Performance evaluation is key to the optimization of computer applications on multicore systems. While many techniques and profiling tools are available for measuring performance on homogeneous multicore platforms, most of them depend on the hardware support from the vendors. For developing applications on heterogeneous multicore systems, very few analysis tools exist to help the developers. This paper describes a software-based trace collection and performance analysis framework that can be ported to a variety of platforms via code instrumentation at the source level. A pure software profiling toolkit, called ParallelTracer, were implemented based on ANTLR, an open source parser generator, to support this framework. In this paper, we present our framework and toolkit. We use the IBM Cell processor as a case study to demonstrate the capability of ParallelTrace. Our results show that ParallelTracer provided useful information for programmers to understand program behaviors and identify potential performance bottlenecks via graphical visualization. We also discuss the runtime overhead of ParallelTracer. With proper usage, the performance and code size overhead introduced by our toolkit are limited around 19% to 5% and 9%, respectively, for the benchmark program in the case study.

Explore More