Rong-Guey Chang | Researchain

Publication

Featured research published by Rong-Guey Chang.

International Symposium on VLSI Design, Automation and Test | 2006

Efficient Hardware/Software Partitioning Approach for Embedded Multiprocessor Systems

Tzong-Yen Lin; Yu-ting Hung; Rong-Guey Chang

This paper presents an efficient hardware/software partitioning approach that targets embedded multiprocessor systems with time, area, and power constraints. Our approach proceeds in two phases. In the partitioning phase, for an embedded system with p software components and q hardware components, recursive spectral bisection (RSB) is used to partition an application program into n well-balanced, connected blocks with low communication cost. These blocks are mapped onto software components, and tasks in software components are then moved to hardware components to meet the deadline constraint. In the scheduling phase, we derive an approach that reduces the system cost in terms of power and hardware area by exchanging tasks between software and hardware components. Experimental results show that our approach is effective for embedded multiprocessor systems.
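
One bisection step of the spectral partitioning named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the example task graph, the function names `fiedler_vector` and `spectral_bisect`, and the use of shifted power iteration to approximate the Fiedler vector are all assumptions made for the example. RSB would apply the same split recursively to each half until the desired number of blocks is reached.

```python
from math import sqrt

def fiedler_vector(adj, iters=500):
    """Approximate the Fiedler vector of an undirected graph.

    adj maps each node 0..n-1 to a list of neighbours (hypothetical input
    format). Power iteration runs on c*I - L, where L is the graph Laplacian,
    while projecting out the constant vector, so the iterate converges to the
    eigenvector of the second-smallest eigenvalue of L.
    """
    n = len(adj)
    deg = [len(adj[i]) for i in range(n)]
    c = 2 * max(deg) + 1  # eigenvalues of L are at most 2*max(deg), so c*I - L is positive definite
    v = [i - (n - 1) / 2 for i in range(n)]  # start vector orthogonal to all-ones
    for _ in range(iters):
        # w = (c*I - L) v, computed row by row
        w = [(c - deg[i]) * v[i] + sum(v[j] for j in adj[i]) for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # remove the constant component
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def spectral_bisect(adj):
    """Split the nodes into two equal-sized halves at the median Fiedler value."""
    v = fiedler_vector(adj)
    order = sorted(range(len(adj)), key=lambda i: v[i])
    half = len(order) // 2
    return set(order[:half]), set(order[half:])

# Hypothetical task graph: two tightly coupled triangles joined by one edge (2-3).
task_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
              3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
left, right = spectral_bisect(task_graph)
print(sorted(left), sorted(right))
```

The split cuts only the single edge between the two triangles, which is the low-communication-cost property the abstract refers to.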

ACM Transactions on Architecture and Code Optimization | 2010

DisIRer: Converting a retargetable compiler into a multiplatform binary translator

Yuan-Shin Hwang; Tzong-Yen Lin; Rong-Guey Chang

This article proposes an alternative yet effective way of constructing a multiplatform binary translator, by converting a retargetable compiler into a binary translator. The rationale is that a retargetable compiler usually parses source programs into an Intermediate Representation (IR), and then translates IR into object code of different targets after performing analysis and optimizations. Specifically, the mechanism of code generation for multiple platforms from IR is already in place, and the missing link of building a multiplatform binary translator is a tool to transform binary programs back into IR. In order to fill in this missing link, this article presents a tool, called the disIRer. Just as a translator from machine language to assembly language is called a disassembler, a tool that translates executable binary programs to IR is called here a disIRer. The unique feature of this approach is that the retargetability of the binary translator is inherited directly from the retargetable compiler. A prototype multiplatform binary translator has been implemented upon GCC (the GNU Compiler Collection). DisIRer first converts binary programs back into GCC IR (Intermediate Representation), and afterward the GCC backend translates the IR to target binary programs of specified platforms. Experimental results show that x86 binary programs can be translated by this technique into ARM and Alpha binaries with reasonable code density and quality.

Future Generation Computer Systems | 2014

Power-aware code scheduling assisted with power gating and DVS

Cheng-Yu Lee; Tzong-Yen Lin; Rong-Guey Chang

Traditionally, code scheduling is used to optimize the performance of an application, because it can rearrange the code to allow independent instructions to execute in parallel by exploiting instruction-level parallelism (ILP). According to our observations, it can also be applied to reduce power dissipation by taking advantage of the properties of existing low-power techniques. In this paper, we present power-aware code scheduling (PACS), a code scheduling technique integrated with power gating (PG) and dynamic voltage scaling (DVS) to reduce power consumption while executing an application. In other words, from the viewpoint of compilation optimization, PG and DVS can be applied simultaneously to a piece of code, and their impact can be enhanced by code scheduling to save further power. The results show that, compared with hardware power gating, the proposed PACS performs better by more than 33% and 41% in terms of energy delay product and energy delay^2 product for DSPStone and Mediabench.
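
The two metrics quoted in this abstract are standard: the energy delay product is E × D and the energy delay^2 product is E × D^2, where E is the energy consumed and D is the execution time; lower values are better. The sketch below shows how a percentage improvement over a baseline could be computed. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP = E * D (lower is better)."""
    return energy_j * delay_s

def energy_delay2_product(energy_j, delay_s):
    """ED^2P = E * D^2, which weights execution time more heavily than EDP."""
    return energy_j * delay_s ** 2

def improvement_percent(baseline, proposed):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# Hypothetical measurements (illustrative only, not results from the paper).
baseline_edp = energy_delay_product(2.0, 4.0)   # 2.0 J over 4.0 s
proposed_edp = energy_delay_product(1.5, 4.0)   # 1.5 J over 4.0 s
print(improvement_percent(baseline_edp, proposed_edp))
```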

The Journal of Supercomputing | 2015

A priority scheduling for TM pathologies

Chia-Jung Chen; Rong-Guey Chang

Developing a parallel program on chip multiprocessors (CMPs) is a critical and difficult issue. To overcome the synchronization obstacles of CMPs, transactional memory (TM) has been proposed as an alternative concurrency control mechanism to traditional lock synchronization. Unfortunately, TM has led to seven performance pathologies: DuelingUpgrades, FutileStall, StarvingWriter, StarvingElder, SerializedCommit, RestartConvoy, and FriendlyFire. Such pathologies degrade performance during the interaction between workload and system. Although this performance issue can be solved by hardware, a software solution remains elusive. This paper proposes a priority scheduling algorithm to remedy these performance pathologies. The proposed approach not only solves this issue but also achieves almost the same performance as hardware transactional memory systems.

Proceedings of the 21st European MPI Users' Group Meeting on | 2014

An Adaptive Zero-Copy Strategy for Ubiquitous High Performance Computing

Ting-Hsuan Chien; Chia-Jung Chen; Rong-Guey Chang

Using a heterogeneous platform has become a common approach to improving performance. This trend leads to critical problems of synchronization and superfluous memory copies, so a memory model can be built to overcome this performance bottleneck via data exchange. Consequently, keeping the benefits of service consolidation without losing performance and I/O efficiency has become a crucial issue. In this paper, we present an adaptive strategy to solve this issue efficiently. We first identify the critical overheads on Linux and then present a clever software zero-copy strategy to construct a prototype. Finally, we optimize the work and leverage hardware features to reduce the overhead of data copying. The results show that the proposed prototype is an adaptive approach to performing service consolidation.

Embedded and Ubiquitous Computing | 2010

Trading Conditional Execution for More Registers on ARM Processors

Huang-Jia Cheng; Yuan-Shin Hwang; Rong-Guey Chang; Cheng-Wei Chen

Conditional execution is an important ISA feature of the ARM series of processors. Every instruction can be made to execute conditionally, that is, it is treated as a NOP if the condition is not met. The advantage of conditional execution is that it can maintain high performance while reducing hardware complexity since it can avoid introducing pipeline bubbles even when no branch prediction units are needed. However, conditional execution takes up precious instruction space, as conditions are encoded into a 4-bit condition code selector on every 32-bit ARM instruction. Besides, only a small percentage of instructions is actually conditionalized in modern embedded applications, and conditional execution might not even lead to performance improvement on modern embedded processors. This paper proposes to trade conditional execution for more ISA registers on ARM processors, with the 4-bit condition field used to encode the extra registers. GCC has been ported to generate ARM code with the new instruction format, and experimental results show that performance can be improved by 6% on average for MediaBench II benchmarks when the number of ISA registers is extended from 16 to 32.
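
To make the encoding trade-off concrete, the sketch below decodes a classic 32-bit ARM data-processing instruction (condition in bits 31-28, Rn in bits 19-16, Rd in bits 15-12, Rm in bits 3-0) and then reinterprets the condition bits as extra register-number bits. The abstract does not give the paper's actual instruction format, so the layout in `decode_extended` (bits 31, 30, and 29 as the high bit of Rd, Rn, and Rm, with bit 28 unused) is purely a hypothetical assumption for illustration.

```python
def decode_standard(word):
    """Decode fields of a standard ARM data-processing instruction
    whose second operand is a plain register."""
    return {
        "cond": (word >> 28) & 0xF,  # 4-bit condition code selector
        "rn": (word >> 16) & 0xF,    # first source register
        "rd": (word >> 12) & 0xF,    # destination register
        "rm": word & 0xF,            # second source register
    }

def decode_extended(word):
    """HYPOTHETICAL layout, not the paper's format: the former condition
    bits 31, 30, 29 supply a fifth bit for rd, rn, rm, giving 32 registers."""
    rd_hi = (word >> 31) & 1
    rn_hi = (word >> 30) & 1
    rm_hi = (word >> 29) & 1
    return {
        "rn": (rn_hi << 4) | ((word >> 16) & 0xF),
        "rd": (rd_hi << 4) | ((word >> 12) & 0xF),
        "rm": (rm_hi << 4) | (word & 0xF),
    }

# ADD r0, r1, r2 with condition AL (always, 0b1110) in the standard encoding.
ADD_R0_R1_R2 = 0xE0810002
print(decode_standard(ADD_R0_R1_R2))
print(decode_extended(ADD_R0_R1_R2))
```

Under the hypothetical layout, the same bit pattern would name registers 16 to 18 instead of 0 to 2, which shows why code for such a format has to be generated by a modified compiler rather than reused from existing binaries.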

Embedded and Ubiquitous Computing | 2006

Optimizing code size for embedded real-time applications

Shao-Yang Wang; Chih-Yuan Chen; Rong-Guey Chang

This paper presents an efficient technique for code compression. In our work, a sequence of instructions that occurs repeatedly in an application is compressed to reduce its code size. During compression, each instruction is first divided into an operation part and a register part, and only the operation part is compressed. To reduce the run-time overhead, we propose an instruction prefetching mechanism to speed up decompression. Moreover, we devise some optimization techniques to improve the code size reduction and the performance, and show their impact. The experimental results show that our work can achieve a code size reduction of 33% on average and a low run-time decompression overhead for these benchmarks.
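
The idea of splitting each instruction and compressing only its operation part can be illustrated with a toy dictionary coder. This is a simplified sketch under stated assumptions, not the paper's scheme: instructions are modeled as hypothetical `(operation, registers)` tuples, the dictionary indexes single operation parts rather than repeated instruction sequences, and prefetching is not modeled.

```python
def compress(program):
    """Replace each operation part with an index into a dictionary of
    unique operation parts; register parts are stored unchanged."""
    dictionary = []   # unique operation parts, in first-seen order
    index_of = {}     # operation part -> dictionary index
    encoded = []
    for operation, registers in program:
        if operation not in index_of:
            index_of[operation] = len(dictionary)
            dictionary.append(operation)
        encoded.append((index_of[operation], registers))
    return dictionary, encoded

def decompress(dictionary, encoded):
    """Rebuild the original instruction list from dictionary and indices."""
    return [(dictionary[index], registers) for index, registers in encoded]

# Hypothetical instruction stream in which the same operations recur.
program = [
    ("ldr", ("r0", "r1")),
    ("add", ("r0", "r0", "r2")),
    ("ldr", ("r3", "r1")),
    ("add", ("r3", "r3", "r2")),
]
dictionary, encoded = compress(program)
print(dictionary, encoded)
```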

International Conference on Algorithms and Architectures for Parallel Processing | 2015

A Cyber Physical System with GPU for CNC Applications

Jen-Chieh Chang; Ting-Hsuan Chien; Rong-Guey Chang

In this paper, we parallelize the collision detection of five-axis machining as an example to show how to execute CNC applications on a Graphics Processing Unit (GPU). We first design and implement an efficient collision detection tool, including kinematic analysis of five-axis motions, the separating axis method for collision detection, and computer simulation for verification. The machine structure is modeled in STL format in CAD software. The input to the detection system is the G-code part program, which describes the tool motions that produce the part surface. The G-code is then partitioned and executed by our collision detection tool in parallel on the GPU. The system simulates the five-axis CNC motion for the tool trajectory and detects any collisions according to the input G-code. The results show that our method significantly improves computational efficiency compared with the conventional detection method.
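
The separating axis method mentioned in this abstract rests on one fact: two convex shapes do not intersect if and only if some axis exists on which their projections do not overlap. The sketch below applies it to 2D convex polygons for brevity; the paper's setting is 3D machine geometry on a GPU, so the 2D simplification, the sequential Python, and all function names here are assumptions made for illustration.

```python
def edge_normals(polygon):
    """Yield one perpendicular vector per polygon edge (candidate axes)."""
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        yield (y1 - y2, x2 - x1)

def project(polygon, axis):
    """Return the (min, max) interval of the polygon projected onto axis."""
    dots = [x * axis[0] + y * axis[1] for x, y in polygon]
    return min(dots), max(dots)

def convex_polygons_collide(a, b):
    """Separating axis test for two convex polygons given as vertex lists.
    Touching polygons are reported as colliding."""
    for axis in list(edge_normals(a)) + list(edge_normals(b)):
        a_min, a_max = project(a, axis)
        b_min, b_max = project(b, axis)
        if a_max < b_min or b_max < a_min:
            return False  # found a separating axis, so no collision
    return True  # no separating axis exists among the candidates

# Hypothetical shapes: two overlapping squares and one far away.
square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_b = [(1, 1), (3, 1), (3, 3), (1, 3)]
square_c = [(5, 5), (6, 5), (6, 6), (5, 6)]
print(convex_polygons_collide(square_a, square_b))
print(convex_polygons_collide(square_a, square_c))
```

Each pair test is independent of the others, which is what makes this kind of check a natural candidate for parallel execution.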

Archive | 2013

Low Power Compiler Optimization for Pipelining Scaling

Jen-Chieh Chang; Cheng-Yu Lee; Chia-Jung Chen; Rong-Guey Chang

Low power has played an increasingly important role in embedded systems. To save power, lowering voltage and frequency is straightforward and effective; therefore, dynamic voltage scaling (DVS) has become a prevalent low-power technique. However, DVS yields no further power saving once the voltage reaches its lower bound. Fortunately, a technique called dynamic pipeline scaling (DPS) can overcome this limitation by switching pipeline modes at the low-voltage level. Approaches proposed in previous work on DPS were based on hardware support; little has been done on this issue from the compiler's viewpoint. This paper presents a compile-time DPS optimization technique to reduce power dissipation. Information about an application is exploited to devise an analytical model that assesses the cost of enabling the DPS mechanism. As a consequence, we can determine when to switch between pipeline modes at compile time without causing significant run-time overhead. The experimental results show that our approach is effective in reducing energy consumption.

International Conference on Algorithms and Architectures for Parallel Processing | 2011

Compiler support for concurrency synchronization

Tzong-Yen Lin; Cheng-Yu Lee; Chia-Jung Chen; Rong-Guey Chang

How to write a parallel program is a critical issue for chip multiprocessors (CMPs). To overcome the communication and synchronization obstacles of CMPs, transactional memory (TM) has been proposed as an alternative concurrency control mechanism. Unfortunately, TM has led to seven performance pathologies: DuelingUpgrades, FutileStall, StarvingWriter, StarvingElder, SerializedCommit, RestartConvoy, and FriendlyFire. Such pathologies degrade performance during the interaction between workload and system. Although this performance issue can be solved by hardware, a software solution remains elusive. This paper proposes a priority scheduling algorithm to remedy these performance pathologies. The proposed approach not only solves this issue but also achieves higher performance than hardware transactional memory (HTM) systems on some benchmarks.
