Ding-Yong Hong | Researchain


Publication

Featured research published by Ding-Yong Hong.

symposium on code generation and optimization | 2012

HQEMU: a multi-threaded and retargetable dynamic binary translator on multicores

Ding-Yong Hong; Chun Chen Hsu; Pen Chung Yew; Jan Jan Wu; Wei-Chung Hsu; Pangfeng Liu; Chien-Min Wang; Yeh-Ching Chung

Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation and security. However, several factors often impede its performance: (1) emulation overhead before translation; (2) translation and optimization overhead; and (3) translated code quality. A further issue for the dynamic binary translator itself is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of the ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translators and the dynamic binary optimizers on different threads on different cores off-loads the overhead that DBT imposes on the target applications, and thus lets DBT afford more sophisticated optimization techniques as well as support for retargetability. Using QEMU (a popular retargetable DBT for system virtualization) and LLVM (Low Level Virtual Machine) as our building blocks, we demonstrate with a multi-threaded DBT prototype, called HQEMU, that it can improve QEMU performance by a factor of 2.4X and 4X on the SPEC 2006 integer and floating point benchmarks, respectively, for x86 to x86-64 emulation; that is, it is only 2.5X and 2.1X slower than native execution of the same benchmarks on x86-64, as opposed to the 6X and 8.4X slowdowns of QEMU. For ARM to x86-64 emulation, HQEMU gains a 2.4X speedup over QEMU on the SPEC 2006 integer benchmarks.
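
The speedup factors quoted in the abstract above follow directly from the slowdowns relative to native execution. As a quick arithmetic check (the function name is ours, not from the paper), the ratio of the two slowdowns gives the speedup of HQEMU over QEMU:

```python
def speedup_over_baseline(baseline_slowdown, improved_slowdown):
    """Speedup of the improved system over the baseline.

    Both arguments are slowdowns relative to native execution
    (e.g. 6.0 means "6X slower than native").
    """
    return baseline_slowdown / improved_slowdown


# Slowdown figures quoted in the abstract (x86 to x86-64 emulation).
int_speedup = speedup_over_baseline(6.0, 2.5)   # SPEC 2006 integer
fp_speedup = speedup_over_baseline(8.4, 2.1)    # SPEC 2006 floating point

print(round(int_speedup, 2), round(fp_speedup, 2))
```

The two ratios agree with the 2.4X and 4X improvements reported for the integer and floating point benchmarks.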

international parallel and distributed processing symposium | 2008

Early experiences in application level I/O tracing on Blue Gene systems

Seetharami R. Seelam; I-Hsin Chung; Ding-Yong Hong; Hui-Fang Wen; Hao Yu

On today's massively parallel processing (MPP) supercomputers, it is increasingly important to understand the I/O performance of an application, both to guide scalable application development and to tune its performance. These two critical steps are often enabled by performance analysis tools that obtain performance data on thousands of processors in an MPP system. To this end, we present the design, implementation, and early experiences of an application-level I/O tracing library and the corresponding tool for analyzing and optimizing I/O performance on Blue Gene (BG) MPP systems. This effort was part of the IBM UPC Toolkit for BG systems. To our knowledge, this is the first comprehensive application-level I/O monitoring, playback, and optimizing tool available on BG systems. Preliminary experiments on the popular NPB BTIO benchmark show that the tool is very useful for facilitating detailed I/O performance analysis.

virtual execution environments | 2014

DBILL: an efficient and retargetable dynamic binary instrumentation framework using LLVM backend

Yi Hong Lyu; Ding-Yong Hong; Tai Yi Wu; Jan Jan Wu; Wei-Chung Hsu; Pangfeng Liu; Pen Chung Yew

Dynamic Binary Instrumentation (DBI) is a core technology for building debugging and profiling tools for application executables. Most state-of-the-art DBI systems have focused on the case where the guest binary and the host binary have the same instruction set architecture (ISA). Cross-ISA DBI systems, such as a system that instruments ARM executables to run on x86 machines, are uncommon. We believe cross-ISA DBI systems are increasingly important, since ARM executables could be more productively analyzed on x86-based machines such as commonly available PCs and servers. In this paper, we present DBILL, a cross-ISA and retargetable dynamic binary instrumentation framework that builds on both QEMU and LLVM. The DBILL framework enables LLVM-based static instrumentation tools to become DBI-ready and deployable to different target architectures. Using address sanitizer and memory sanitizer as implementation examples, we show that DBILL is an efficient, versatile and easy-to-use cross-ISA retargetable DBI framework.

international conference on parallel processing | 2011

LnQ: Building High Performance Dynamic Binary Translators with Existing Compiler Backends

Chun Chen Hsu; Pangfeng Liu; Chien-Min Wang; Jan Jan Wu; Ding-Yong Hong; Pen Chung Yew; Wei-Chung Hsu

This paper presents an LLVM+QEMU (LnQ) framework for building high performance and retargetable binary translators with existing compiler modules. Dynamic binary translation is a just-in-time (JIT) compilation from binary code of a guest ISA to binary code of a host ISA. The quality of the translated code is critical to the performance of a dynamic binary translator, which translates code between different ISAs, so the translated code is often carefully hand-optimized. As a result, it takes tremendous implementation effort for software engineers to port an existing dynamic binary translator to a new host ISA. The goal of the LnQ framework is to enable building high performance and retargetable dynamic binary translators with existing optimizers and code generation back ends. The LnQ framework consists of a translation module and an emulation engine. We design the translation module based on the LLVM compiler infrastructure, and use QEMU as our emulation engine. We implement an x86-to-x86-64 dynamic binary translator with our LnQ framework to show that the framework is retargetable, and conduct experiments on the SPEC CPU2006 benchmarks to show that the resulting binary translator has good performance. The experimental results indicate that the x86-to-x86-64 LnQ translator achieves an average speedup over QEMU of 1.62X on the integer benchmarks and 3.02X on the floating point benchmarks.

international conference on parallel and distributed systems | 2015

SIMD Code Translation in an Enhanced HQEMU

Sheng-Yu Fu; Ding-Yong Hong; Jan-Jan Wu; Pangfeng Liu; Wei-Chung Hsu

HQEMU is a multi-threaded and retargetable dynamic binary translator built on top of QEMU and LLVM. It combines the fast and reliable code translation of QEMU's TCG (Tiny Code Generator) with the rich optimizations in LLVM to achieve high performance for both short-running and long-running applications. One weakness of HQEMU is its lack of efficient SIMD instruction translation. This work investigates how to remedy that. Two approaches have been designed and tested. The simpler approach modifies the helper functions to emit LLVM vector IR; the more complete approach adds a newly introduced vector IR in the TCG phase. Although both approaches can exploit the SIMD instructions of the host machine, the second, more complete approach has better runtime performance as well as compile-time advantages.

IEEE Transactions on Parallel and Distributed Systems | 2014

Efficient and Retargetable Dynamic Binary Translation on Multicores

Ding-Yong Hong; Jan Jan Wu; Pen Chung Yew; Wei-Chung Hsu; Chun Chen Hsu; Pangfeng Liu; Chien-Min Wang; Yeh-Ching Chung

Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: 1) emulation overhead before translation; 2) translation and optimization overhead; and 3) translated code quality. A further issue is retargetability, that is, supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of the ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translator and the dynamic binary optimizer on different cores with different threads off-loads the overhead that DBT imposes on the target applications, and thus lets DBT afford more sophisticated optimization techniques as well as retargetability. Using QEMU (a popular retargetable DBT for system virtualization) and Low-Level Virtual Machine (LLVM) as our building blocks, we demonstrate with a multithreaded DBT prototype, called Hybrid-QEMU (HQEMU), that it can improve QEMU performance by a factor of 2.6x and 4.1x on the SPEC CPU2006 integer and floating point benchmarks, respectively, for dynamic translation of x86 code to run on x86-64 platforms. For ARM code on x86-64 platforms, HQEMU gains a 2.5x speedup over QEMU on the SPEC CPU2006 integer benchmarks. We also address the performance scalability of multithreaded applications across ISAs. We identify two major impediments to performance scalability in QEMU: 1) coarse-grained locks used to protect shared data structures, and 2) inefficient emulation of atomic instructions across ISAs. We propose two techniques to mitigate those problems: 1) using indirect branch translation caching (IBTC) to avoid frequent accesses to locks, and 2) using lightweight memory transactions to emulate atomic instructions across ISAs. Our experimental results show that for multithreaded applications, HQEMU achieves 25x speedups over QEMU on the PARSEC benchmarks.
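
To illustrate why an indirect branch translation cache reduces lock traffic, here is a minimal sketch. It is not HQEMU's implementation: the class names, the direct-mapped layout, the table size and the `>> 2` index (which assumes 4-byte-aligned guest instructions) are all assumptions made for illustration. Each emulation thread keeps a small private table that is checked first, so the shared, lock-protected code cache is consulted only on a miss.

```python
import threading


class GlobalCodeCache:
    """Shared guest-PC -> translated-code map guarded by one coarse lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}
        self.locked_lookups = 0  # how many lookups had to take the lock

    def insert(self, guest_pc, host_code):
        with self._lock:
            self._map[guest_pc] = host_code

    def lookup(self, guest_pc):
        with self._lock:
            self.locked_lookups += 1
            return self._map.get(guest_pc)


class IBTC:
    """Hypothetical per-thread, direct-mapped cache for indirect branch targets."""

    def __init__(self, global_cache, size=256):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self._global = global_cache
        self._mask = size - 1
        self._tags = [None] * size      # guest PC stored in each slot
        self._targets = [None] * size   # translated code for that guest PC

    def resolve(self, guest_pc):
        # Assumption: guest instructions are 4-byte aligned, so drop 2 low bits.
        slot = (guest_pc >> 2) & self._mask
        if self._tags[slot] == guest_pc:
            return self._targets[slot]          # hit: no lock taken
        host_code = self._global.lookup(guest_pc)  # miss: locked slow path
        if host_code is not None:
            self._tags[slot] = guest_pc
            self._targets[slot] = host_code
        return host_code


code_cache = GlobalCodeCache()
code_cache.insert(0x1000, "translated_block_A")
ibtc = IBTC(code_cache)
results = [ibtc.resolve(0x1000) for _ in range(3)]
```

In this sketch, only the first of the three lookups reaches the shared map; the other two are served from the thread-private table without touching the lock.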

virtual execution environments | 2013

Improving dynamic binary optimization through early-exit guided code region formation

Chun Chen Hsu; Pangfeng Liu; Jan Jan Wu; Pen Chung Yew; Ding-Yong Hong; Wei-Chung Hsu; Chien-Min Wang

Most dynamic binary translators (DBTs) and optimizers (DBOs) target binary traces, i.e. frequently executed paths, as code regions to be translated and optimized. Code region formation is the most important first step in all DBTs and DBOs. The quality of the dynamically formed code regions determines the extent and the types of optimization opportunities that can be exposed to DBTs and DBOs, and thus determines the ultimate quality of the final optimized code. The Next-Executing-Tail (NET) trace formation method used in HP Dynamo is an early example of such techniques. Many existing trace formation schemes are variants of NET. They work very well for most binary traces, but they also suffer from a major problem: the formed traces may contain a large number of early exits that can be taken during execution. If this happens frequently, the program will spend more time in the slow binary interpreter or in unoptimized code regions than in the optimized traces in the code cache. The benefit of trace optimization is thus lost. Traces/regions with frequently taken early exits are called delinquent traces/regions. Our empirical study shows that at least 8 of the 12 SPEC CPU2006 integer benchmarks have delinquent traces. In this paper, we propose a light-weight region formation technique called Early-Exit Guided Region Formation (EEG) to improve the quality of the formed traces/regions. It iteratively identifies and merges delinquent regions into larger code regions. We have implemented our EEG algorithm in two LLVM-based multi-threaded DBTs targeting the ARM and IA32 instruction set architectures (ISAs), respectively. Using the SPEC CPU2006 benchmark suite with reference inputs, our results show that, compared to the NET variant currently used in QEMU, a state-of-the-art retargetable DBT, EEG achieves a significant performance improvement of up to 72% (27% on average) for IA32 and up to 49% (23% on average) for ARM.
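
The idea of identifying delinquent regions and merging them can be sketched as follows. This is an illustrative toy, not the EEG algorithm from the paper: the `Region` record, the early-exit ratio threshold, the minimum entry count and the choice to merge with the hottest exit target are all hypothetical. A real system would keep profiling after each merge; here the counters are simply reset.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    blocks: set                  # guest basic-block addresses in the region
    entries: int = 0             # how many times the region was entered
    early_exits: dict = field(default_factory=dict)  # exit target -> count


def is_delinquent(region, threshold=0.5, min_entries=100):
    """A region is 'delinquent' here if most entries leave via an early exit."""
    if region.entries < min_entries:
        return False
    return sum(region.early_exits.values()) / region.entries > threshold


def merge_once(regions, threshold=0.5, min_entries=100):
    """Merge one delinquent region with the region at its hottest exit target.

    `regions` maps a region's head address to its Region. Returns True if a
    merge happened.
    """
    for head, region in regions.items():
        if not is_delinquent(region, threshold, min_entries):
            continue
        target = max(region.early_exits, key=region.early_exits.get)
        other = regions.get(target)
        if other is None or other is region:
            continue
        region.blocks |= other.blocks
        region.entries = 0           # restart profiling for the merged region
        region.early_exits = {}
        del regions[target]
        return True                  # stop iterating: the dict was modified
    return False


def form_regions(regions, threshold=0.5, min_entries=100):
    """Repeat merging until no delinquent region can be merged further."""
    while merge_once(regions, threshold, min_entries):
        pass
    return regions


regions = {
    0x100: Region({0x100, 0x104}, entries=1000,
                  early_exits={0x200: 700, 0x300: 50}),
    0x200: Region({0x200, 0x208}, entries=700),
}
form_regions(regions)
```

After the call, the region headed at `0x100` also covers the blocks of the former `0x200` region, so the frequently taken early exit now stays inside one larger region.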

international conference on parallel and distributed systems | 2016

Exploiting Longer SIMD Lanes in Dynamic Binary Translation

Ding-Yong Hong; Sheng-Yu Fu; Yu-Ping Liu; Jan-Jan Wu; Wei-Chung Hsu

Recent trends in SIMD architecture have tended toward longer vector lengths, and more enhanced SIMD features have been introduced in the newer vector instruction sets. However, legacy or proprietary applications compiled with a short-SIMD ISA cannot benefit from the long-SIMD architecture, which supports improved parallelism and enhanced vector primitives, and thus achieve only a small fraction of the potential peak performance. This paper presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the benefits of the new SIMD architecture by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to machine-independent IR, conducts SIMD loop transformation/optimization at this IR level, and finally translates to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment: one for the problem caused by the binary translator's internal translation condition, and one general approach using loop peeling optimization. The benchmark results show that an average speedup of 1.45X is achieved for NEON to AVX2 loop transformation.
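
Loop peeling for alignment, mentioned in the abstract above, can be illustrated with a small sketch. This is the generic textbook technique rather than the paper's implementation; the function name and the defaults (4-byte elements, 32-byte vectors as in 256-bit AVX2) are assumptions. A few scalar iterations are peeled off the front until the address is vector-aligned, the bulk runs as full vector iterations, and any leftover runs as a scalar remainder.

```python
def peel_counts(start_addr, n_elems, elem_size=4, vec_bytes=32):
    """Split a loop of n_elems iterations for aligned vector access.

    Returns (peeled, vector_iters, remainder):
      peeled       - scalar iterations executed before the first aligned address
      vector_iters - iterations that each process one full, aligned vector
      remainder    - scalar iterations left over at the end
    Assumes start_addr is a multiple of elem_size.
    """
    lanes = vec_bytes // elem_size
    misalign = start_addr % vec_bytes
    peeled = 0 if misalign == 0 else (vec_bytes - misalign) // elem_size
    peeled = min(peeled, n_elems)           # short loops may never reach alignment
    vector_iters = (n_elems - peeled) // lanes
    remainder = n_elems - peeled - vector_iters * lanes
    return peeled, vector_iters, remainder


# An array of 100 floats starting 8 bytes past a 32-byte boundary.
schedule = peel_counts(0x1008, 100)
```

For this example, six scalar iterations are peeled to reach the next 32-byte boundary, followed by eleven eight-lane vector iterations and six remaining scalar iterations.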

international conference on computer modelling and simulation | 2010

A Scalable HLA RTI System Based on Multiple-FedServ Architecture

Ding-Yong Hong; Fang Ping Pai; Shih Hsiang Lo; Yeh-Ching Chung

This article proposes a scalable, high-performance RTI (Runtime Infrastructure) system that implements a two-layer architecture to support large-scale simulation. The two-layer architecture, Multiple-FedServ, combines centralized and distributed approaches to managing a simulation. In the first layer, each FedServ is in charge of a number of federates; in the second layer, all the FedServs together form a simulation federation. This paper describes how messages are routed and how synchronization is performed in this two-layer architecture. An RTI system that is based on the Multiple-FedServ architecture and follows the IEEE 1516 standard is implemented. Performance evaluations of this RTI using standard HLA/RTI benchmarks are presented. We evaluate the latency, throughput and time advancement benchmarks with varying numbers of federates and FedServs. Issues such as message routing, multicasting and synchronization are specifically addressed in this article. Results show that the Multiple-FedServ architecture scales well.

international conference on parallel architectures and compilation techniques | 2017

Exploiting Asymmetric SIMD Register Configurations in ARM-to-x86 Dynamic Binary Translation

Yu-Ping Liu; Ding-Yong Hong; Jan-Jan Wu; Sheng-Yu Fu; Wei-Chung Hsu

Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., their number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD registers through binary translation raises the issue of asymmetric SIMD register configurations. To date, these issues have been overlooked. As a result, only a small fraction of the potential performance gain is realized, due to underutilization of the host's SIMD parallelism and register capacity. In this paper, we present a novel dynamic binary translation technique called spill-aware SLP (saSLP), which combines short ARMv8 NEON instructions and registers in the guest binary loops to fully utilize the x86 AVX host's parallelism as well as minimize register spilling. Our experimental results show that saSLP improves performance by 1.6X (2.3X) across a number of benchmarks, and reduces spilling by 97% (99%) for ARMv8 NEON to x86 AVX2 (AVX-512) translation.
