Chien-Min Wang
Academia Sinica
Publication
Featured research published by Chien-Min Wang.
symposium on code generation and optimization | 2012
Ding-Yong Hong; Chun Chen Hsu; Pen Chung Yew; Jan Jan Wu; Wei-Chung Hsu; Pangfeng Liu; Chien-Min Wang; Yeh-Ching Chung
Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation, and security. However, several factors often impede its performance: (1) emulation overhead before translation; (2) translation and optimization overhead; and (3) translated code quality. For the dynamic binary translator itself, another issue is retargetability: supporting guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. In this work, we take advantage of ubiquitous multicore platforms and use a multithreaded approach to implement DBT. Running the translators and the dynamic binary optimizers on separate threads on different cores off-loads the DBT overhead from the target applications, so DBT can afford more sophisticated optimization techniques while still supporting retargetability. Using QEMU (a popular retargetable DBT for system virtualization) and LLVM (Low Level Virtual Machine) as our building blocks, we demonstrate with a multithreaded DBT prototype, called HQEMU, that it improves QEMU performance by factors of 2.4X and 4X on the SPEC 2006 integer and floating-point benchmarks, respectively, for x86 to x86-64 emulation; that is, it is only 2.5X and 2.1X slower than native execution of the same benchmarks on x86-64, as opposed to the 6X and 8.4X slowdowns of QEMU. For ARM to x86-64 emulation, HQEMU achieves a 2.4X speedup over QEMU on the SPEC 2006 integer benchmarks.
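The speedup figures in this abstract follow from the two sets of slowdowns relative to native execution. A minimal check is sketched below; the function name is an assumption made for illustration, and the numbers are the ones quoted in the abstract.

```python
def speedup_over_baseline(baseline_slowdown, improved_slowdown):
    """Speedup of the improved system over the baseline.

    Both arguments are slowdowns relative to native execution, so their
    ratio is how much faster the improved system runs than the baseline.
    """
    return baseline_slowdown / improved_slowdown


# QEMU vs. HQEMU slowdowns for x86 to x86-64 emulation, as quoted in the abstract.
int_speedup = speedup_over_baseline(6.0, 2.5)  # SPEC 2006 integer
fp_speedup = speedup_over_baseline(8.4, 2.1)   # SPEC 2006 floating point

print(round(int_speedup, 1), round(fp_speedup, 1))  # prints "2.4 4.0"
```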
Future Generation Computer Systems | 2010
Chien-Min Wang; Hsi-Min Chen; Chun-Chen Hsu; Jonathan Lee
A Grid system is comprised of large sets of heterogeneous and geographically distributed resources that are aggregated as a virtual computing platform for executing large-scale scientific applications. As the number of resources in Grids increases rapidly, selecting appropriate resources for jobs has become a crucial issue. To avoid single point of failure and server overload problems, bidding provides an alternative means of resource selection in distributed systems. However, under the bidding model, the key challenge of resource selection is that there is no global information system to facilitate optimum decision-making; hence requesters can only obtain partial information revealed by resource providers. To address this problem, we propose a set of resource selection heuristics to minimize the turnaround time in a non-reserved bidding-based Grid environment, while considering the level of information about competing jobs revealed by providers. We also present the results of experiments conducted to evaluate the performance of the proposed heuristics.
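The effect of partial information on bid selection can be illustrated with a small sketch. This is not one of the heuristics from the paper: the `Bid` fields, the `default_wait` parameter, and the rule "pick the smallest estimated turnaround" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    provider: str
    exec_time: float              # estimated run time of the job on this provider
    queued_work: Optional[float]  # run time of competing jobs ahead; None if not revealed


def estimated_turnaround(bid: Bid, default_wait: float = 0.0) -> float:
    """Waiting time plus execution time; unrevealed waits fall back to default_wait."""
    wait = bid.queued_work if bid.queued_work is not None else default_wait
    return wait + bid.exec_time


def select_provider(bids: List[Bid], default_wait: float = 0.0) -> str:
    """Hypothetical heuristic: choose the bid with the smallest estimated turnaround."""
    return min(bids, key=lambda b: estimated_turnaround(b, default_wait)).provider


bids = [Bid("A", 10.0, 30.0), Bid("B", 25.0, 5.0), Bid("C", 12.0, None)]
optimistic = select_provider(bids)                     # treats C's hidden queue as empty
cautious = select_provider(bids, default_wait=20.0)    # assumes C hides a 20-unit queue
```

With the optimistic default the requester picks provider C, whose queue is hidden; with the cautious default it picks B. The amount of information providers reveal therefore changes the decision, which is the difficulty the abstract describes.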
ieee virtual reality conference | 2003
Jiung-Yao Huang; Yi-chang Du; Chien-Min Wang
We identify an important issue that arises when supporting a large-scale networked virtual environment (NVE) with a server cluster. This issue is similar to the process migration issue in parallel computing, and we refer to it as the avatar migration problem. That is, when an avatar in an NVE moves from a region managed by one server to a region managed by a different server, the client site may perceive an abrupt screen change because the two servers manage different contents. This paper proposes equations to solve this problem and elaborates the proposed avatar migration mechanism with state diagrams. The implementation architecture is also given. Finally, we present experiments that demonstrate the efficiency of the proposed mechanism.
cluster computing and the grid | 2006
Chien-Min Wang; Chun-Chen Hsu; Hsi-Min Chen; Jan-Jan Wu
As the number of data-intensive applications increases in various domains, scientists need to save, retrieve, and analyze increasingly large datasets. The huge volume of data and the long latency of data transfer on the Internet make it very difficult to ensure high-performance access to data grids. Thus, data replication techniques have been widely adopted to solve the latency problem. In this paper, we propose an efficient data replication algorithm for multi-source data transfer, whereby a data replica can be assembled in parallel from multiple distributed data sources and adapted to the variability of network bandwidths. The experimental results show that the proposed algorithm can obtain more aggregated bandwidth, reduce connection overheads, and achieve superior load balance.
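One simple way to assemble a replica from several sources while adapting to unequal bandwidths is a pull-based scheme: each source fetches the next outstanding chunk as soon as it is free, so faster sources naturally carry more chunks. The simulation below is a sketch of that generic idea, not the algorithm proposed in the paper; the function name, the fixed chunk size, and the constant bandwidths are assumptions.

```python
import heapq


def schedule_chunks(num_chunks, chunk_size, bandwidths):
    """Simulate a pull-based multi-source transfer.

    Returns (chunks fetched per source, time at which the last chunk arrives).
    bandwidths[i] is the assumed constant bandwidth of source i.
    """
    # Min-heap of (time at which the source becomes free, source index).
    free_at = [(0.0, i) for i in range(len(bandwidths))]
    heapq.heapify(free_at)
    counts = [0] * len(bandwidths)
    finish = 0.0
    for _ in range(num_chunks):
        start, src = heapq.heappop(free_at)       # earliest available source
        done = start + chunk_size / bandwidths[src]
        counts[src] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, src))
    return counts, finish


# Source 0 is twice as fast as source 1.
counts, finish = schedule_chunks(num_chunks=6, chunk_size=10, bandwidths=[10, 5])
print(counts, finish)  # prints "[4, 2] 4.0"
```

Here the faster source delivers four of the six chunks and both sources finish at the same time, whereas an even three-and-three split would take 6.0 time units on the slower source.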
international conference on software engineering | 1998
Tyng-Ruey Chuang; Y. S. Kuo; Chien-Min Wang
We describe the design and implementation of a system architecture to support object introspection in C++. In this system, information is collected by parsing class declarations and is used to build a supporting environment for object introspection. Our approach is nonintrusive because it requires no change to the original class declarations and libraries, and it guarantees compatibility between objects before and after the addition of introspective capability. This is critical if one wants to integrate third-party class libraries, which are often supplied as black boxes and allow no modification, into highly dynamic applications. We show two applications: automatic I/O support for C++ objects, and an interactive exercise of dynamically loaded C++ class libraries.
IEEE Transactions on Parallel and Distributed Systems | 1992
Chien-Min Wang; Sheng-De Wang
An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but are also much faster than the existing optimal algorithm in the literature. The paper also discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations.
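To make the processor assignment problem concrete, the sketch below searches exhaustively for the best assignment under a simple cost model. Both the model and the search are assumptions for illustration: loop i with N_i iterations and p_i processors is taken to need ceil(N_i / p_i) rounds, the product of the p_i may not exceed the processor count, and the brute-force search stands in for the fast algorithms of the paper, which are not reproduced here.

```python
from math import ceil


def best_assignment(iters, total_procs):
    """Return (parallel time, processors per loop) minimizing prod(ceil(N_i / p_i)).

    iters[i] is the iteration count of the i-th loop in the nest (outermost first).
    The product of the chosen processor counts never exceeds total_procs.
    """
    best = (float("inf"), None)

    def search(level, procs_left, chosen, time_so_far):
        nonlocal best
        if level == len(iters):
            if time_so_far < best[0]:
                best = (time_so_far, tuple(chosen))
            return
        # More processors than iterations cannot help, so cap at iters[level].
        for p in range(1, min(procs_left, iters[level]) + 1):
            search(level + 1, procs_left // p, chosen + [p],
                   time_so_far * ceil(iters[level] / p))

    search(0, total_procs, [], 1)
    return best


print(best_assignment([3, 5], 6))  # prints "(3, (1, 5))"
```

For a 3-by-5 nest on six processors, giving five processors to the inner loop takes 3 time units under this model, while the intuitive 2-by-3 split takes 4. Even small instances show that the choice is not obvious.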
Journal of Parallel and Distributed Computing | 1996
Yeong-Sheng Chen; Sheng-De Wang; Chien-Min Wang
In this paper, an approach to tiling nested loops for maximizing parallelism is proposed. The proposed method aims at aggregating independent computations of a loop nest into rectangular blocks and maximizing the block sizes for maximizing parallelism. At first, all the independent computations that can be executed in the first time unit are identified. These computations are called the initially independent computations. Then it is shown that all of them can be collected as a union of rectangular blocks. So, based on these, the entire iteration space of the loops is partitioned into rectangular blocks for maximizing parallelism. The proposed method is formulated as systematic procedures which can easily be implemented in a parallelizing compiler. It is shown that when the wavefront transformation is combined with the proposed method, the loops can always be tiled so that the tile size is greater than one. In comparison with previous work on tiling, the proposed method is shown to have several advantages as summarized in the conclusions of this paper.
international conference on supercomputing | 1995
Chien-Min Wang; Chiu-Yu Ku
In massively parallel computing, broadcasting is widely used in a variety of applications. When the computation is distributed among powerful processors, communication overhead always limits the speedup. To reduce the influence of communication latency, a variety of routing methods have been discussed. In this paper, we propose an efficient routing method for broadcasting on all-port wormhole-routed hypercubes. By exploiting the distance-insensitivity of wormhole switching, this broadcasting algorithm reduces communication latency to ⌈n/⌊log2(n+1)⌋⌉ steps in an n-dimensional hypercube, which is much better than previous results and very close to the theoretical optimum, ⌈n/log2(n+1)⌉. Further, by adopting the Hamming code, a well-known algebraic approach in coding theory, the proposed algorithm is contention-free in each broadcasting step.
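The gap between the two step counts can be tabulated directly. The sketch below assumes the step count reads ⌈n/⌊log2(n+1)⌋⌉ and the optimum reads ⌈n/log2(n+1)⌉; the function names are illustrative. The optimum follows from a counting argument: in an all-port hypercube each informed node can reach at most n new nodes per step, so the number of informed nodes grows by a factor of at most n+1 per step and must reach 2^n.

```python
def steps_achieved(n):
    """ceil(n / floor(log2(n + 1))), computed with exact integer arithmetic."""
    k = (n + 1).bit_length() - 1  # floor(log2(n + 1))
    return -(-n // k)             # ceiling division


def steps_lower_bound(n):
    """Smallest t with (n + 1)**t >= 2**n, which equals ceil(n / log2(n + 1))."""
    t, informed = 0, 1
    while informed < 2 ** n:
        informed *= n + 1  # each informed node reaches at most n new nodes per step
        t += 1
    return t


for n in (3, 7, 8, 10, 15):
    print(n, steps_achieved(n), steps_lower_bound(n))
```

For n = 3, 7, and 15, where n + 1 is a power of two, the two counts coincide (2, 3, and 4 steps). For n = 10 the achieved count is 4 against a bound of 3.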
international conference on parallel processing | 2011
Chun Chen Hsu; Pangfeng Liu; Chien-Min Wang; Jan Jan Wu; Ding-Yong Hong; Pen Chung Yew; Wei-Chung Hsu
This paper presents an LLVM+QEMU (LnQ) framework for building high-performance and retargetable binary translators with existing compiler modules. Dynamic binary translation is just-in-time (JIT) compilation from binary code of a guest ISA to binary code of a host ISA. The quality of the translated code is critical to the performance of a dynamic binary translator, which translates code between different ISAs, so the translated code is often carefully hand-optimized. As a result, it takes tremendous implementation effort for software engineers to port an existing dynamic binary translator to a new host ISA. The goal of the LnQ framework is to enable building high-performance and retargetable dynamic binary translators with existing optimizers and code generation back ends. The LnQ framework consists of a translation module and an emulation engine. We design the translation module based on the LLVM compiler infrastructure and use QEMU as our emulation engine. We implement an x86-to-x86-64 dynamic binary translator with our LnQ framework to show that the framework is retargetable, and conduct experiments on the SPEC CPU2006 benchmarks to show that the resulting binary translator has good performance. The experimental results indicate that the x86-to-x86-64 LnQ translator achieves an average speedup over QEMU of 1.62X on the integer benchmarks and 3.02X on the floating-point benchmarks.
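The translate-once, reuse-many-times pattern behind dynamic binary translation can be shown with a toy example. Everything here is an assumption for illustration: the three-instruction guest "ISA", the function names, and the use of a Python closure as stand-in "host code". It does not reflect how LnQ, LLVM, or QEMU are implemented.

```python
def translate_block(program, pc):
    """'Translate' the guest basic block starting at pc into a host (Python) function.

    A block ends at the first branch ("jnz") or "halt". The returned function
    updates the register dict and returns the next guest pc (None to stop).
    """
    ops = []
    while True:
        op = program[pc]
        ops.append(op)
        pc += 1
        if op[0] in ("jnz", "halt"):
            break
    fallthrough = pc

    def host_code(regs):
        for op in ops:
            if op[0] == "add":
                regs[op[1]] += op[2]
            elif op[0] == "jnz":
                return op[2] if regs[op[1]] != 0 else fallthrough
            elif op[0] == "halt":
                return None

    return host_code


def run(program, regs):
    """Dispatch loop: translate a block on a cache miss, otherwise reuse it."""
    code_cache = {}
    translations = 0
    pc = 0
    while pc is not None:
        if pc not in code_cache:
            code_cache[pc] = translate_block(program, pc)
            translations += 1
        pc = code_cache[pc](regs)
    return regs, translations


program = [
    ("add", "acc", 2),  # 0: acc += 2
    ("add", "n", -1),   # 1: n -= 1
    ("jnz", "n", 0),    # 2: loop back to 0 while n != 0
    ("halt",),          # 3
]
regs, translations = run(program, {"acc": 0, "n": 5})
print(regs, translations)  # prints "{'acc': 10, 'n': 0} 2"
```

The loop body executes five times but is translated only once. Because translated blocks are reused in this way, the quality of the translated code can outweigh the one-time cost of translating it.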
acm symposium on applied computing | 2008
Po-Chi Shih; Hsi-Min Chen; Yeh-Ching Chung; Chien-Min Wang; Ruay-Shiung Chang; Ching-Hsien Hsu; Kuo-Chan Huang; Chao-Tung Yang
Taiwan UniGrid (Taiwan University Grid) is a Grid computing platform founded by a community of educational and research organizations in Taiwan that are interested in Grid computing technologies. In this paper, we present the design and development of a middleware for Taiwan UniGrid. The Taiwan UniGrid middleware consists of three primary modules: 1) UniGrid Portal, 2) Computing Service, and 3) Data Service. We explain the major design issues we encountered during the development of these three modules and propose corresponding solutions. The detailed system architecture, software components, and features are elaborated. Finally, an example of a workflow consisting of MPI parallel jobs demonstrates that users can utilize Grid resources with ease via our middleware platform.