Publication


Featured research published by Pen Chung Yew.


IEEE Transactions on Computers | 1987

Distributing Hot-Spot Addressing in Large-Scale Multiprocessors

Pen Chung Yew; Nian-Feng Tzeng; Duncan H. Lawrie

When a large number of processors try to access a common variable, referred to as hot-spot accesses in [6], not only can the resulting memory contention seriously degrade performance, but it can also cause tree saturation in the interconnection network, which blocks both hot and regular requests alike. It is shown in [6] that even if only a small percentage of all requests are to a hot-spot, these requests can cause very serious performance problems, and networks that do the necessary combining of requests are suggested to keep the interconnection network and memory contention from becoming a bottleneck.
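One classic combining rule of the kind the abstract alludes to is fetch-and-add combining (in the style of the NYU Ultracomputer): a switch merges concurrent fetch-and-add requests to the same hot-spot address into one, remembering enough to reconstruct every reply on the way back. A minimal single-switch sketch, with names of my own choosing rather than the paper's:

```python
# Sketch of fetch-and-add combining at one network switch.
# Requests fetch&add(x, a) and fetch&add(x, b) to the same address
# merge into fetch&add(x, a+b); the switch records each request's
# running offset so it can de-combine memory's single reply.

def combine(requests):
    """requests: list of (addr, increment). Returns per-address
    combined increments plus a wait buffer for de-combining."""
    combined = {}   # addr -> total increment sent to memory
    waits = {}      # addr -> running offsets, one per original request
    for addr, inc in requests:
        offset = combined.get(addr, 0)
        waits.setdefault(addr, []).append(offset)
        combined[addr] = offset + inc
    return combined, waits

def decombine(addr, old_value, waits):
    """Reconstruct the reply each original request would have seen,
    given the single old value returned by memory."""
    return [old_value + offset for offset in waits[addr]]

memory = {0x10: 100}
combined, waits = combine([(0x10, 1), (0x10, 1), (0x10, 1)])
old = memory[0x10]               # one request reaches memory, not three
memory[0x10] += combined[0x10]
replies = decombine(0x10, old, waits)
```

All three processors receive the reply they would have gotten under any serialization, while memory sees a single access.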


IEEE Transactions on Computers | 1999

The superthreaded processor architecture

Jenn Yuan Tsai; Jian Huang; Christoffer Amlo; David J. Lilja; Pen Chung Yew

The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism that is available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multiple granularities of parallelism that are available in general-purpose application programs. Unlike other CMAs that rely primarily on hardware for run-time dependence detection and speculation, the superthreaded processor combines compiler-directed thread-level speculation of control and data dependences with run-time data dependence verification hardware. This hybrid of a superscalar processor and a multiprocessor-on-a-chip can utilize many of the existing compiler techniques used in traditional parallelizing compilers developed for multiprocessors. Additional unique compiler techniques, such as the conversion of data speculation into control speculation, are also introduced to generate the superthreaded code and to enhance the parallelism between threads. A detailed execution-driven simulator is used to evaluate the performance potential of this new architecture. It is found that a superthreaded processor can achieve good performance on complex application programs through this close coupling of compile-time and run-time information.


International Conference on Parallel Architectures and Compilation Techniques | 1996

The superthreaded architecture: thread pipelining with run-time data dependence checking and control speculation

Jenn Yuan Tsai; Pen Chung Yew

This paper presents a new concurrent multiple-threaded architectural model, called superthreading, for exploiting thread-level parallelism on a processor. This architectural model adopts a thread pipelining execution model that allows threads with data dependences and control dependences to be executed in parallel. The basic idea of thread pipelining is to compute and forward recurrence data and possible dependent store addresses to the next thread as soon as possible, so the next thread can start execution and perform run-time data dependence checking. Thread pipelining also forces contiguous threads to perform their memory write-backs in order, which enables the compiler to fork threads with control speculation. With run-time support for data dependence checking and control speculation, the superthreaded architectural model can exploit loop-level parallelism from a broad range of applications.
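The run-time checking step above can be caricatured in a few lines: each thread forwards the set of addresses it may store to, and its successor compares its own load addresses against that set before trusting a speculative read. A sequential toy simulation (the function and data names are mine, not the paper's, and real hardware does this per memory operation, not per iteration):

```python
# Sequential caricature of thread pipelining with run-time data
# dependence checking. "Thread" i forwards its possible store
# addresses; thread i+1 flags any load that hits that set and must
# synchronize instead of reading speculatively. Write-backs happen
# in iteration order, as the thread pipelining model requires.

def pipeline(loads, stores, memory):
    """loads[i]/stores[i]: address lists for iteration i (toy data).
    Returns the iterations that detected a run-time dependence."""
    dependent = []
    forwarded = set()                    # predecessor's possible stores
    for i in range(len(loads)):
        if any(a in forwarded for a in loads[i]):
            dependent.append(i)          # must wait for iteration i-1
        for a in stores[i]:
            memory[a] = i                # in-order write-back
        forwarded = set(stores[i])       # forward to iteration i+1
    return dependent

mem = {}
deps = pipeline([[0], [1], [2]], [[1], [5], [9]], mem)
```

Here iteration 1 loads address 1, which iteration 0 forwarded as a possible store, so only that iteration pays the synchronization cost; the rest run fully overlapped.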


IEEE Transactions on Software Engineering | 1987

A Scheme to Enforce Data Dependence on Large Multiprocessor Systems

Chuan Qi Zhu; Pen Chung Yew

Enforcement of data dependence in parallel algorithms requires certain synchronization primitives. For simple data dependences, synchronization primitives like the Full/Empty bit in the HEP machine [5] can be very effective. However, if data dependences cannot be determined at compile time, or if they are very complicated, more efficient synchronization schemes and algorithms are needed.
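The Full/Empty-bit primitive mentioned above attaches a presence flag to a memory word: a synchronized read blocks until the word is full, and the write that fills it releases the reader. The HEP did this in hardware per word; a minimal software sketch with a condition variable:

```python
import threading

class FullEmptyCell:
    """One memory word with a Full/Empty presence bit.
    read_when_full() blocks until a writer has filled the cell."""
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write_and_fill(self, value):
        with self._cv:
            self._value = value
            self._full = True
            self._cv.notify_all()

    def read_when_full(self):
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            return self._value

cell = FullEmptyCell()
# A producer fills the cell shortly after the consumer starts waiting.
threading.Timer(0.05, cell.write_and_fill, args=(42,)).start()
result = cell.read_when_full()   # blocks until the write arrives
```

This enforces exactly the simple producer-consumer dependence the abstract describes; the paper's point is that compile-time-unknown or complicated dependences need richer schemes than one bit per word.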


IEEE Transactions on Parallel and Distributed Systems | 1990

An empirical study of Fortran programs for parallelizing compilers

Zhiyu Shen; Zhiyuan Li; Pen Chung Yew

Some results are reported from an empirical study of program characteristics that are important to parallelizing compiler writers, especially in the areas of data dependence analysis and program transformations. The state of the art in data dependence analysis and some parallel execution techniques are examined. The major findings are as follows. Many subscripts contain symbolic terms with unknown values; a few methods of determining their values at compile time are evaluated. Array references with coupled subscripts appear quite frequently; these subscripts must be handled simultaneously in a dependence test, rather than separately as in current test algorithms. Nonzero coefficients of loop indexes in most subscripts are found to be simple: they are either 1 or -1. This allows an exact real-valued test to be as accurate as an exact integer-valued test for one-dimensional or two-dimensional arrays. Dependences with uncertain distance are found to be rather common, and one of the main reasons is the frequent appearance of symbolic terms with unknown values.


International Conference on Software Engineering | 2007

POLUS: A POwerful Live Updating System

Haibo Chen; Jie Yu; Rong Chen; Binyu Zang; Pen Chung Yew

This paper presents POLUS, a software maintenance tool capable of iteratively evolving running software into newer versions. POLUS's primary goal is to increase the dependability of contemporary server software, which is frequently disrupted either by external attacks or by scheduled upgrades. To make POLUS both practical and powerful, we design and implement it to retain backward binary compatibility, support multithreaded software, and recover the already-tainted state of running software, while maintaining good usability and very low runtime overhead. To demonstrate the applicability of POLUS, we report our experience in using it to dynamically update three prevalent server applications: vsftpd, sshd and the Apache HTTP server. Performance measurements show that POLUS incurs negligible runtime overhead: less than 1% performance degradation (5% in one case). The time to apply an update is also minimal.


IEEE Transactions on Parallel and Distributed Systems | 1990

An efficient data dependence analysis for parallelizing compilers

Zhiyuan Li; Pen Chung Yew; Chuan Qi Zhu

A novel algorithm, called the lambda test, is presented for an efficient and accurate data dependence analysis of multidimensional array references. It extends the numerical methods to allow all dimensions of array references to be tested simultaneously. Hence, it combines the efficiency and the accuracy of both approaches. This algorithm has been implemented in Parafrase, a Fortran program parallelization restructurer developed at the University of Illinois at Urbana-Champaign. Some experimental results are presented to show its effectiveness.
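For flavor, here is the classical one-dimensional GCD test that such dependence analyses build on: references X[a*i + c1] and X[b*j + c2] can touch the same element only if gcd(a, b) divides c2 - c1. The lambda test itself goes further and handles all dimensions of a multidimensional reference simultaneously; this sketch is only the single-subscript baseline.

```python
from math import gcd

def gcd_test(a, c1, b, c2):
    """May X[a*i + c1] and X[b*j + c2] refer to the same element?
    A dependence requires integer i, j with a*i + c1 = b*j + c2,
    i.e. a*i - b*j = c2 - c1, which has integer solutions iff
    gcd(a, b) divides c2 - c1."""
    return (c2 - c1) % gcd(a, b) == 0

# X[2*i] vs X[2*j + 1]: even vs odd indices, provably independent.
independent = not gcd_test(2, 0, 2, 1)
# X[i] vs X[j + 3]: gcd is 1, so a dependence cannot be ruled out.
maybe_dependent = gcd_test(1, 0, 1, 3)
```

The second case illustrates why the GCD test alone is weak when coefficients are 1 or -1 (the common case reported in the empirical study above): it almost never disproves dependence there, motivating tests that also use loop bounds.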


Virtual Execution Environments | 2006

Live updating operating systems using virtualization

Haibo Chen; Rong Chen; Fengzhe Zhang; Binyu Zang; Pen Chung Yew

Many critical IT infrastructures require non-disruptive operation. However, the operating systems running on them are far from perfect, and patches and upgrades are frequently applied to close vulnerabilities, add new features and enhance performance. To mitigate the loss of availability, such operating systems need to provide features such as live update, through which patches and upgrades can be applied without having to stop and reboot the operating system. Unfortunately, most current live updating approaches cannot be easily applied to existing operating systems: some are tightly bound to specific design approaches (e.g. object-oriented); others can only be used under particular circumstances (e.g. quiescent states). In this paper, we propose using virtualization to provide the live update capability. The proposed approach allows a broad range of patches and upgrades to be applied at any time without requiring a quiescent state. Moreover, the approach is portable thanks to its OS-transparency and is suitable for inclusion in general virtualization systems. We present a working prototype, LUCOS, which supports live update of Linux running on the Xen virtual machine monitor. To demonstrate the applicability of our approach, we take real-life kernel patches from Linux kernel 2.6.10 to 2.6.11 and apply some of them on the fly. Performance measurements show that our implementation incurs negligible overhead: less than 1% performance degradation compared to Xen-Linux. The time to apply a patch is also minimal.


International Symposium on Microarchitecture | 2003

The performance of runtime data cache prefetching in a dynamic optimization system

Jiwei Lu; Howard Chen; Rao Fu; Wei-Chung Hsu; Bobbie Othmer; Pen Chung Yew; Dong-Yuan Chen

Traditional software-controlled data cache prefetching is often ineffective due to the lack of runtime cache-miss and miss-address information. To overcome this limitation, we implement runtime data cache prefetching in the dynamic optimization system ADORE (ADaptive Object code Reoptimization). Its performance has been compared with static software prefetching on the SPEC2000 benchmark suite. Runtime cache prefetching shows better performance. On an Itanium 2 based Linux workstation, it can increase performance by more than 20% over static prefetching on some benchmarks. For benchmarks that do not benefit from prefetching, the runtime optimization system adds only 1%-2% overhead. We have also collected cache miss profiles to guide static data cache prefetching in the ORC compiler. With that information the compiler can effectively avoid generating prefetches for loops that hit well in the data cache.
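Runtime prefetchers of this kind typically sample miss addresses, detect a dominant stride, and inject prefetches that run a few strides ahead of the miss stream. A hypothetical stride detector as a sketch (ADORE's actual mechanism samples Itanium performance-monitor registers and patches prefetch instructions into hot traces; all names here are mine):

```python
from collections import Counter

def detect_stride(miss_addresses):
    """Return the dominant delta between consecutive sampled miss
    addresses if it accounts for at least half the samples, else None."""
    deltas = [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    return stride if count * 2 >= len(deltas) else None

def prefetch_targets(current, stride, distance=4, degree=2):
    """Addresses to prefetch: `degree` lines starting `distance`
    strides ahead of the current miss, to cover memory latency."""
    return [current + stride * (distance + k) for k in range(degree)]

# A miss stream walking an array with a 64-byte (one cache line) stride:
misses = [0x1000, 0x1040, 0x1080, 0x10c0, 0x1100]
stride = detect_stride(misses)
targets = prefetch_targets(misses[-1], stride)
```

The `distance` and `degree` knobs mirror the usual prefetching trade-off: too close and the data arrives late, too far and it may be evicted before use.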


Conference on High Performance Computing (Supercomputing) | 1994

An efficient algorithm for the run-time parallelization of DOACROSS loops

Ding Kai Chen; Josep Torrellas; Pen Chung Yew

While automatic parallelization of loops usually relies on compile-time analysis of data dependences, for some loops the data dependences cannot be determined at compile time. An example is loops accessing arrays with subscripted subscripts. To parallelize these loops, it is necessary to perform run-time analysis. We present a new algorithm to parallelize these loops at run time. Our scheme handles any type of data dependence in the loop without requiring any special architectural support in the multiprocessor. Furthermore, compared to an older scheme with the same generality, our scheme significantly reduces the amount of processor communication required and increases the overlap among dependent iterations. We evaluate our algorithm with parameterized loops running on the 32-processor Cedar shared memory multiprocessor. The results show speedups over the serial code of up to 14 with the full overhead of run-time analysis and of up to 27 if part of the analysis is reused across loop invocations. Moreover, the algorithm outperforms the older scheme in nearly all cases, with the largest gains when the loop has many dependences.
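The textbook baseline for such run-time parallelization is the inspector/executor pattern: an inspector scans the subscript arrays and assigns each iteration to the earliest wavefront after every iteration it conflicts with, and the executor then runs each wavefront's iterations in parallel. A toy inspector, offered only as the classical baseline (the paper's algorithm improves on this kind of scheme by cutting communication and overlapping dependent iterations):

```python
def build_wavefronts(accesses):
    """accesses[i]: the array elements iteration i reads/writes
    (known only at run time, e.g. via subscripted subscripts).
    Iterations touching a common element are placed in distinct,
    ordered wavefronts; each wavefront can run in parallel."""
    last_wave = {}          # element -> latest wavefront touching it
    wavefronts = []
    for i, elems in enumerate(accesses):
        wave = max((last_wave.get(e, -1) for e in elems), default=-1) + 1
        if wave == len(wavefronts):
            wavefronts.append([])
        wavefronts[wave].append(i)
        for e in elems:
            last_wave[e] = wave
    return wavefronts

# Iterations 0 and 1 both touch element 3; iteration 2 is independent,
# so it joins iteration 0's wavefront.
waves = build_wavefronts([[3, 5], [3, 7], [8]])
```

This conservative scheduler serializes on any shared element; distinguishing reads from writes, as real schemes do, would expose more parallelism.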

Collaboration


Dive into Pen Chung Yew's collaborations.

Top Co-Authors

- Wei-Chung Hsu (National Taiwan University)
- Antonia Zhai (University of Minnesota)
- Chenggang Wu (Chinese Academy of Sciences)
- Jin Lin (University of Minnesota)
- Haibo Chen (Shanghai Jiao Tong University)
- Xiaobing Feng (Chinese Academy of Sciences)
- Zhenjiang Wang (Chinese Academy of Sciences)
- Binyu Zang (Shanghai Jiao Tong University)