Publication


Featured research published by Perry H. Wang.


programming language design and implementation | 2007

EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system

Perry H. Wang; Jamison D. Collins; Gautham N. Chinya; Hong Jiang; Xinmin Tian; Milind Girkar; Nick Y. Yang; Guei-Yuan Lueh; Hong Wang

Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general-purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power. We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.


programming language design and implementation | 2002

Post-pass binary adaptation for software-based speculative precomputation

Steve Shih-wei Liao; Perry H. Wang; Hong Wang; Gerolf F. Hoflehner; Daniel M. Lavery; John Paul Shen

Recently, a number of thread-based prefetching techniques have been proposed. These techniques aim at improving the latency of single-threaded applications by leveraging multithreading resources to perform memory prefetching via speculative prefetch threads. Software-based speculative precomputation (SSP) is one such technique, proposed for multithreaded Itanium models. SSP does not require expensive hardware support; instead, it relies on the compiler to adapt binaries to perform prefetching on otherwise idle hardware thread contexts at run time. This paper presents a post-pass compilation tool for generating SSP-enhanced binaries. The tool is able to: (1) analyze a single-threaded application to generate prefetch threads; (2) identify and embed trigger points in the original binary; and (3) produce a new binary that has the prefetch threads attached. The execution of the new binary spawns the speculative prefetch threads, which are executed concurrently with the main thread. Our results indicate that for a set of pointer-intensive benchmarks, the prefetching performed by the speculative threads achieves an average of 87% speedup on an in-order processor and 5% speedup on an out-of-order processor.


international symposium on computer architecture | 2006

Multiple Instruction Stream Processor

Richard A. Hankins; Gautham N. Chinya; Jamison D. Collins; Perry H. Wang; Ryan N. Rakvic; Hong Wang; John Paul Shen

Microprocessor design is undergoing a major paradigm shift towards multi-core designs, in anticipation that future performance gains will come from exploiting thread-level parallelism in the software. To support this trend, we present a novel processor architecture called the Multiple Instruction Stream Processing (MISP) architecture. MISP introduces the sequencer as a new category of architectural resource, and defines a canonical set of instructions to support user-level inter-sequencer signaling and asynchronous control transfer. MISP allows an application program to directly manage user-level threads without OS intervention. By supporting the classic cache-coherent shared-memory programming model, MISP does not require a radical shift in the multithreaded programming paradigm. This paper describes the design and evaluation of the MISP architecture for the IA-32 family of microprocessors. Using a research prototype MISP processor built on an IA-32-based multiprocessor system equipped with special firmware, we demonstrate the feasibility of implementing the MISP architecture. We then examine the utility of MISP by (1) assessing the key architectural tradeoffs of the MISP architecture design and (2) showing how legacy multithreaded applications can be migrated to MISP with relative ease.


symposium on code generation and optimization | 2004

Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors

Dongkeun Kim; Steve Shih-wei Liao; Perry H. Wang; J. del Cuvillo; Xinmin Tian; Xiang Zou; Hong Wang; Donald Yeung; Milind Girkar; John Paul Shen

Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promising results on simulators that model a simultaneous multithreading (SMT) processor. We apply the helper threading idea on a real multithreaded machine, i.e., an Intel Pentium 4 processor with hyper-threading technology, and show that it can indeed provide wall-clock speedup on real silicon. To achieve further performance improvements via helper threads, we investigate three helper threading scenarios that are driven by automated compiler infrastructure, and identify several key challenges and opportunities for novel hardware and software optimizations. Our study shows that a program's behavior changes dynamically during execution. In addition, the organizations of certain critical hardware structures in the hyper-threaded processors are either shared or partitioned in the multithreading mode, and thus the tradeoffs regarding resource contention can be intricate. Therefore, it is essential to judiciously invoke helper threads by adapting to the dynamic program behavior so that we can alleviate potential performance degradation due to resource contention. Moreover, since adapting to the dynamic behavior requires frequent thread synchronization, having light-weight thread synchronization mechanisms is important.


field programmable gate arrays | 2009

Intel Nehalem processor core made FPGA synthesizable

Graham Schelle; Jamison D. Collins; Ethan Schuchman; Perry H. Wang; Xiang Zou; Gautham N. Chinya; Ralf Plate; Thorsten Mattner; Franz Olbrich; Per Hammarlund; Ronak Singhal; Jim Brayton; Sebastian Steibl; Hong Wang

We present an FPGA-synthesizable version of the Intel Atom processor core, synthesized to a Virtex-5 based FPGA emulation system. To make the production Atom design in SystemVerilog synthesizable through industry standard EDA tool flow, we transformed and mapped latches in the design, converted clock gating, and replaced nonsynthesizable constructs with FPGA-synthesizable counterparts. Additionally, as the target FPGA emulator is hosted on a PC platform with the Pentium-based CPU socket that supports a significantly different front side bus (FSB) protocol from that of the Atom processor, we replaced the existing bus control logic in the Atom core with an alternate FSB protocol to communicate with the rest of the PC platform. With these efforts, we succeeded in synthesizing the entire Atom processor core to fit within a single Virtex-5 LX330 FPGA. The synthesizable Atom core runs at 50 MHz on the Pentium PC motherboard with fully functional I/O peripherals. It is capable of booting off-the-shelf MS-DOS, Windows XP and Linux operating systems, and executing standard x86 workloads.


international conference on parallel architectures and compilation techniques | 2008

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Henry Wong; Anne Bracy; Ethan Schuchman; Tor M. Aamodt; Jamison D. Collins; Perry H. Wang; Gautham N. Chinya; Ankur Khandelwal Groen; Hong Jiang; Hong Wang

Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.


high performance computer architecture | 2001

Register renaming and scheduling for dynamic execution of predicated code

Perry H. Wang; Hong Wang; Ralph-Michael Kling; Kalpana Ramakrishnan; John Paul Shen

Achieving higher processor performance requires greater synergy between advanced hardware features and innovative compiler techniques. Recent advances in compilation techniques for predicated execution have provided significant opportunities for exploiting instruction-level parallelism. However, little research has been done on how to efficiently execute predicated code in a dynamic microarchitecture. In this paper, we evaluate hardware optimizations for executing predicated code on a dynamically scheduled microarchitecture. We provide two novel ideas to improve the efficiency of executing predicated code. On a generic Intel Itanium processor pipeline model, we demonstrate that, with some microarchitecture enhancements, a dynamic execution processor can achieve about 16% performance improvement over an equivalent static execution processor.


high-performance computer architecture | 2002

Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Perry H. Wang; Hong Wang; Jamison D. Collins; Ed Grochowski; Ralph-Michael Kling; John Paul Shen

The performance of in-order execution Itanium™ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both OOO and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching for L1 misses, whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however, only SP is effective on delinquent loads found in loop control code.
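
As a quick arithmetic check on the figures quoted in this abstract, the percentage improvements can be converted to speedup factors. This is a minimal illustrative sketch: the function name `speedup_factor` is hypothetical, and the "independent product" is only a reference point for comparison, not a number reported in the abstract.

```python
def speedup_factor(improvement_percent: float) -> float:
    """Convert a percentage performance improvement into a speedup factor."""
    return 1.0 + improvement_percent / 100.0


# Figures quoted in the abstract (improvement over the in-order baseline).
sp = speedup_factor(92)     # speculative precomputation alone
ooo = speedup_factor(87)    # out-of-order execution alone
both = speedup_factor(141)  # OOO and SP combined

# Hypothetical reference point: the factor that would result if the two
# gains compounded independently. Not a figure from the abstract.
independent_product = sp * ooo

print(f"SP: {sp:.2f}x, OOO: {ooo:.2f}x, combined: {both:.2f}x")
print(f"independent product: {independent_product:.2f}x")
```

The combined factor (2.41x) falls below the independent product (about 3.59x), which fits the abstract's remark that the two approaches can be partly redundant depending on which delinquent loads each one targets.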


international symposium on microarchitecture | 2004

Helper threads via virtual multithreading

Perry H. Wang; Jamison D. Collins; Dongkeun Kim; Bill Greene; Kai-Ming Chan; A.B. Yunus; Terry Sych; Stephen F. Moore; John Paul Shen; Hong Wang

Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded architecture (SMT), which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread. Helper threading accelerates a program by exploiting a processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with a wall-clock speedup of 5.0 to 38.5 percent.


international parallel and distributed processing symposium | 2005

A dependency chain clustered microarchitecture

Satish Narayanasamy; Hong Wang; Perry H. Wang; John Paul Shen; Brad Calder

In this paper, we explore a new clustering approach for reducing the complexity of wide-issue in-order processors based on EPIC architectures. Complexity effectiveness is achieved by heavily clustering the pipeline from decode to commit stage without the need for any direct bypass between clusters. This is made possible by assuming support for executing compiler-constructed traces. One trace is executed at a time by executing its coarse-grained dependency chains (DCs) in different in-order clusters. Since the DCs of a trace are mutually data independent, they can be executed in different clusters without any direct communication between them. To execute DCs in narrower clusters without compromising ILP, a compiler algorithm that splits large DCs by duplicating instructions is proposed. Through cycle-accurate simulations, we show that a DC processor with one 3-wide, one 2-wide and one 1-wide in-order pipeline could achieve performance equivalent to a 6-wide in-order superscalar processor. Since a clustered DC microarchitecture is complexity efficient, it is amenable to higher clock frequencies and will also be easier to design and validate than a 6-wide monolithic design.
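
One way to picture why dependency chains need no bypass between clusters: if the instructions of a trace are grouped so that every data dependence stays inside a group, the groups exchange no values. The sketch below is a minimal illustration under stated assumptions, not the paper's compiler algorithm (which additionally splits large chains by duplicating instructions). It assumes a trace is given as a list of `(dest, sources)` register tuples in program order, treats chains as connected components of the true-dependence graph, and uses a hypothetical function name, `dependency_chains`.

```python
from collections import defaultdict


def dependency_chains(trace):
    """Group instruction indices into mutually data-independent chains.

    `trace` is a list of (dest, sources) tuples in program order. Only true
    (read-after-write) dependences are followed, which assumes registers
    have been renamed so that name reuse does not matter.
    """
    parent = list(range(len(trace)))

    def find(i):
        # Find the representative of i's group, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    last_writer = {}  # register name -> index of its most recent writer
    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            if src in last_writer:
                # Merge consumer i into the group of its producer.
                parent[find(i)] = find(last_writer[src])
        last_writer[dest] = i

    chains = defaultdict(list)
    for i in range(len(trace)):
        chains[find(i)].append(i)
    # Order chains by their first instruction for stable output.
    return sorted(chains.values())


# Hypothetical six-instruction trace: (destination, sources).
trace = [
    ("r1", []),            # 0
    ("r2", ["r1"]),        # 1: depends on 0
    ("r3", []),            # 2
    ("r4", ["r3"]),        # 3: depends on 2
    ("r5", ["r2", "r1"]),  # 4: depends on 1 and 0
    ("r6", []),            # 5: independent
]
print(dependency_chains(trace))  # [[0, 1, 4], [2, 3], [5]]
```

In this toy trace, the three resulting chains share no registers, so each could in principle be issued to a separate in-order cluster without communicating with the others.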
