Weiwu Hu

Chinese Academy of Sciences

Publication

Featured research published by Weiwu Hu.

International Symposium on Microarchitecture | 2009

Godson-3: A Scalable Multicore RISC Processor with x86 Emulation

Weiwu Hu; Jian Wang; Xiang Gao; Yunji Chen; Qi Liu; Guojie Li

The Godson-3 microprocessor aims at high-throughput server applications, high-performance scientific computing, and high-end embedded applications. It offers a scalable network on chip, hardware support for x86 emulation, and a reconfigurable architecture. The four-core Godson-3 chip is fabricated with 65-nm CMOS technology. Eight- and 16-core Godson-3 chips are in development.

IEEE International Conference on High Performance Computing Data and Analytics | 1999

JIAJIA: A Software DSM System Based on a New Cache Coherence Protocol

Weiwu Hu; Weisong Shi; Zhimin Tang

This paper describes the design and evaluation of a software distributed shared memory (DSM) system called JIAJIA. JIAJIA is a home-based software DSM system in which the physical memories of multiple computers are combined to form a larger shared space. It implements a lock-based cache coherence protocol that eliminates the directory entirely and maintains coherence through write notices kept on the lock. Our experiments with widely accepted DSM benchmarks, such as the SPLASH2 program suite and the NAS Parallel Benchmarks, indicate that JIAJIA achieves higher performance than recent software DSMs such as CVM. In addition, JIAJIA can solve large problems that other software DSMs cannot solve because of memory size limitations.
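The idea of keeping write notices on the lock, instead of in a directory, can be illustrated with a small single-process simulation. This is a minimal sketch and not JIAJIA's implementation: the class names `Lock` and `Node`, the page-granularity notices, the `home` dictionary standing in for home memory, and the choice to flush modified pages to the home at release time are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Lock:
    """Hypothetical lock that carries the write notices of past critical sections."""
    write_notices: Set[int] = field(default_factory=set)  # ids of modified pages


@dataclass
class Node:
    """Hypothetical processor with a local cache of shared pages."""
    cache: Dict[int, str] = field(default_factory=dict)   # page id -> cached copy
    dirty: Set[int] = field(default_factory=set)          # pages written locally

    def acquire(self, lock: Lock) -> None:
        # The acquirer invalidates every cached page named in the lock's notices.
        for page in lock.write_notices:
            self.cache.pop(page, None)

    def read(self, home: Dict[int, str], page: int) -> str:
        # A miss is served from the page's home copy (assumed always reachable).
        if page not in self.cache:
            self.cache[page] = home[page]
        return self.cache[page]

    def write(self, page: int, value: str) -> None:
        self.cache[page] = value
        self.dirty.add(page)

    def release(self, lock: Lock, home: Dict[int, str]) -> None:
        # The releaser flushes its writes home (an assumption of this sketch)
        # and stores one write notice per modified page on the lock.
        for page in self.dirty:
            home[page] = self.cache[page]
            lock.write_notices.add(page)
        self.dirty.clear()
```

In this sketch a reader that never acquires the lock keeps seeing its stale cached copy; only the acquire step consults the notices, so no node needs directory information about who caches what.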

Journal of Computer Science and Technology | 2005

Microarchitecture of the Godson-2 processor

Weiwu Hu; Fuxin Zhang; Zusong Li

The Godson project is the first attempt to design high-performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor, a 64-bit, 4-issue, out-of-order execution RISC processor that implements a 64-bit MIPS-like instruction set. The adoption of aggressive out-of-order execution techniques (such as register mapping, branch prediction, and dynamic scheduling) and cache techniques (such as non-blocking cache, load speculation, and dynamic memory disambiguation) helps the Godson-2 processor achieve high performance even at a modest clock frequency. The Godson-2 processor has been physically implemented in a 6-metal 0.18 μm CMOS technology using an automatic place-and-route flow, with the help of some crafted library cells and macros. The area of the chip is 6,700 micrometers by 6,200 micrometers and the clock cycle at the typical corner is 2.3 ns.
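For orientation, the two figures quoted above translate into a die area and a clock frequency as follows. This is plain unit conversion of the stated numbers, not additional data from the paper.

\[
A = 6.7\,\mathrm{mm} \times 6.2\,\mathrm{mm} = 41.54\,\mathrm{mm}^2,
\qquad
f = \frac{1}{2.3\,\mathrm{ns}} \approx 435\,\mathrm{MHz}
\]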

International Symposium on Computer Architecture | 2010

LReplay: a pending period based deterministic replay scheme

Yunji Chen; Weiwu Hu; Tianshi Chen; Ruiyang Wu

Debugging parallel programs is a well-known difficult problem. A promising way to facilitate it is to use hardware support to achieve deterministic replay. A hardware-assisted deterministic replay scheme should have a small log size, as well as low design cost, to be feasible for adoption in industrial processors. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as previous investigations did, LReplay is built upon recording the pending period information [6]. According to the experimental results on Godson-3, the overall log size of LReplay is about 0.55B/K-Inst (bytes per k-instruction) for sequential consistency, and 0.85B/K-Inst for Godson-3 consistency. The log size is an order of magnitude smaller than that of state-of-the-art deterministic replay schemes incurring no performance loss. Furthermore, LReplay consumes only about 1.3% of the area of Godson-3, since it requires only trivial modifications to the existing components of Godson-3. The above features of LReplay demonstrate the potential of integrating hardware-assisted deterministic replay into future industrial processors.

High-Performance Computer Architecture | 2009

Fast complete memory consistency verification

Yunji Chen; Yi Lv; Weiwu Hu; Tianshi Chen; Haihua Shen; Pengyu Wang; Hong Pan

The verification of an execution against memory consistency is known to be NP-hard. This paper proposes a novel fast memory consistency verification method by identifying a new natural partial order: time order. In multiprocessor systems with store atomicity, a time order restriction exists between two operations whose pending periods are disjoint: the former operation in time order must be observed by the latter operation. Based on the time order restriction, memory consistency verification is localized: for any operation, both inferring related orders and checking related cycles need to take into account only a bounded number of operations. Our method has been implemented in a memory consistency verification tool for CMP (Chip Multi Processor), named LCHECK. The time complexity of the algorithm in LCHECK is $O(C^p p^2 n^2)$ (where $C$ is a constant, $p$ is the number of processors, and $n$ is the number of operations) for sound and complete checking, and $O(p^3 n)$ for sound but incomplete checking. LCHECK has been integrated into both pre- and post-silicon verification platforms of the Godson-3 microprocessor, and many memory consistency and cache coherence bugs were found with the help of LCHECK.

Journal of Computer Science and Technology | 2007

Implementing a 1GHz four-issue out-of-order execution microprocessor in a standard cell ASIC methodology

Weiwu Hu; Ji-Ye Zhao; Shiqiang Zhong; Xu Yang; Elio Guidetti; Chris Wu

This paper introduces the microarchitecture and physical implementation of the Godson-2E processor, which is a four-issue superscalar RISC processor that supports the 64-bit MIPS instruction set. The adoption of aggressive out-of-order execution and memory hierarchy techniques helps Godson-2E achieve high performance. The Godson-2E processor has been physically designed in a 7-metal 90nm CMOS process using the cell-based methodology with some bit-sliced manual placement and a number of crafted cells and macros. The processor can run at 1GHz and achieves a SPEC CPU2000 rate higher than 500.

International Parallel Processing Symposium | 1999

Reducing system overheads in home-based software DSMs

Weiwu Hu; Weisong Shi; Zhimin Tang

Software DSM systems suffer from high communication and coherence-induced overheads that limit performance. This paper introduces our efforts to reduce the system overheads of a home-based software DSM called JIAJIA. Three measures are taken: eliminating false sharing by avoiding unnecessary invalidation of cached pages, reducing virtual memory page faults with a new write detection scheme, and propagating barrier messages hierarchically. Evaluation with some well-known DSM benchmarks reveals that, although the benefit varies with the memory reference patterns of different applications, these measures reduce the system overhead of JIAJIA effectively.

Journal of Computer Science and Technology | 1998

A lock-based cache coherence protocol for scope consistency

Weiwu Hu; Weisong Shi; Zhimin Tang; Ming Li

Directory protocols are widely adopted to maintain cache coherence in distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent them from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope consistency. It does not rely on directory information to maintain cache coherence. Instead, cache coherence is maintained by requiring the releasing processor of a lock to store all write notices generated in the associated critical section on the lock, and the acquiring processor to invalidate or update its locally cached data copies according to the write notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA is built on a network of workstations. Besides the lock-based cache coherence protocol, JIAJIA is also characterized by its shared memory organization scheme, which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with the SPLASH2 program suite and the NAS benchmarks indicate that JIAJIA achieves higher speedup than recent SVM systems such as CVM. In addition, JIAJIA can solve large-scale problems that other SVM systems cannot solve because of memory size limitations.

International Solid-State Circuits Conference | 2011

Godson-3B: A 1GHz 40W 8-core 128GFLOPS processor in 65nm CMOS

Weiwu Hu; Ru Wang; Yunji Chen; Baoxia Fan; Shiqiang Zhong; Xiang Gao; Zichu Qi; Xu Yang

High-Performance Computer Architecture | 2010

DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance

Dan Tang; Yungang Bao; Weiwu Hu; Mingyu Chen