Hu Weiwu
Chinese Academy of Sciences
Publications
Featured research published by Hu Weiwu.
Journal of Computer Science and Technology | 2001
Hu Weiwu; Zhang Fuxin; Liu Haiming
A major overhead in software DSM (Distributed Shared Memory) is the cost of remote memory accesses necessitated by the protocol as well as induced by false sharing. This paper introduces a dynamic prefetching method implemented in the JIAJIA software DSM to reduce system overhead caused by remote accesses. The prefetching method records the interleaving string of INV (invalidation) and GETP (getting a remote page) operations for each cached page and analyzes the periodicity of the string when a page is invalidated on a lock or barrier. A prefetching request is issued after the lock or barrier if the periodicity analysis indicates that GETP will be the next operation in the string. Multiple prefetching requests are merged into the same message if they are to the same host. Performance evaluation with eight well-accepted benchmarks in a cluster of sixteen Power PC workstations shows that the prefetching scheme can significantly reduce the page fault overhead and as a result achieves a performance increase of 15%–20% in three benchmarks and around 8%–10% in another three. The average extra traffic caused by useless prefetches is only 7%–13% in the evaluation.
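The prediction logic described above can be pictured with a short sketch. The Python code below is a minimal, hypothetical rendering of the idea, not the JIAJIA implementation: it records the INV/GETP history string per page, looks for the smallest period in that string when a page is invalidated at a lock or barrier, predicts whether the next operation would be GETP, and merges the predicted prefetches by destination host. All names (PagePredictor, prefetch_after_barrier, home_of) are invented for illustration.

```python
# Hedged sketch (not the JIAJIA source): a per-page predictor that records the
# interleaving of INV and GETP events and, at a lock/barrier, issues a prefetch
# when the periodicity of the history string predicts GETP as the next event.
from collections import defaultdict

class PagePredictor:
    def __init__(self):
        self.history = defaultdict(str)   # page id -> string over {'I', 'G'}

    def record(self, page, event):
        """event is 'I' for INV or 'G' for GETP."""
        self.history[page] += event

    def _period(self, s):
        """Smallest p such that s is p-periodic, or None if no period is found."""
        n = len(s)
        for p in range(1, n // 2 + 1):
            if all(s[i] == s[i - p] for i in range(p, n)):
                return p
        return None

    def predict_getp(self, page):
        """True if the detected periodic pattern says the next event is GETP."""
        s = self.history[page]
        p = self._period(s)
        if p is None:
            return False
        return s[len(s) % p] == 'G'   # next symbol under the detected period

def prefetch_after_barrier(predictor, invalidated_pages, home_of):
    """Merge prefetch requests for predicted pages, one message per home host."""
    requests = defaultdict(list)
    for page in invalidated_pages:
        if predictor.predict_getp(page):
            requests[home_of(page)].append(page)
    return dict(requests)   # host -> list of pages, sent as a single message each
```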
international conference on electronics, circuits, and systems | 2008
Ma Lin; Chen Yunji; Su Menghao; Qi Zichu; Zhang Heng; Hu Weiwu
CAM (content-addressable memory) is widely used in microprocessor and SoC TLB modules, where it greatly simplifies software development, yet TLB operations can become a bottleneck for microprocessor performance, and the test cost of a conventional BIST approach for the CAM cannot be ignored. This paper analyzes the fault models of CAM and proposes a march-like algorithm suitable for instruction-level testing. The algorithm requires 14N+2L operations, where N is the number of words in the CAM and L is the width of a word, and it covers 100% of the targeted faults. Instruction-level testing with the algorithm incurs no area or performance cost. Moreover, the algorithm can also be used in BIST approaches with less performance loss for the microprocessor. The paper applies the algorithm to a MIPS-compatible microprocessor with good results.
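The abstract does not spell out the exact 14N+2L operation sequence, so the following Python sketch only illustrates what a march-like CAM test looks like in general: ascending write passes with a background pattern and its complement, each followed by a search-based verification pass over a toy CAM model. The ToyCAM class, the sizes, and the chosen pattern are assumptions for illustration, not the paper's algorithm.

```python
# Hedged illustration only: a generic march-like pass (write a background
# pattern, verify it by search, then repeat with the complement) over a toy
# CAM with N words of L bits each.
N, L = 8, 4            # assumed toy sizes
MASK = (1 << L) - 1

class ToyCAM:
    def __init__(self):
        self.words = [0] * N
    def write(self, addr, data):
        self.words[addr] = data & MASK
    def search(self, data):
        """Return the set of addresses whose stored word matches `data`."""
        return {a for a, w in enumerate(self.words) if w == (data & MASK)}

def march_like_test(cam):
    """March-style passes: ascending writes then search-based checks,
    for a background pattern and its complement."""
    for background in (0x5 & MASK, (~0x5) & MASK):
        for addr in range(N):                      # march element: ascending writes
            cam.write(addr, background)
        for addr in range(N):                      # march element: verify by search
            if addr not in cam.search(background):
                return False                       # a storage or comparison fault detected
    return True

print(march_like_test(ToyCAM()))   # expected: True for a fault-free CAM
```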
Journal of Computer Science and Technology | 1998
Hu Weiwu; Shi Weisong; Tang Zhimin
Previous descriptions of memory consistency models in shared-memory multiprocessor systems are mainly expressed as constraints on memory access event ordering and hence are hardware-centric. This paper presents a framework of memory consistency models which describes a memory consistency model at the behavior level. Based on the understanding that the behavior of an execution is determined by the execution order of conflicting accesses, a memory consistency model is defined as an interprocessor synchronization mechanism which orders the execution of operations from different processors. The synchronization order of an execution under a given consistency model is also defined. The synchronization order, together with the program order, determines the behavior of an execution. This paper also presents criteria for correct programs and correct implementations of consistency models. Viewing an implementation of a consistency model as a set of memory event ordering constraints, the paper provides a method to prove the correctness of consistency model implementations, and the correctness of the lock-based cache coherence protocol is proved with this method.
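As a rough illustration of the framework's central idea (and not the paper's formalism), the sketch below checks whether a candidate total execution order is consistent with program order and with a given synchronization order over conflicting accesses; the access encoding and helper names are invented for this example.

```python
# Hedged sketch: the behavior of an execution is fixed once the execution
# order of conflicting accesses is fixed, and that order must be consistent
# with program order plus the synchronization order imposed by the model.

# An access is (processor, index_in_program, kind, location); kind is 'R' or 'W'.
def conflicting(a, b):
    """Two accesses conflict if they touch the same location and at least one writes."""
    return a[3] == b[3] and 'W' in (a[2], b[2])

def valid_execution(order, sync_order):
    """Check a total execution order against program order and synchronization order."""
    pos = {a: i for i, a in enumerate(order)}
    # program order: on each processor, accesses execute in program-index order
    for a in order:
        for b in order:
            if a[0] == b[0] and a[1] < b[1] and pos[a] > pos[b]:
                return False
    # synchronization order: every ordered pair of conflicting accesses is respected
    for a, b in sync_order:
        if conflicting(a, b) and pos[a] > pos[b]:
            return False
    return True

# Example: P0 writes x, then P1 reads x; the model's synchronization order
# (e.g. a lock release/acquire between them) places the write before the read.
w = (0, 0, 'W', 'x')
r = (1, 0, 'R', 'x')
print(valid_execution([w, r], {(w, r)}))   # True: consistent execution
print(valid_execution([r, w], {(w, r)}))   # False: violates the synchronization order
```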
asian solid state circuits conference | 2009
Zhang Feng; Yang Yi; Yang Zongren; Patrick Chiang; Hu Weiwu
This paper describes a 65 nm 16-bit parallel transceiver IP macro whose bandwidth is 4.8 GByte/s with a 5 pF load, including HBM 2000 V ESD protection. Equalizers, CDR modules, CRC checkers, and 8b/10b encoders are omitted from the design to reduce latency; the total latency is 7 ns without cables. The transceiver has several robust features, including a PVT-independent PLL with calibration, a low-skew differential clock tree, and a stable current-mode driver with common-mode feedback. It can tolerate 20% power supply variation and works properly across process corners and at extreme temperatures. The transceiver is suited to the interfaces of sub-100 nm high-performance processors that require low latency and high stability, and it shows a BER below 10⁻¹⁵ at 3 Gb/s/pin.
Journal of Semiconductors | 2009
Yang Yi; Gao Zhuo; Yang Liqiong; Huang Lingyi; Hu Weiwu
An ultra-wideband (3.1–10.6 GHz) low-noise amplifier in a 0.18 μm CMOS process is presented. It employs a wideband filter for impedance matching, and the current-reuse technique is adopted to lower power consumption. The noise contributions of the second-order and third-order Chebyshev filters used for input matching are analyzed and compared in detail. The measured power gain is 12.4–14.5 dB within the bandwidth, and the noise figure ranges from 4.2 to 5.4 dB over 3.1–10.6 GHz. Good input matching is achieved over the entire bandwidth. The test chip consumes 9 mW (without the output buffer used for measurement) from a 1.8 V supply and occupies 0.88 mm².
international symposium on circuits and systems | 2009
Gao Zhuo; Divya Kesharwani; Patrick Chiang; Hu Weiwu
A DSP method based on sub-sampling followed by an M-point FFT of the sub-sampled signal is used to reduce the phase-locked loop's reference spur. To validate the system's effectiveness, a digital calibration loop is designed in SIMULINK. The results show that the reference spur can be improved by 22 dBc with a 1% residual current mismatch and a 1 nA net leakage current.
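The signal-processing idea (sub-sample the signal, take an M-point FFT, and read the spur level relative to the carrier) can be sketched as follows. This is an illustration only, not the paper's SIMULINK calibration loop; the sampling rate, FFT length, tone placement, and spur level are assumed values.

```python
# Hedged illustration of the measurement idea only: sub-sample a tone
# contaminated by a reference spur, take an M-point FFT, and read off the
# spur level relative to the carrier.
import numpy as np

fs, M = 100e6, 1024                       # assumed sub-sampling rate and FFT length
carrier_bin, spur_bin = 16, 24            # bins chosen so both tones fall exactly on FFT bins
f_carrier = carrier_bin * fs / M
f_spur = spur_bin * fs / M
spur_dbc = -40.0                          # assumed spur level to recover

n = np.arange(M)
signal = np.cos(2 * np.pi * f_carrier * n / fs)
signal += 10 ** (spur_dbc / 20) * np.cos(2 * np.pi * f_spur * n / fs)

spectrum = np.abs(np.fft.rfft(signal))    # coherent sampling, so no window is needed
measured_dbc = 20 * np.log10(spectrum[spur_bin] / spectrum[carrier_bin])
print(f"measured spur level: {measured_dbc:.1f} dBc")   # approximately -40.0 dBc
```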
Journal of Semiconductors | 2009
Gao Zhuo; Yang Zongren; Zhao Ying; Yang Yi; Zhang Lu; Huang Lingyi; Hu Weiwu
This paper presents the design of a 10 Gb/s low-power wire-line receiver in a 65 nm CMOS process with a 1 V supply voltage. The receiver occupies 300 × 500 μm². With a novel half-rate period-calibration clock and data recovery (CDR) circuit, the receiver consumes 52 mW. The receiver can compensate for a wide range of channel loss by combining a low-power wideband programmable continuous-time linear equalizer (CTLE) with a decision feedback equalizer (DFE).
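The DFE portion of such a receiver can be understood through a behavioral sketch: each received sample is corrected by subtracting weighted copies of previously decided symbols before it is sliced. The Python model below is a generic illustration of that principle, not this chip's circuit; the channel response and tap values are assumed.

```python
# Hedged sketch of the decision-feedback-equalizer idea only: each received
# sample is corrected by subtracting weighted copies of previously decided
# symbols (post-cursor ISI) before it is sliced to +1/-1.
import numpy as np

def dfe(samples, taps):
    """Equalize a stream of samples with feedback taps over past decisions."""
    decisions = []
    history = [0.0] * len(taps)               # most recent decision first
    for x in samples:
        corrected = x - np.dot(taps, history) # cancel post-cursor ISI
        d = 1.0 if corrected >= 0 else -1.0   # binary slicer
        decisions.append(d)
        history = [d] + history[:-1]
    return decisions

# Example: bits distorted by a channel with post-cursor ISI of 0.4 and 0.2.
bits = np.array([1, -1, -1, 1, 1, -1, 1, -1], dtype=float)
isi = np.array([1.0, 0.4, 0.2])               # assumed channel pulse response
received = np.convolve(bits, isi)[:len(bits)]
print(np.array_equal(dfe(received, taps=[0.4, 0.2]), bits))   # True: decisions match
```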
Journal of Semiconductors | 2009
Gao Zhuo; Yang Yi; Zhong Shiqiang; Yang Xu; Huang Lingyi; Hu Weiwu
This paper presents the design of a 10 Gb/s PAM2, 20 Gb/s PAM4 high-speed, low-power wire-line transceiver equalizer in a 65 nm CMOS process with a 1 V supply voltage. The transmitter occupies 430 × 240 μm² and consumes 50.56 mW. With the programmable fifth-order pre-emphasis equalizer, the transmitter can compensate for a wide range of channel loss and send a signal with adjustable voltage swing. The receiver equalizer occupies 146 × 186 μm² and consumes 5.3 mW.
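Transmit pre-emphasis of this kind is essentially an FIR filter applied to consecutive symbols, boosting transitions and de-emphasizing repeated symbols. The sketch below shows the principle with a two-tap example; the tap values are illustrative, since the abstract only states that the chip's equalizer is fifth-order and programmable.

```python
# Hedged sketch of transmit pre-emphasis as an FIR filter over consecutive
# symbols; tap values are illustrative, not the chip's settings.
import numpy as np

def pre_emphasis(symbols, taps, swing=1.0):
    """FIR pre-emphasis: each output is a weighted sum of the current and
    previous symbols, scaled by the configured output voltage swing."""
    return swing * np.convolve(symbols, taps)[:len(symbols)]

# Example: PAM2 symbols with a main tap plus a negative first post-cursor tap.
# Symbols that follow a transition come out with larger amplitude than repeats.
bits = np.array([1, 1, -1, -1, 1, -1], dtype=float)
taps = np.array([0.75, -0.25])        # illustrative main and post-cursor weights
print(pre_emphasis(bits, taps))       # e.g. [0.75, 0.5, -1.0, -0.5, 1.0, -1.0]
```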
ieee region 10 conference | 2006
Ma Lin; Shen Haihua; Hu Weiwu
Simulation for dynamic timing analysis (DTA) with delay information is now the most time-consuming step of VLSI design. Although static timing analysis (STA) has been well developed in recent years for timing analysis and checking, it does not work for designs such as multi-clock-domain designs, so DTA cannot be replaced. In this paper, we propose an STA-based simulation acceleration methodology for designs where STA does not work well. We simulate the design with a mixture of modules carrying delay information and modules without delay information, so that simulation finishes in much less time. The time saved depends on the design's implementation. In the Godson-2 work, we obtained a very good result, finishing the simulation in one-sixth of the time of the ordinary method.
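The partitioning idea can be pictured with a small, hypothetical helper: modules that STA cannot cover (for example, those crossing clock domains) keep their back-annotated delay information, while all other modules are simulated with zero delay. The module names and decision rule below are invented for illustration and are not taken from the Godson-2 flow.

```python
# Hedged, hypothetical illustration of the partitioning idea only: modules that
# STA cannot cover keep delay (SDF) annotation, the rest run zero-delay, which
# is what cuts simulation time. Module names and the rule are invented.
def plan_simulation(modules, crosses_clock_domains):
    """Return a per-module setting: 'sdf' (delay-annotated) or 'zero-delay'."""
    return {m: ('sdf' if crosses_clock_domains(m) else 'zero-delay')
            for m in modules}

modules = ['fetch', 'alu', 'async_fifo', 'ddr_ctrl', 'cache']
cdc_modules = {'async_fifo', 'ddr_ctrl'}          # assumed clock-domain crossings
plan = plan_simulation(modules, lambda m: m in cdc_modules)
for module, mode in plan.items():
    print(f"{module:10s} -> {mode}")
```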
Archive | 2005
Hu Weiwu; Li Zusong; Qi Zichu