Zichu Qi
Chinese Academy of Sciences
Publication
Featured research published by Zichu Qi.
international solid-state circuits conference | 2011
Weiwu Hu; Ru Wang; Yunji Chen; Baoxia Fan; Shiqiang Zhong; Xiang Gao; Zichu Qi; Xu Yang
As the latest product of the Godson processor series, the Godson-3B processor is an 8-core high-performance general-purpose processor implemented in a 65nm CMOS low-power general-purpose mixed process with 7 layers of Cu metallization. Godson-3B contains 582.6M transistors (including a 4MB L2 cache) within an area of 299.8mm². The number of signal pins in Godson-3B is 654. The highest frequency of Godson-3B is 1.05GHz, and the peak performance is 128GFLOPS (double-precision) or 256GFLOPS (single-precision) at 1GHz with 40W power consumption. Godson-3B has an energy efficiency of 3.2GFLOPS/W. Other state-of-the-art high-performance processors have an energy efficiency of ∼1.3GFLOPS/W (Westmere [2]) and ∼1.5GFLOPS/W (Power7 [3,4]). As shown in Fig. 4.4.1, Godson-3B contains 2 nodes and can scale to 16 nodes through inter-chip connections. Each node contains four cores, four L2-cache banks, one HyperTransport (HT) controller, one DDR2/3 controller, and the interconnection network connecting these components together. The interconnection network uses the 128-bit AXI standard interface with a cache-coherence extension.
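The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are hypothetical, and the per-core throughput is derived from the quoted peak numbers rather than stated in the abstract.

```python
def energy_efficiency(gflops: float, watts: float) -> float:
    """Return energy efficiency in GFLOPS per watt."""
    return gflops / watts


def flops_per_cycle_per_core(gflops: float, cores: int, freq_ghz: float) -> float:
    """Return floating-point operations per clock cycle per core."""
    return gflops / (cores * freq_ghz)


# Values quoted in the abstract (peak performance at 1GHz, 40W, 8 cores).
DP_GFLOPS = 128.0
SP_GFLOPS = 256.0
POWER_W = 40.0
CORES = 8
FREQ_GHZ = 1.0

efficiency = energy_efficiency(DP_GFLOPS, POWER_W)
dp_per_cycle = flops_per_cycle_per_core(DP_GFLOPS, CORES, FREQ_GHZ)
sp_per_cycle = flops_per_cycle_per_core(SP_GFLOPS, CORES, FREQ_GHZ)

print(f"energy efficiency: {efficiency:.1f} GFLOPS/W")
print(f"double-precision FLOPs per cycle per core: {dp_per_cycle:.0f}")
print(f"single-precision FLOPs per cycle per core: {sp_per_cycle:.0f}")
```

The 3.2GFLOPS/W figure follows directly from 128GFLOPS at 40W. The derived per-core numbers are an inference from the peak figures, not a claim made by the authors.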
international conference on vlsi design | 2010
Zichu Qi; Qi Guo; Ge Zhang; Xiangku Li; Weiwu Hu
This paper presents a floating-point fused multiply-add (FMA) unit that uses low-cost, low-power techniques. To improve performance, two single-precision operations can be performed concurrently on one double-precision datapath, which is useful in multimedia and even scientific applications. Moreover, to reduce the additional area cost of supporting two single-precision operations in parallel, multiple double-precision units, i.e., the multiplier, shifter and adder, are reused as much as possible. A modified dual-path algorithm is proposed that classifies the exponent difference into three cases and implements them with CLOSE and FAR paths, which reduces latency and helps lower power consumption by enabling only one of the two paths. In addition, for FADD instructions, the multiplier in the first stage is bypassed and kept in a stable state, which significantly improves FADD performance and lowers power consumption. The overall FMA unit has a latency of 4 cycles, while the FADD operation takes 3 cycles. Each cycle has a delay of about 0.66ns in ST 65nm CMOS technology. Compared with a conventional double-precision FMA, delay is reduced by about 13% and area is increased by about 22%, which is acceptable since two single-precision results can be generated simultaneously.
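For readers unfamiliar with what distinguishes a fused multiply-add from a separate multiply and add, the sketch below models the single rounding step in software. It is a behavioural illustration of the FMA operation in general, not of the datapath described in this paper. The function names are hypothetical, and exact rational arithmetic stands in for the wide internal datapath of real hardware.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double precision.

    Assumes finite inputs; Fraction represents each double exactly.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single, correctly rounded conversion


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b rounded to double, then add c with a second rounding."""
    product = a * b
    return product + c


# (1 + 2^-30) * (1 - 2^-30) = 1 - 2^-60, which rounds to 1.0 as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = separate_multiply_add(a, b, c)

print("fused:  ", fused)
print("unfused:", unfused)
```

With these inputs the separate multiply loses the low-order term when the product is rounded, so the subsequent add returns zero, while the fused version keeps the exact residual of -2^-60.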
international solid-state circuits conference | 2013
Weiwu Hu; Yifu Zhang; Liang Yang; Baoxia Fan; Yunji Chen; Shiqiang Zhong; Huandong Wang; Zichu Qi; Pengyu Wang; Xiang Gao; Xu Yang; Bin Xiao; Hongsheng Wang; Zongren Yang; Liqiong Yang; Shuai Chen
Godson-3B1500 is an 8-core microprocessor product of Loongson Technology™. It is fabricated in 32nm 10 Cu-layer high-κ metal-gate (HKMG) low-power bulk CMOS, and contains 1.14 billion transistors within a 182.5mm² die area. Through numerous design improvements, Godson-3B1500 is able to support a wide voltage range from 1.0V to 1.3V, at frequencies ranging from 1.0GHz to 1.5GHz, achieving 172.8GFLOPS at 1.35GHz, with nearly 40W power dissipation. This represents a 35% power-efficiency improvement over the previous design, Godson-3B [1].
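The 35% figure can be reproduced from the numbers on this page. The sketch below assumes that "nearly 40W" is taken as exactly 40W and that the baseline is the 3.2GFLOPS/W reported for Godson-3B. Both are simplifying assumptions, and the variable names are hypothetical.

```python
# Quoted for Godson-3B1500: 172.8 GFLOPS at 1.35GHz, "nearly 40W".
GODSON_3B1500_GFLOPS = 172.8
ASSUMED_POWER_W = 40.0  # assumption: "nearly 40W" rounded to 40W

# Quoted for the previous design, Godson-3B.
GODSON_3B_GFLOPS_PER_W = 3.2

new_efficiency = GODSON_3B1500_GFLOPS / ASSUMED_POWER_W
improvement = new_efficiency / GODSON_3B_GFLOPS_PER_W - 1.0

print(f"Godson-3B1500 efficiency: {new_efficiency:.2f} GFLOPS/W")
print(f"improvement over Godson-3B: {improvement:.0%}")
```

Under these assumptions the efficiency comes out at about 4.32GFLOPS/W, which matches the stated 35% improvement.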
asian test symposium | 2009
Zichu Qi; Hui Liu; Xiangku Li; Da Wang; Yinhe Han; Huawei Li; Weiwu Hu
This paper describes the scan test challenges and techniques used in the Godson-3 microprocessor, a scalable multicore processor that is based on the SMOC (Scalable Mesh of Crossbar) on-chip network and targets high-end applications. Advanced techniques are adopted to achieve a scalable, low-power and low-cost scan architecture despite limited I/O resources and a large transistor count. To achieve scalable and flexible test access, an elaborate TAM (Test Access Mechanism) is implemented that supports multiple test instructions and test modes. Taking advantage of the multiple cores embedded in the processor, scan partitions are employed to reduce test power and test time, and test compression with a compression ratio of more than 10X is used to decrease the scan chain length. To further decrease test time, a Data-Synchronous-Comparator (DSC) is proposed for comparing the scan responses of the identical cores.
Journal of Computer Science and Technology | 2006
Ge Zhang; Weiwu Hu; Zichu Qi
The leading zero anticipation (LZA) algorithm and its implementation are vital to the performance of a high-speed floating-point adder in today’s state-of-the-art microprocessor designs. Unfortunately, when a conventional LZA design predicts the “shift amount”, the result can be off by one position. This paper presents a novel parallel error detection algorithm for a general-case LZA. The proposed approach enables parallel execution of the conventional LZA and its error detection, so that the error-indication signal can be generated earlier in the normalization stage, thus reducing the critical path and improving overall performance. The circuit implementation of this algorithm also shows area and power advantages over previous work.
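To see why an anticipated shift amount can be off by one, the sketch below models a textbook-style anticipation on the signed-digit string a - b, for the case a > b. It is a sequential software model written for illustration, not the parallel error detection algorithm of this paper, and all function names are hypothetical. The prediction uses only the leading prefix of the form 0…0, 1, -1…-1. The digits after that prefix decide whether a borrow lowers the true leading one by one position.

```python
def leading_one_position(value: int) -> int:
    """Exact bit position of the most significant 1 (value must be > 0)."""
    return value.bit_length() - 1


def anticipate_leading_one(a: int, b: int, width: int) -> int:
    """Predict the leading-one position of a - b without subtracting.

    Requires 0 <= b < a < 2**width. The result equals the true position
    or exceeds it by exactly one.
    """
    if not (0 <= b < a < (1 << width)):
        raise ValueError("requires 0 <= b < a < 2**width")
    # Signed digits a_i - b_i in {-1, 0, 1}, most significant first.
    digits = [((a >> i) & 1) - ((b >> i) & 1) for i in range(width - 1, -1, -1)]
    idx = 0
    while digits[idx] == 0:  # skip leading zeros
        idx += 1
    idx += 1  # consume the first non-zero digit, which is +1 because a > b
    while idx < width and digits[idx] == -1:  # consume the run of -1 digits
        idx += 1
    return width - idx  # bit position of the last consumed digit


def normalization_shift(a: int, b: int, width: int) -> tuple[int, bool]:
    """Return (predicted left shift, whether one extra shift is needed)."""
    predicted = anticipate_leading_one(a, b, width)
    actual = leading_one_position(a - b)
    return (width - 1 - predicted, predicted != actual)


print(normalization_shift(8, 7, 4))  # prediction is exact
print(normalization_shift(8, 1, 4))  # prediction is one position too high
```

In this model the correction flag is obtained by comparing against the true difference, which is what a serial fix-up after the adder amounts to. The abstract's contribution is to generate that error indication in parallel with the LZA itself.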
international conference on electronics, circuits, and systems | 2009
Zichu Qi; Hui Liu; Xiangku Li; Jun Xu; Weiwu Hu
For a gigahertz microprocessor with multiple clock domains and a large number of embedded RAMs (Random Access Memories), generating at-speed test patterns is becoming very difficult and time-consuming. This paper presents some novel techniques to improve at-speed test coverage at low cost. These methods are mainly concerned with preventing the propagation of X states, and include avoiding the capture of X states in registers, sequential bypass of macros, a clock control scheme for inter-clock domains and accurate analysis of exception paths in intra-clock domains. Functional patterns are used to further improve the efficiency of at-speed testing. A novel optimal flow is presented by carefully selecting these techniques. Using this flow, 90% transition fault coverage is achieved. In addition, both the number of patterns and the test time of the transition test are decreased by 15%. The total area overhead is a few hundred AND cells, with little timing impact on the critical paths.
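The role an AND cell can play in stopping an unknown value is easy to show with three-valued logic. The sketch below is a generic illustration of X blocking, assuming a hypothetical test-mode enable signal. The abstract does not describe how its AND cells are actually connected, so this should not be read as the authors' circuit.

```python
X = "X"  # unknown logic value, e.g. the output of an uninitialized RAM


def and3(a, b):
    """Three-valued AND over 0, 1 and X."""
    if a == 0 or b == 0:
        return 0  # a controlling 0 masks any unknown on the other input
    if a == 1 and b == 1:
        return 1
    return X


def gated_output(source, enable):
    """Hypothetical X-blocking gate: pass source only when enable is 1."""
    return and3(source, enable)


# With the gate disabled, downstream logic sees a known 0.
print(gated_output(X, 0))
# With the gate enabled, the unknown value propagates.
print(gated_output(X, 1))
```

Forcing the enable input to 0 during capture gives the downstream register a deterministic value, which is the general idea behind avoiding the capture of X states.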
international symposium on circuits and systems | 2005
Ge Zhang; Zichu Qi; Weiwu Hu
Archive | 2011
Weiwu Hu; Hui Liu; Zichu Qi
Archive | 2010
Yunji Chen; Weiwu Hu; Xiangku Li; Xiaoyu Li; Lin Ma; Zichu Qi
Archive | 2010
Zichu Qi; Qi Guo; Weiwu Hu