Heng Quan
Fudan University
Publication
Featured research published by Heng Quan.
international solid-state circuits conference | 2013
Peng Ou; Jiajie Zhang; Heng Quan; Yi Li; Maofei He; Zheng Yu; Xueqiu Yu; Shile Cui; Jie Feng; Shikai Zhu; Jie Lin; Ming'e Jing; Xiaoyang Zeng; Zhiyi Yu
With the increasing complexity and variety of applications, programmable multi-core processors are drawing attention due to their high flexibility and low implementation cost, yet their performance and energy efficiency still cannot fulfill the demands of many compute-intensive applications. This paper describes a high-performance energy-efficient 24-core processor for multi-media and communication applications, with the following key features: (1) a packet-controlled circuit-switched double-layer network-on-chip (NoC) which provides 11Tb/s/W energy efficiency with 435Gb/s bisection-bandwidth; (2) a cluster-shared NoC-connected heterogeneous reconfigurable execution array, which can improve the performance of frequently used computations in multimedia and communication applications by over 6×; (3) memory hierarchy improvements, including a multi-page foreground and background register file, and memory splitting and sharing. The processor, implemented in TSMC 65nm CMOS LP and occupying 18.8mm² (Fig. 3.6.7), operates at 850MHz at 1.2V, with 523mW power dissipation and 39GOPS/W (26pJ/operation) energy efficiency, which is 1.75× better than our former 16-core processor [3].
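The two energy-efficiency figures quoted in the abstract are the same quantity in different units, and a short calculation shows how they relate. The sketch below is only an illustration: the function names are hypothetical, and the implied throughput assumes that the 523mW and 39GOPS/W figures refer to the same operating point, which the abstract does not state explicitly.

```python
def energy_per_op_pj(gops_per_watt):
    """Convert an efficiency in GOPS/W to energy per operation in pJ."""
    ops_per_joule = gops_per_watt * 1e9   # 1 W = 1 J/s, so GOPS/W = 1e9 ops/J
    return 1e12 / ops_per_joule           # 1 J = 1e12 pJ


def implied_throughput_gops(gops_per_watt, power_mw):
    """Throughput implied by efficiency and power (assumes same operating point)."""
    return gops_per_watt * power_mw / 1000.0


if __name__ == "__main__":
    # 39 GOPS/W corresponds to about 25.6 pJ per operation (quoted as 26 pJ).
    print(round(energy_per_op_pj(39), 1))
    # At 523 mW this would imply roughly 20.4 GOPS under the stated assumption.
    print(round(implied_throughput_gops(39, 523), 1))
```

The small gap between 25.6 and the quoted 26pJ/operation is consistent with rounding in the abstract.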
international solid-state circuits conference | 2012
Zhiyi Yu; Kaidi You; Ruijin Xiao; Heng Quan; Peng Ou; Yan Ying; Haofan Yang; Ming'e Jing; Xiaoyang Zeng
Almost all multicore processors use a shared-memory architecture due to its simple programming model. Recently, however, the message-passing mechanism is also drawing attention due to its potentially better scalability. In this work, we demonstrate that a hybrid communication mechanism supporting both message passing and shared memory can provide both higher performance and energy efficiency. This 16-core processor has 3 key features: (1) A cluster-based hierarchical architecture supporting both shared-memory and message-passing communication. (2) A cache-free memory hierarchy with an extended register file, small private memory and moderate shared memory to avoid complex cache coherence issues and achieve high energy efficiency by keeping data accesses local. (3) A hardware-aided mailbox mechanism to accelerate the synchronization procedure between different processor nodes. With these techniques, our multicore processor can provide high performance for many applications. Chip test results show that its maximum clock frequency is 800MHz and typical power consumption is 320mW, when running basic applications with clock gating at 1.2V at room temperature.
international symposium on circuits and systems | 2012
Yan Ying; Kaidi You; Liyang Zhou; Heng Quan; Ming'e Jing; Zhiyi Yu; Xiaoyang Zeng
As an error-correction code, the Low Density Parity Check (LDPC) code has been widely used in communication standards such as WiMAX and DVB-S2. However, the continuous evolution of these standards, together with the high development cost and low flexibility of hardwired ASIC solutions, has pushed LDPC researchers toward more cost-efficient and flexible implementations, and multi-core-processor-based LDPC decoders have gained increasing attention in the last few years. The performance of such implementations is nevertheless far below that of hardwired ASICs, and one key reason is the high cost of communication between processors. This paper proposes three approaches to reduce the communication cost: optimized algorithm partitioning to reduce communication traffic; exploiting the imbalanced communication between tasks to optimize mapping and reduce overall communication distance; and a simplified data sending-receiving mechanism to reduce the cost of identifying received data. With these approaches, communication accounts for only 12.2% of the total decoding time in the proposed LDPC decoder, whereas it generally occupies 50% of the decoding time in previously reported LDPC decoders on multi-core processors. The proposed work also achieves better throughput than other state-of-the-art works under the same hardware conditions.
IEEE Transactions on Circuits and Systems | 2014
Zhiyi Yu; Ruijin Xiao; Kaidi You; Heng Quan; Peng Ou; Zheng Yu; Maofei He; Jiajie Zhang; Yan Ying; Haofan Yang; Jun Han; Xu Cheng; Zhang Zhang; Ming'e Jing; Xiaoyang Zeng
A 16-core processor with both message-passing and shared-memory inter-core communication mechanisms is implemented in 65 nm CMOS. Message-passing communication is enabled in a 3 × 6 mesh packet-switched network-on-chip, and shared-memory communication is supported using the shared memory within each cluster. The processor occupies 9.1 mm² and is fully functional at a clock rate of 750 MHz at 1.2 V, with a maximum of 800 MHz at 1.3 V. Each core dissipates 34 mW under typical conditions at 750 MHz and 1.2 V while executing embedded applications such as an LDPC decoder, a 3780-point FFT module, an H.264 decoder and an LTE channel estimator.
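To give a feel for the scale of the 3 × 6 mesh mentioned in the abstract, the following sketch computes the minimal hop distance between mesh nodes. It assumes minimal (Manhattan-distance) routing and uniform traffic between all node pairs; neither the routing algorithm nor the traffic pattern is given in the abstract, and the function names are hypothetical.

```python
from itertools import product


def mesh_nodes(rows, cols):
    """All (row, col) coordinates of a rows x cols 2-D mesh."""
    return list(product(range(rows), range(cols)))


def hops(src, dst):
    """Minimal hop count between two mesh nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def diameter(rows, cols):
    """Longest minimal path in the mesh, corner to opposite corner."""
    return (rows - 1) + (cols - 1)


def average_hops(rows, cols):
    """Mean minimal hop count over all ordered pairs of distinct nodes."""
    nodes = mesh_nodes(rows, cols)
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(hops(s, d) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    print(diameter(3, 6))       # 7
    print(average_hops(3, 6))   # 3.0
```

Under these assumptions a message crosses three links on average, which is a purely topological figure and says nothing about the latency measured on the chip.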
ieee international conference on solid-state and integrated circuit technology | 2010
Heng Quan; Ruijin Xiao; Kaidi You; Xiaoyang Zeng; Zhiyi Yu
This paper presents a 32-bit vector multiply-accumulate (MAC) architecture capable of supporting multiple precisions. The vector MAC can perform one 32×32, one 32×16, two 16×16, or four 8×8 bit signed/unsigned multiply-accumulate operations using the Booth encoding algorithm and Wallace tree compression. A reconfigurable Booth encoding array is implemented using an 8×8 Booth unit as the basic element, and longer bit modes are obtained by selectively combining these elements. The MAC unit can also multiply scalar and vector operands. A 32-bit SIMD (Single Instruction Multiple Data) extended ISA (Instruction Set Architecture) and a 3-stage pipeline are implemented for the MAC unit. The design is synthesized in SMIC 0.13 µm technology under worst-case conditions, and the critical path of the MAC is 2.5 ns.
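The idea of building longer multiplies out of 8×8 elements can be illustrated in software. The sketch below is not the paper's design: it uses plain unsigned partial products in place of Booth encoding and Wallace tree compression, handles unsigned operands only, and all function names are hypothetical. It only shows how the same 8×8 building block serves the 8-, 16- and 32-bit lane modes.

```python
def mul8(a, b):
    """One 8x8 unsigned multiply, standing in for a single basic element."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def split_bytes(x, n):
    """Split x into n bytes, least significant byte first."""
    return [(x >> (8 * i)) & 0xFF for i in range(n)]


def mul_wide(a, b, a_bytes, b_bytes):
    """Compose a wider unsigned multiply from shifted 8x8 partial products."""
    total = 0
    for i, a_byte in enumerate(split_bytes(a, a_bytes)):
        for j, b_byte in enumerate(split_bytes(b, b_bytes)):
            total += mul8(a_byte, b_byte) << (8 * (i + j))
    return total


def vector_mac(acc, a, b, lane_bits):
    """Multiply-accumulate per lane of two 32-bit operands.

    lane_bits is 8, 16 or 32, giving 4, 2 or 1 lanes; lane 0 is the
    least significant. acc holds one accumulator per lane.
    """
    lanes = 32 // lane_bits
    n_bytes = lane_bits // 8
    mask = (1 << lane_bits) - 1
    result = []
    for lane in range(lanes):
        a_lane = (a >> (lane * lane_bits)) & mask
        b_lane = (b >> (lane * lane_bits)) & mask
        result.append(acc[lane] + mul_wide(a_lane, b_lane, n_bytes, n_bytes))
    return result


if __name__ == "__main__":
    print(vector_mac([0, 0, 0, 0], 0x01020304, 0x05060708, 8))  # [32, 21, 12, 5]
    print(vector_mac([10, 20], 0x00030002, 0x00050004, 16))     # [18, 35]
```

A 32×16 product fits the same scheme as `mul_wide(a, b, 4, 2)`, which uses eight of the sixteen 8×8 partial products needed for the full 32×32 case.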
international conference on asic | 2011
Yueming Yang; Heng Quan; Zewen Shi; Xiaoyang Zeng; Zhiyi Yu
In this paper we propose a Modified-Minimal-Connected-Component (MMCC) fault block model, which improves on the Minimal-Connected-Component (MCC) fault block model by decoupling links and nodes, to deal with defective links and nodes in 2D-mesh NoCs. Simulation results show that MMCC achieves a higher node utilization rate than MCC and almost the same rate as the orthogonal convex polygons fault block (OFB) model when only nodes are faulty, and that the advantage of MMCC becomes more significant when both nodes and links are faulty. For example, it achieves a 25% higher node utilization rate than OFB with 10% faulty nodes and links in a 25×25 mesh topology.
ieee international conference on solid-state and integrated circuit technology | 2010
Ruijin Xiao; Heng Quan; Kaidi You; Bei Huang; Xiaoyang Zeng; Zhiyi Yu
This paper proposes a novel multi-core processor with a SIMD (Single Instruction Multiple Data) ISA (Instruction Set Architecture) and an extended register file for communication applications. To obtain better parallel computing capability, we implement the SIMD ISA and increase the number of registers in the register file from 32 to 64. A 5×5 homogeneous 2-D mesh NoC (Network-on-Chip) topology is adopted to further enhance parallelism, scalability and programmability. The RS (Reed-Solomon) (255,239,8) decoding algorithm is implemented to evaluate performance. Simulation results show that the processor achieves a throughput of 2.175 Gbps in the worst case of incoming codewords with 8 errors, at a maximum clock frequency of 350MHz in the SMIC 0.13um worst process corner, and that this throughput is higher than that of other published implementations.
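The RS (255,239,8) notation and the quoted throughput can be unpacked with a little arithmetic. The sketch below assumes 8-bit symbols (the standard choice for a length-255 Reed-Solomon code) and shows the cycle budget per block for both possible readings of the throughput figure, since the abstract does not say whether 2.175 Gbps counts codeword bits or message bits. The function names are hypothetical.

```python
def rs_parameters(n, k, symbol_bits):
    """Basic parameters of an RS(n, k) code over symbols of symbol_bits bits."""
    return {
        "parity_symbols": n - k,
        "correctable_symbols": (n - k) // 2,   # t = (n - k) / 2
        "code_rate": k / n,
        "codeword_bits": n * symbol_bits,
        "message_bits": k * symbol_bits,
    }


def cycles_per_block(block_bits, throughput_bps, clock_hz):
    """Clock cycles available per block at a given sustained throughput."""
    return block_bits * clock_hz / throughput_bps


if __name__ == "__main__":
    p = rs_parameters(255, 239, 8)
    print(p["correctable_symbols"])   # 8, matching the third parameter
    print(p["codeword_bits"])         # 2040
    # Cycle budget per block at 2.175 Gbps and 350 MHz, for each reading:
    print(cycles_per_block(p["codeword_bits"], 2.175e9, 350e6))
    print(cycles_per_block(p["message_bits"], 2.175e9, 350e6))
```

Either reading gives a budget of roughly 310 to 330 cycles per block for the whole processor, which indicates how much of the decoding has to run in parallel across the mesh.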
ieee international conference on solid-state and integrated circuit technology | 2010
Xingxing Zhang; Zewen Shi; Heng Quan; Xiaoyang Zeng; Zhiyi Yu
Network-on-Chip (NoC) is the most promising on-chip interconnection scheme for multi-core processors. In this paper, we propose a novel NoC architecture called Stargon, which is inspired by the Spidergon. A simulation model has been developed to evaluate the architecture. We study the effect of the number of nodes, buffer depth and message length on performance, and show that in every case studied Stargon achieves twice the performance of Spidergon.
The Japan Society of Applied Physics | 2013
Zheng Yu; Xueqiu Yu; Shikai Zhu; Peng Ou; Jiajie Zhang; Maofei He; Shile Cui; Kaidi You; Ruijin Xiao; Heng Quan; Xiaoyang Zeng
Two multi-core processors are implemented in 65nm CMOS. One is a 16-core processor integrating both message-passing and shared-memory inter-core communication mechanisms. The other is a 24-core processor with a packet-controlled circuit-switched double-layer Network-on-Chip and a heterogeneous execution array for specific applications. The 16-core processor occupies 9.1mm² and is fully functional at a clock rate of 750MHz at 1.2V, with typical power dissipation of 34mW per core. The 24-core processor occupies 18.8mm², and its clock frequency is improved to 850MHz at 1.2V with power consumption reduced to 22mW per core. An LTE digital baseband and an H.264 baseline intra decoder are implemented on the two multi-core processors with a demonstration platform.
Archive | 2010
Ruijin Xiao; Heng Quan; Zhiyi Yu; Xiaoyang Zeng