Publication


Featured research published by Guanghui He.


IEEE Transactions on Circuits and Systems II: Express Briefs | 2013

VLSI Implementation of a High-Throughput Iterative Fixed-Complexity Sphere Decoder

Xi Chen; Guanghui He; Jun Ma

By exchanging soft information between the multiple-input multiple-output (MIMO) detector and the channel decoder, an iterative receiver can significantly improve performance compared with a noniterative receiver. In this brief, a soft-input soft-output fixed-complexity sphere-decoding algorithm and its very-large-scale integration (VLSI) architecture are proposed for the iterative MIMO receiver. The deeply pipelined architecture employs an optimized hybrid enumeration to search for the best child node estimate efficiently. By adding the counter hypotheses in parallel with other candidates, the proposed iterative MIMO detector improves the detection performance significantly with low detection latency. An iterative detector for a 4 × 4 64-quadrature amplitude modulation (QAM) MIMO system based on the proposed architecture is designed and implemented in 90-nm CMOS technology. The detector achieves a maximum throughput of 2.2 Gbit/s with an area efficiency of 3.96 Mbit/s/kGE, which is more efficient than other iterative MIMO detectors.


signal processing systems | 2011

Generalized interleaving network based on configurable QPP architecture for parallel turbo decoder

Shuaijie Wang; Jun Ma; Guanghui He; Zhigang Mao

The quadratic permutation polynomial (QPP) interleaver is a contention-free interleaver that is suitable for parallel turbo decoder implementation. In this paper, a systematic recursive method for designing a configurable QPP interleaving multistage network is proposed based on the properties of QPP. Owing to its recursive construction, the proposed network for a 2^n-level parallel turbo decoder can be used for any 2^i parallelism (0 < i ≤ n) without redesigning the network for each level of parallelism. The address generator is modified to provide control signals to the network. Furthermore, the proposed QPP architecture is generalized to support arbitrary contention-free interleavers by appending an additional specially designed network. When the whole network is used in a multi-standard design, the appended network can be turned off in QPP interleaver mode, reducing dynamic power by more than 49% for parallelism greater than 16.
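
The contention-free property mentioned in this abstract can be checked numerically. The following is a minimal sketch, not the paper's network: it evaluates the QPP index π(i) = (f1·i + f2·i²) mod K and verifies that P parallel windows never address the same memory bank in the same step. The function names are hypothetical, and the parameters K = 40, f1 = 3, f2 = 10 are taken from the LTE interleaver table as an illustrative case.

```python
def qpp_interleave(i, K, f1, f2):
    """Return the QPP interleaved index pi(i) = (f1*i + f2*i^2) mod K."""
    return (f1 * i + f2 * i * i) % K


def is_contention_free(K, f1, f2, P):
    """Check that P parallel windows never hit the same memory bank.

    The block of length K is split into P windows of length W = K // P.
    At step t, window p accesses address pi(t + p*W); the bank index is
    that address divided by W. All P bank indices must be distinct.
    """
    W = K // P
    for t in range(W):
        banks = {qpp_interleave(t + p * W, K, f1, f2) // W for p in range(P)}
        if len(banks) != P:
            return False
    return True


# Illustrative parameters (assumed here: the K = 40 entry of the LTE table).
K, F1, F2 = 40, 3, 10
permutation = [qpp_interleave(i, K, F1, F2) for i in range(K)]
results = {P: is_contention_free(K, F1, F2, P) for P in (2, 4, 8)}
```

Because π(i + W) ≡ π(i) (mod W) whenever W divides K, two windows landing in the same bank would have to read the same address, which a permutation rules out. That is why the check passes for every power-of-two parallelism that divides K.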


international symposium on circuits and systems | 2011

Memory efficient layered decoder design with early termination for LDPC codes

Jiangpeng Li; Guanghui He; Hexi Hou; Zhejun Zhang; Jun Ma

The layered structure is widely used in the design of low-density parity-check (LDPC) code decoders because of its fast convergence. However, the parity-check process that confirms a correct codeword is difficult to implement in a layered decoder, which results in unnecessary iterations. In this paper, an early termination strategy is presented for layered LDPC decoders to avoid redundant iterations. This approach compares the current log-likelihood ratios (LLRs) with the updated LLRs of all variable nodes to determine the termination criterion. Furthermore, a non-uniform quantization scheme and an extrinsic message memory optimization scheme are developed to save memory. Based on these methods, an LDPC decoder for Chinese digital mobile TV applications is implemented in a SMIC 130 nm CMOS process. The decoder consumes only 171 Kbits of memory while achieving 267 Mbps for code rate 1/2 and 401 Mbps for code rate 3/4.
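
A minimal sketch of an LLR-comparison stopping rule follows. The abstract does not give the exact criterion, so this example assumes the simplest variant: stop when no variable node changes its hard decision (LLR sign) between the current and the updated LLRs. All function names, and the `update_fn` stand-in for one layered iteration, are hypothetical.

```python
def hard_decisions(llrs):
    """Map each LLR to a bit: negative LLR -> 1, otherwise 0."""
    return [1 if x < 0 else 0 for x in llrs]


def should_terminate(current_llrs, updated_llrs):
    """Assumed criterion: stop if no variable node flipped its hard decision."""
    return hard_decisions(current_llrs) == hard_decisions(updated_llrs)


def decode_with_early_termination(llrs, update_fn, max_iters=20):
    """Apply update_fn (one decoding iteration) until decisions are stable.

    Returns the final LLRs and the number of iterations actually run.
    """
    for iteration in range(1, max_iters + 1):
        updated = update_fn(llrs)
        if should_terminate(llrs, updated):
            return updated, iteration
        llrs = updated
    return llrs, max_iters


# Toy stand-in for a decoder iteration: every LLR drifts upward by 1.0.
toy_update = lambda values: [x + 1.0 for x in values]
final_llrs, iterations_used = decode_with_early_termination([-0.5, 2.0], toy_update)
```

In the toy run the first node flips sign in iteration 1 and is stable in iteration 2, so the loop stops after two iterations instead of running to `max_iters`.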


international symposium on circuits and systems | 2012

VLSI implementation of an 855 Mbps high performance soft-output K-Best MIMO detector

Chunhui Ju; Jun Ma; Chengzhi Tian; Guanghui He

The multiple-input multiple-output (MIMO) technique can significantly increase data throughput without requiring additional bandwidth. However, data detection at the receiver and its VLSI implementation are challenging because of the high computational complexity. This paper presents the VLSI architecture and implementation of a 4×4 64-QAM soft-output K-Best MIMO detector. A novel deeply pipelined architecture that makes use of all the full-length ZF-augmented discarded paths (DPs) is designed to reduce complexity and improve BER performance. Furthermore, to save area and latency, two improvements are proposed: abandoning the DPs of the bottom levels and performing ZF-augmentation at the last stage. The presented detector improves the BER performance by 2.3 dB at BER = 10^-3 compared with the conventional soft K-Best scheme when using the minimum mean squared error sorted QR decomposition (MMSE-SQRD). It achieves a peak throughput of 855 Mbps while consuming 223K gates and 301 pJ/bit, with a latency of 102 cycles, in a SMIC 0.13 μm CMOS process.


IEEE Transactions on Circuits and Systems | 2015

Design and Implementation of Flexible Dual-Mode Soft-Output MIMO Detector With Channel Preprocessing

Zhiting Yan; Guanghui He; Yifan Ren; Weifeng He; Jianfei Jiang; Zhigang Mao

This paper proposes a flexible dual-mode soft-output multiple-input multiple-output (MIMO) detector that supports both open-loop and closed-loop operation in the Chinese enhanced ultra high throughput (EUHT) wireless local area network (LAN) standard. The proposed detector uses minimum mean square error (MMSE) sorted QR decomposition (MMSE-SQRD) for channel preprocessing, realized by a modified systolic array architecture with concurrent sorting. Moreover, the square-root MMSE algorithm adopted for closed-loop operation reuses the MMSE-SQRD preprocessing, which substantially reduces hardware overhead. In addition, an optimized K-Best detection algorithm is proposed for open-loop operation; it increases throughput through odd-even parallel sorting and produces high-quality soft output using discarded paths (DPs). A flexible VLSI architecture is designed for the proposed dual-mode detector, supporting 1×1 to 4×4 antenna configurations and BPSK to 64-QAM modulation. Implemented in SMIC 65 nm CMOS technology, the detector runs at 550 MHz and reaches a maximum throughput of 2.64 Gb/s for K-Best detection and 3.3 Gb/s for linear MMSE detection. The proposed detector is competitive with recently published works and meets the data-rate requirement of the EUHT standard.


international symposium on circuits and systems | 2014

Area and throughput efficient IDCT/IDST architecture for HEVC standard

Ziyou Yao; Weifeng He; Liang Hong; Guanghui He; Zhigang Mao

High Efficiency Video Coding (HEVC) is the new video coding standard that succeeds H.264/AVC. In this paper, an area- and throughput-efficient 2-D IDCT/IDST VLSI architecture for the HEVC standard is presented. By adopting the proposed data flow scheduling and a shared constant multiplication structure, the architecture supports variable block size IDCT from 4×4 to 32×32 pixels as well as 4×4 pixel IDST. Synthesis results in 65 nm technology show a maximum operating frequency of 500 MHz and a hardware cost of about 145.4K gates. Compared with previous work, the design achieves more than 50% reduction in hardware cost and a 66% improvement in throughput efficiency. Experimental results show that the proposed architecture can handle real-time HEVC IDCT/IDST of 4K×2K (4096×2048) @ 30 fps video sequences at 412 MHz on average. Consequently, it offers a cost-effective solution for future UHDTV applications.
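
To illustrate how constant multiplications can be shared in an inverse transform, here is a minimal sketch of the standard even-odd (butterfly) decomposition of the HEVC 4-point inverse DCT core, checked against a direct matrix product. It is not the paper's architecture. The coefficients 64, 83 and 36 are the HEVC 4-point transform constants; the function names are hypothetical, and the scaling shifts and clipping of a real decoder are deliberately left out.

```python
# HEVC 4-point forward DCT matrix (rows are basis functions).
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse_dct4_direct(coeffs):
    """Inverse transform as a plain transposed-matrix product (16 multiplies)."""
    return [sum(DCT4[k][n] * coeffs[k] for k in range(4)) for n in range(4)]


def inverse_dct4_butterfly(coeffs):
    """Even-odd decomposition: 6 multiplies, then adds and subtracts."""
    c0, c1, c2, c3 = coeffs
    even0 = 64 * c0 + 64 * c2   # even part uses coefficients 0 and 2
    even1 = 64 * c0 - 64 * c2
    odd0 = 83 * c1 + 36 * c3    # odd part uses coefficients 1 and 3
    odd1 = 36 * c1 - 83 * c3
    return [even0 + odd0, even1 + odd1, even1 - odd1, even0 - odd0]


sample = [1, 2, 3, 4]
butterfly_out = inverse_dct4_butterfly(sample)
direct_out = inverse_dct4_direct(sample)
```

Both functions return the same unscaled result. The butterfly form reuses each product twice, which is the kind of saving that a shared-multiplier hardware structure exploits.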


Integration | 2016

High performance parallel turbo decoder with configurable interleaving network for LTE application

Zhiting Yan; Guanghui He; Weifeng He; Shuaijie Wang; Zhigang Mao

In this paper, a high-performance parallel turbo decoder is designed to support all 188 block sizes in the 3rd Generation Partnership Project (3GPP) Long Term Evolution (LTE) standard. A novel configurable quadratic permutation polynomial (QPP) multistage network and address generator are proposed to reduce the complexity of interleaving. This 2^n-input network can be configured to support any 2^i-input (0 ≤ i ≤ n) network. Furthermore, it can flexibly support arbitrary contention-free interleavers by cascading an additional specially designed network. In addition, an optimized decoding schedule is presented to reduce the performance loss caused by high parallelism. The memory architecture and address mapping method are optimized to avoid memory access contention for small blocks. Moreover, a dual-mode add-compare-select (ACS) unit implementing both radix-2 and radix-4 recursion is proposed to support block sizes that are not divisible by 16. Implemented in 130 nm CMOS technology, the design achieves 384.3 Mbps peak throughput at a clock rate of 290 MHz with 5.5 iterations. Consuming 4.02 mm^2 of core area and 716 mW of power, the decoder has an architecture efficiency of 1.81 bits/cycle/iteration/mm^2 and an energy efficiency of 0.34 nJ/bit/iteration, which is competitive with other recent works.

Highlights: Configurable interleaving network supports any 2^i-input (0 ≤ i ≤ n) network. A new low-complexity address generator for interleaving. Optimized decoding schedule reduces performance loss caused by parallel turbo decoding. Configurable memory architecture is proposed to avoid memory access contention. Dual-mode ACS unit for both radix-2 and radix-4 processing. The proposed parallel turbo decoder supports all 188 block sizes in LTE.
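
The two efficiency figures quoted in this abstract can be recomputed from the reported throughput, clock rate, iteration count, area and power. The sketch below assumes the formulas implied by the units (bits/cycle/iteration/mm^2 and nJ/bit/iteration); the abstract does not state them explicitly, and the function names are hypothetical.

```python
def architecture_efficiency(throughput_mbps, clock_mhz, iterations, area_mm2):
    """Bits decoded per clock cycle, normalised per iteration and per mm^2."""
    bits_per_cycle = throughput_mbps / clock_mhz
    return bits_per_cycle * iterations / area_mm2


def energy_efficiency_nj(power_mw, throughput_mbps, iterations):
    """Energy per decoded bit per iteration; mW / Mbps equals nJ per bit."""
    return power_mw / (throughput_mbps * iterations)


# Values reported in the abstract.
arch_eff = architecture_efficiency(384.3, 290.0, 5.5, 4.02)
energy_eff = energy_efficiency_nj(716.0, 384.3, 5.5)
```

Under these assumed formulas the results are about 1.81 and 0.34, matching the published figures.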


international symposium on circuits and systems | 2012

High-throughput sorted MMSE QR decomposition for MIMO detection

Yifan Ren; Guanghui He; Jun Ma

The sorted QR decomposition (SQRD) has become a critical prerequisite for non-linear detection algorithms such as K-Best and sphere decoding in multiple-input multiple-output (MIMO) systems. However, because of the sorting and norm-updating procedures, SQRD implementations lack efficiency and parallelism, which makes high throughput difficult to achieve. In this paper, an efficient VLSI implementation combining a modified array architecture with sorting operations is proposed to increase parallel processing capability. In addition, a novel sorting look-ahead updating scheme is employed to advance the sorting operations, which reduces the processing latency. Moreover, the ℓ1-norm is adopted instead of the original ℓ2-norm to further reduce hardware complexity. The proposed SQRD preprocessor, implemented in SMIC 0.13 μm CMOS technology, achieves a throughput of up to 25×10^6 SQRDs per second, which outperforms other works with equal functionality.
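
The ℓ1-for-ℓ2 substitution can be illustrated on the column-ordering step alone. This is a minimal sketch under stated assumptions: the ℓ1-norm of a complex column is taken as the sum of |real| + |imaginary| parts (a common multiplier-free choice, though the abstract does not define it), the channel columns are made-up toy values, and the function names are hypothetical. It covers only the initial sort, not the full decomposition.

```python
def l2_norm_squared(column):
    """Squared Euclidean norm: needs two multiplications per complex entry."""
    return sum(z.real ** 2 + z.imag ** 2 for z in column)


def l1_norm(column):
    """Assumed l1 variant: sum of |real| + |imag|, with no multiplications."""
    return sum(abs(z.real) + abs(z.imag) for z in column)


def column_order(columns, norm):
    """Return column indices sorted by ascending norm."""
    return sorted(range(len(columns)), key=lambda j: norm(columns[j]))


# Toy channel matrix stored column by column (illustrative values only).
columns = [
    [3 + 0j, 0j],        # l2^2 = 9.0, l1 = 3.0
    [1 + 1j, 1 + 1j],    # l2^2 = 4.0, l1 = 4.0
    [0.5j, 0.5 + 0j],    # l2^2 = 0.5, l1 = 1.0
]
order_l2 = column_order(columns, l2_norm_squared)
order_l1 = column_order(columns, l1_norm)
```

Here the two norms agree on the weakest column but swap the other two, which shows that the ℓ1 ordering is an approximation of the ℓ2 ordering, traded for simpler arithmetic.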


CCF National Conference on Computer Engineering and Technology | 2012

DC Offset Mismatch Calibration for Time-Interleaved ADCs in High-Speed OFDM Receivers

Yulong Zheng; Zhiting Yan; Jun Ma; Guanghui He

Zero intermediate frequency (zero-IF) receivers with two analog-to-digital converters (ADCs) in the in-phase and quadrature (IQ) branches are widely used in emerging multi-gigabit wireless orthogonal frequency division multiplexing (OFDM) systems. Because ordinary ADCs cannot meet the sampling-rate demands of such systems, two time-interleaved ADCs (TI-ADCs) are an attractive alternative for raising the sampling speed of the receiver. However, without calibration, mismatches among the parallel sub-ADCs can degrade performance significantly. Targeting the DC offset mismatch of the TI-ADCs, this paper proposes a calibration algorithm based on decorrelation least-mean-squares (LMS) and recursive-least-squares (RLS) that uses the comb-type pilots in the OFDM frame and can calibrate the two TI-ADCs in the IQ branches simultaneously. The calibration algorithm converges quickly. Simulation results show that the proposed algorithm improves BER performance.
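
For intuition about DC offset mismatch in a time-interleaved ADC, the following sketch tracks one offset estimate per sub-ADC with a plain first-order LMS update. It is a deliberate simplification and not the paper's pilot-based decorrelation LMS/RLS algorithm: it assumes a zero-mean input, uses a made-up sinusoidal test signal and offsets, and all names are hypothetical.

```python
import math


def estimate_dc_offsets(samples, num_sub_adcs, mu=0.01):
    """Track one DC offset estimate per sub-ADC with a first-order LMS update."""
    offsets = [0.0] * num_sub_adcs
    for n, x in enumerate(samples):
        m = n % num_sub_adcs       # sub-ADC that produced sample n
        error = x - offsets[m]     # residual after removing the current estimate
        offsets[m] += mu * error   # small step toward that channel's mean
    return offsets


def remove_offsets(samples, offsets):
    """Subtract each sub-ADC's estimated offset from its own samples."""
    count = len(offsets)
    return [x - offsets[n % count] for n, x in enumerate(samples)]


# Toy setup: two sub-ADCs with different offsets sampling a zero-mean tone.
true_offsets = [0.3, -0.2]
signal = [math.sin(2 * math.pi * 0.123 * n) for n in range(20000)]
received = [s + true_offsets[n % 2] for n, s in enumerate(signal)]
estimated = estimate_dc_offsets(received, num_sub_adcs=2)
calibrated = remove_offsets(received, estimated)
```

A smaller step size `mu` reduces the residual ripple in the estimates at the cost of slower convergence, which is the usual LMS trade-off.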


international symposium on circuits and systems | 2011

Effective multi-standard macroblock prediction VLSI design for reconfigurable multimedia systems

Yuliang Tao; Guanghui He; Weifeng He; Qin Wang; Jun Ma; Zhigang Mao

Reconfigurable computing arrays combine flexibility with high performance for regular, computation-intensive algorithms in multimedia processing. However, the efficiency of irregular, control-intensive algorithms becomes the performance bottleneck of reconfigurable multimedia systems. In this paper, we propose the design and VLSI implementation of a novel memory-efficient macroblock prediction and boundary strength (Bs) calculation engine. The control-intensive algorithms, including intra mode prediction, motion vector prediction, and Bs calculation, are implemented with a 4×4 block-level pipeline to achieve real-time decoding for the H.264/AVC High profile and the Chinese AVS Jizhun profile. Compared with existing designs, our design achieves a 60% reduction in the registers needed for neighboring block load and update. Implementation results indicate that the proposed architecture can support 1920×1088 @ 30 fps H.264 and AVS decoding at 86 MHz.

Collaboration

Top Co-Authors

Jun Ma, Shanghai Jiao Tong University
Zhigang Mao, Shanghai Jiao Tong University
Weifeng He, Shanghai Jiao Tong University
Jiangpeng Li, Shanghai Jiao Tong University
Zhiting Yan, Shanghai Jiao Tong University
Xi Chen, Shanghai Jiao Tong University
Wei Jin, Shanghai Jiao Tong University
Chengzhi Tian, Shanghai Jiao Tong University
Liang Hong, Shanghai Jiao Tong University
Qin Wang, Shanghai Jiao Tong University