Cang Liu
National University of Defense Technology
Publications
Featured research published by Cang Liu.
IEEE Communications Letters | 2016
Chuan Tang; Cang Liu; Luechao Yuan; Zuocheng Xing
Currently, massive multiple-input multiple-output (MIMO) is one of the most promising wireless transmission technologies for 5G. Massive MIMO requires large-scale matrix computations, especially matrix inversion. In this letter, we find that matrix inversion based on Newton iteration (NI) is suitable for data detection in massive MIMO systems. In contrast with the recently proposed polynomial expansion (PE) method for matrix inversion, we analyze both algorithm complexity and precision in detail, and propose a diagonal band Newton iteration (DBNI) method, an approximate variant of NI. Compared with the PE method, DBNI obtains higher precision at approximately equal complexity, and we explain how to select the bandwidth of DBNI.
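A minimal sketch of the ingredients named above, assuming the diagonally dominant Gram matrices typical of massive-MIMO detection: the Newton iteration X_{k+1} = X_k(2I - AX_k) with an optional band mask applied to each iterate. The band-truncation step is one reading of the diagonal-band idea, not necessarily the paper's exact DBNI scheme.

```python
import numpy as np

def newton_inverse(A, iters=4, bandwidth=None):
    """Approximate A^{-1} by Newton iteration X <- X (2I - A X),
    which converges quadratically when ||I - A X0|| < 1.
    If `bandwidth` is given, off-band entries of each iterate are
    zeroed -- a guess at DBNI-style band truncation, not the
    paper's exact scheme.
    """
    n = A.shape[0]
    X = np.diag(1.0 / np.diag(A))  # diagonal initializer for dominant A
    if bandwidth is not None:
        i, j = np.indices((n, n))
        band = np.abs(i - j) <= bandwidth
    for _ in range(iters):
        X = X @ (2.0 * np.eye(n) - A @ X)
        if bandwidth is not None:
            X = np.where(band, X, 0.0)  # keep only the diagonal band
    return X

# Toy check on a massive-MIMO-like Gram matrix (M=64 antennas, K=8 users)
rng = np.random.default_rng(0)
H = rng.standard_normal((64, 8))
A = H.T @ H                       # diagonally dominant when M >> K
print(np.linalg.norm(newton_inverse(A, iters=6) @ A - np.eye(8)))
```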
IEEE Transactions on Circuits and Systems II: Express Briefs | 2017
Cang Liu; Zuocheng Xing; Luechao Yuan; Chuan Tang; Yang Zhang
QR decomposition (QRD) is one of the performance bottlenecks in many high-performance wireless communication algorithms and should be flexible for future multiple-input multiple-output (MIMO) systems. However, the existing QRD architectures focus only on matrices of a few fixed dimensions. The parallel tiled QRD algorithm is a natural choice for implementing QRD because of its flexibility and modularity. In this brief, the tile size is set to 2 × 2, instead of the traditional 200 × 200 or more, to support flexible antenna configurations. Using a look-ahead technique and the properties of unitary matrices, a novel algorithm based on the modified Gram-Schmidt (MGS) algorithm is proposed for the bottleneck operations (GEQRT and TTQRT) of the parallel tiled QRD algorithm. A corresponding hardware architecture is designed around the proposed algorithm. The implementation results show that the hardware architecture based on the proposed algorithm achieves a 2.7× reduction in normalized processing latency compared with one based on the traditional MGS algorithm.
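For orientation, the sketch below enumerates a dependency-ordered task list for tiled QRD. GEQRT and TTQRT are the kernels named in the abstract; the flat reduction tree and the PLASMA-style update kernels (UNMQR, TTMQR) are conventional assumptions, not the brief's actual look-ahead schedule.

```python
def tiled_qrd_schedule(p, q):
    """Task order for tiled QRD on a p x q grid of tiles: factor each
    tile of the current panel (GEQRT), apply its reflectors to the
    tiles on its right (UNMQR), then merge the panel's triangles into
    the diagonal tile (TTQRT) with matching trailing updates (TTMQR).
    A flat reduction tree is assumed for readability.
    """
    ops = []
    for k in range(min(p, q)):
        for i in range(k, p):
            ops.append(("GEQRT", i, k))
            for j in range(k + 1, q):
                ops.append(("UNMQR", i, j))
        for i in range(k + 1, p):
            ops.append(("TTQRT", k, i))
            for j in range(k + 1, q):
                ops.append(("TTMQR", i, j))
    return ops

for op in tiled_qrd_schedule(3, 3):  # e.g. a 6x6 matrix in 2x2 tiles
    print(op)
```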
IET Circuits, Devices & Systems | 2016
Cang Liu; Chuan Tang; Luechao Yuan; Zuocheng Xing; Yang Zhang
QR decomposition is extensively adopted in multiple-input multiple-output orthogonal frequency-division multiplexing wireless communication systems, and is one of the performance bottlenecks in many high-performance wireless communication algorithms. To implement low-latency QR decomposition in hardware, the authors propose a novel iterative look-ahead modified Gram-Schmidt (ILMGS) algorithm based on the traditional modified Gram-Schmidt (MGS) algorithm. They also design the corresponding triangular systolic array (TSA) architecture for the proposed ILMGS algorithm, which needs only n time slots for an n × n real matrix. To reduce the hardware overhead, they modify the TSA architecture into an iterative architecture, and they design a further modified iterative architecture to reduce the overhead again. The implementation results show that the normalised processing latency of the modified iterative architecture based on the proposed ILMGS algorithm is 1.36 times lower than that of one based on the MGS algorithm. To the best of the authors' knowledge, the designed architecture achieves better latency performance than existing designs.
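Both this paper and the brief above start from the modified Gram-Schmidt kernel; a minimal reference implementation is shown below (the plain MGS baseline, not the ILMGS variant, whose look-ahead reordering is not reproduced here).

```python
import numpy as np

def mgs_qr(A):
    """QR factorization by modified Gram-Schmidt: orthogonalize the
    columns one at a time, updating the trailing columns immediately
    (the numerically stabler variant of classical Gram-Schmidt)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])   # column norm -> diagonal of R
        Q[:, k] = A[:, k] / R[k, k]         # normalize
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ A[:, j]     # projection coefficient
            A[:, j] -= R[k, j] * Q[:, k]    # immediate trailing update
    return Q, R

A = np.random.default_rng(1).standard_normal((4, 4))
Q, R = mgs_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))
```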
International Symposium on Communications and Information Technologies | 2014
Yang Zhang; Zuocheng Xing; Luechao Yuan; Cang Liu; Qinglin Wang
In this paper, a new implementation of a 3GPP LTE standard-compliant turbo decoder based on a GPGPU is proposed. It uses the Tesla K20c GPU, based on the Kepler GK110 architecture. The new architecture has more powerful parallel computing capability, and we use it to fully exploit the parallelism in the turbo decoding algorithm in novel ways. Meanwhile, we use the various memory hierarchies to meet different data demands on speed and capacity. Simulation shows that our implementation is practical and achieves a 76% throughput improvement over the latest GPU implementation. The result demonstrates that the Kepler architecture is suitable for turbo decoding and can be a promising reconfigurable platform for communication systems.
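The abstract does not spell out the decoder's inner kernel; for context, log-MAP turbo decoding is built around the max* primitive below, and throughput-oriented GPU decoders usually drop its correction term (max-log-MAP). This is textbook background, not the paper's code.

```python
import numpy as np

def max_star(a, b):
    """log(e^a + e^b) = max(a, b) + log(1 + e^{-|a-b|}).
    The forward/backward (BCJR) recursions of log-MAP decoding reduce
    to chains of this operation; omitting the log1p correction gives
    the cheaper max-log-MAP approximation."""
    return np.maximum(a, b) + np.log1p(np.exp(-np.abs(a - b)))

print(max_star(1.0, 1.0))   # log(2e) ~ 1.693
print(max(1.0, 1.0))        # max-log-MAP approximation: 1.0
```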
IEEE Transactions on Very Large Scale Integration Systems | 2017
Cang Liu; Chuan Tang; Zuocheng Xing; Luechao Yuan; Yang Zhang
QR decomposition (QRD) has become a vital component in the transceiver processing of future multiple-input multiple-output (MIMO) systems, in which the antenna configuration will be increasingly flexible. The QRD hardware architecture in future MIMO systems should therefore be flexible enough to handle various antenna configurations. Unfortunately, the existing QRD hardware architectures mainly target matrices of one or several fixed sizes. This paper presents a new triangular systolic array QRD hardware architecture based on the parallel tiled QRD algorithm to decompose an 8 × 8 real matrix. The designed hardware architecture is flexible and can be used in various MIMO systems in which the number of antennas is no larger than four. This paper also proposes a modified algorithm for the bottleneck operations of the parallel tiled QRD algorithm to reduce the hardware overhead. To further reduce the hardware overhead, the Newton–Raphson algorithm is adopted in the proposed algorithm. The implementation results show that both the normalized processing latency and the normalized processing efficiency of the designed QRD hardware architecture are better than those of most existing QRD hardware architectures. To the best of our knowledge, the hardware architecture presented in this paper achieves superior normalized QRD rate performance compared with the existing QRD hardware architectures.
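The Newton–Raphson step mentioned above plausibly replaces the divide and square root needed to normalize columns in MGS-style QRD; a generic sketch of the reciprocal-square-root iteration follows. The seed y0 is assumed to come from a coarse source such as a small lookup table, since the paper's datapath is not shown here.

```python
def rsqrt_newton(x, y0, iters=3):
    """Newton-Raphson for y = 1/sqrt(x): y <- y * (3 - x * y^2) / 2.
    In a QRD datapath this turns column normalization into multiplies
    (a column times the reciprocal root of its squared norm)."""
    y = y0
    for _ in range(iters):
        y = y * (3.0 - x * y * y) / 2.0
    return y

print(rsqrt_newton(2.0, y0=0.7))   # ~0.70711, i.e. 1/sqrt(2)
```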
Future Generation Computer Systems | 2017
Yang Zhang; Zuocheng Xing; Cang Liu; Chuan Tang; Qinglin Wang
CCF National Conference on Computer Engineering and Technology | 2015
Chuan Tang; Cang Liu; Luechao Yuan; Zuocheng Xing
Journal of Zhejiang University Science C | 2018
Yang Zhang; Zuocheng Xing; Cang Liu; Chuan Tang
As the need for high-performance computing continues to grow, it becomes increasingly urgent to design massive multi-core processors with high throughput and efficiency. However, as the number of cores keeps increasing, the capacity of on-chip memory is always insufficient. In a multi-core processor such as a GPGPU (general-purpose graphics processing unit), dozens or hundreds of streaming multiprocessors (SMs) coordinate to attain high throughput with only a few megabytes of on-chip memory. Furthermore, within one SM, thousands of threads are organized as thread blocks and process instructions in a SIMT (single instruction, multiple threads) manner. Because all the threads share the same on-chip memory, the mismatch between the large core count and the small on-chip memory capacity can easily impair performance through excessive thread contention for cache resources. An efficient thread scheduling method is a promising way to alleviate these problems and boost performance. From the hardware perspective, instructions are executed by warps, each made up of a fixed number of threads, so we propose a novel warp scheduling scheme to maintain data locality and to relieve cache pollution and thrashing. First, to make full use of temporal locality, we put the disordered warps into a supervised warp queue and issue the warps from oldest to youngest. To exploit spatial locality and to hide computation-unit stalls, we put forward a new insertion method called LPI (locality-protected insertion) that reorders warps in the supervised warp queue to better hide long-latency warps behind short-latency warps such as ALU operations and on-chip accesses. Over a wide variety of applications, the new scheduling method gains up to 10.1% and an average of 2.2% improvement over the baseline loose round-robin scheduling.
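A toy, software-level model of the described policy: warps issue oldest-first from a supervised queue, and a ready short-latency warp is slipped ahead of a stalled long-latency one. The Warp fields and the exact reordering rule are illustrative guesses, not the paper's LPI algorithm.

```python
from collections import deque

class Warp:
    def __init__(self, wid, age, long_latency):
        self.wid = wid
        self.age = age                    # fetch order: smaller = older
        self.long_latency = long_latency  # stalled on an off-chip access?

def issue_order(warps):
    """Issue oldest-first; before a long-latency warp, slip in the
    oldest ready short-latency warp (ALU op / on-chip access) so its
    work covers the stall -- an LPI-style reordering sketch."""
    queue = deque(sorted(warps, key=lambda w: w.age))
    order = []
    while queue:
        w = queue.popleft()
        if w.long_latency:
            for i, cand in enumerate(queue):
                if not cand.long_latency:
                    del queue[i]
                    order.append(cand)    # short warp hides the stall
                    break
        order.append(w)
    return [w.wid for w in order]

warps = [Warp(0, 0, True), Warp(1, 1, False), Warp(2, 2, True), Warp(3, 3, False)]
print(issue_order(warps))  # [1, 0, 3, 2]
```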
IET Computers & Digital Techniques | 2017
Yang Zhang; Zuocheng Xing; Chuan Tang; Cang Liu
As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in recent supercomputers. It adopts a large number of threads to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory per streaming multiprocessor (SM). The GPU cache is inefficient owing to a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long access latencies because of poor warp scheduling. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) that combines the reuse information with time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
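A toy rendering of the locality-protected allocation idea: reuse counts keyed by the allocating PC, with eviction preferring the line whose PC has shown the least reuse, breaking ties by age. This sketches the stated mechanism under our own simplifications, not the PCAU hardware.

```python
class Line:
    def __init__(self, tag, pc, t):
        self.tag, self.pc, self.t = tag, pc, t

class LPCSet:
    """One cache set with PC-indexed reuse counters."""
    def __init__(self, ways):
        self.ways, self.lines, self.clock = ways, [], 0
        self.reuse = {}                      # PC -> observed reuse count

    def access(self, tag, pc):
        self.clock += 1
        for ln in self.lines:
            if ln.tag == tag:                # hit: credit the allocating PC
                self.reuse[ln.pc] = self.reuse.get(ln.pc, 0) + 1
                ln.t = self.clock
                return True
        if len(self.lines) >= self.ways:     # miss: evict least reuse, oldest
            victim = min(self.lines,
                         key=lambda ln: (self.reuse.get(ln.pc, 0), ln.t))
            self.lines.remove(victim)
        self.lines.append(Line(tag, pc, self.clock))
        self.reuse.setdefault(pc, 0)
        return False

s = LPCSet(ways=2)
for tag, pc in [(1, 0xA), (2, 0xB), (1, 0xA), (3, 0xC)]:
    print(tag, s.access(tag, pc))            # line 2 (no reuse) is evicted
```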
IET Communications | 2017
Chuan Tang; Cang Liu; Luechao Yuan; Zuocheng Xing
Currently, 5G is a research hotspot in the communications field, and one of the most promising wireless transmission technologies for 5G is massive multiple-input multiple-output (MIMO), which provides high data rates and energy efficiency. The main challenge of massive MIMO is channel estimation, owing to its computational complexity and pilot contamination. This paper reviews improvements to traditional channel estimation methods that address these problems in massive MIMO. Hardware acceleration is also useful for massive MIMO channel estimation algorithms, so we discuss related work on hardware accelerators for matrix inversion and singular value decomposition, the main complex operations in channel estimation. We find that the memory system, the network of processing elements, and numerical precision will be the main research directions for hardware design at this data scale.
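As a concrete instance of the matrix inversion the survey flags as a hardware bottleneck, a minimal pilot-based MMSE channel estimate in NumPy; the dimensions, pilot design, and i.i.d. channel model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, T = 16, 4, 8            # BS antennas, users, pilot length
sigma2 = 0.1                  # noise variance

P = rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T))  # pilots
H = (rng.standard_normal((M, K))
     + 1j * rng.standard_normal((M, K))) / np.sqrt(2)  # CN(0,1) channel
N = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, T))
                           + 1j * rng.standard_normal((M, T)))
Y = H @ P + N                 # received pilot block

# MMSE estimate (unit-variance i.i.d. prior on H):
#   H_hat = Y P^H (P P^H + sigma^2 I)^{-1}  -- the K x K inversion is
#   the operation hardware accelerators target at massive-MIMO scale.
H_hat = Y @ P.conj().T @ np.linalg.inv(P @ P.conj().T + sigma2 * np.eye(K))
print(np.linalg.norm(H_hat - H) / np.linalg.norm(H))
```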