Zuocheng Xing
National University of Defense Technology
Publications
Featured research published by Zuocheng Xing.
IEEE Transactions on Circuits and Systems II: Express Briefs | 2017
Cang Liu; Zuocheng Xing; Luechao Yuan; Chuan Tang; Yang Zhang
QR decomposition (QRD) is a performance bottleneck in many high-performance wireless communication algorithms and must be flexible for future multiple-input multiple-output systems. However, existing QRD architectures focus only on a few fixed matrix dimensions. The parallel tiled QRD algorithm is well suited to implementing QRD because of its flexibility and modularity. In this brief, the tile size is set to 2 × 2 instead of the traditional 200 × 200 or larger to support flexible antenna configurations. Using a look-ahead technique and the properties of unitary matrices, a novel algorithm based on the modified Gram-Schmidt (MGS) algorithm is proposed for the bottleneck operations (GEQRT and TTQRT) of the parallel tiled QRD algorithm. A corresponding hardware architecture is also designed for the proposed algorithm. The implementation results show that the hardware architecture based on the proposed algorithm achieves a 2.7× reduction in normalized processing latency compared with the one based on the traditional MGS algorithm.
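The brief itself describes a hardware architecture rather than software; purely as an illustration of the modified Gram-Schmidt building block applied to a small tile, the following C++ sketch factors a complex 2 × 2 tile into Q and R. The data layout, function name, and test values are assumptions for illustration, not the paper's design.

```cpp
// Minimal sketch: modified Gram-Schmidt QR of a 2x2 complex tile.
// Tile size, layout, and names are illustrative assumptions only.
#include <array>
#include <cmath>
#include <complex>
#include <iostream>

using cd = std::complex<double>;
using Tile = std::array<std::array<cd, 2>, 2>;  // row-major 2x2 tile

void mgs_qr(const Tile& A, Tile& Q, Tile& R) {
    // Column 0: normalize.
    R[0][0] = std::sqrt(std::norm(A[0][0]) + std::norm(A[1][0]));
    Q[0][0] = A[0][0] / R[0][0];
    Q[1][0] = A[1][0] / R[0][0];
    // Column 1: project out column 0 (MGS update), then normalize.
    R[0][1] = std::conj(Q[0][0]) * A[0][1] + std::conj(Q[1][0]) * A[1][1];
    cd v0 = A[0][1] - R[0][1] * Q[0][0];
    cd v1 = A[1][1] - R[0][1] * Q[1][0];
    R[1][0] = 0.0;
    R[1][1] = std::sqrt(std::norm(v0) + std::norm(v1));
    Q[0][1] = v0 / R[1][1];
    Q[1][1] = v1 / R[1][1];
}

int main() {
    Tile A = {{{cd(1, 1), cd(2, -1)}, {cd(0, 2), cd(1, 0)}}};
    Tile Q{}, R{};
    mgs_qr(A, Q, R);
    // Check: the (0,0) entry of Q*R should reproduce A[0][0].
    std::cout << Q[0][0] * R[0][0] + Q[0][1] * R[1][0] << "\n";
}
```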
IEEE International Conference on Progress in Informatics and Computing | 2014
Qinglin Wang; Jie Liu; Xiantuo Tang; Feng Wang; Guitao Fu; Zuocheng Xing
The Embarrassingly Parallel (EP) algorithm, which is typical of many Monte Carlo applications, provides an estimate of the upper achievable limits for double-precision performance of parallel supercomputers. Recently, Intel released the Many Integrated Core (MIC) architecture as a many-core co-processor. MIC typically offers more than 50 cores, each of which can run four hardware threads and supports 512-bit vector instructions. In this paper, we describe how the EP algorithm is accelerated effectively on platforms containing MIC using the offload execution model. The results show that an efficient implementation of the EP algorithm on MIC can take full advantage of the MIC's computational resources and achieves a speedup of 3.06 over the Intel Xeon E5-2670 CPU. Based on the EP algorithm on MIC and an effective task distribution model, the implementation of the EP algorithm on a CPU-MIC heterogeneous platform achieves up to 2134.86 Mop/s and a 4.04× speedup over the Intel Xeon E5-2670 CPU.
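As a rough sketch of the thread-parallel structure of an EP-style kernel (not the paper's code), the C++/OpenMP snippet below counts accepted Gaussian-deviate pairs from the Marsaglia polar method. The offload of this loop to the MIC coprocessor and its vectorization, which the paper relies on, are omitted, and the seeds and sample count are arbitrary.

```cpp
// Illustrative EP-style Monte Carlo kernel: count Marsaglia-polar pairs
// accepted inside the unit circle, one independent RNG per thread.
#include <cstdio>
#include <omp.h>
#include <random>

long long count_accepted_pairs(long long samples) {
    long long accepted = 0;
#pragma omp parallel reduction(+ : accepted)
    {
        // Per-thread generator to avoid contention (seed is arbitrary).
        std::mt19937_64 rng(12345u + 977u * omp_get_thread_num());
        std::uniform_real_distribution<double> uni(-1.0, 1.0);
#pragma omp for
        for (long long i = 0; i < samples; ++i) {
            double x = uni(rng), y = uni(rng);
            double t = x * x + y * y;
            if (t > 0.0 && t <= 1.0) ++accepted;  // pair yields two Gaussians
        }
    }
    return accepted;
}

int main() {
    long long n = 1 << 24;
    std::printf("accepted %lld of %lld pairs\n", count_accepted_pairs(n), n);
}
```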
High Performance Computing Systems and Applications | 2014
Qinglin Wang; Zuocheng Xing; Jie Liu; Xiaogang Qiang; Chunye Gong; Jiang Jiang
Single-node computation speed is essential in large-scale parallel solutions of particle transport problems. The Intel Many Integrated Core (MIC) architecture supports more than 200 hardware threads as well as 512-bit double-precision floating-point vector operations. In this paper, we use the native model of MIC to parallelize the simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The implementation uses both the hardware threads and the vector units of MIC to efficiently exploit multi-level parallelism in the discrete ordinates method while maintaining good data locality. Our optimized implementation is verified on the target MIC and provides up to a 1.99× speedup over the original MPI code on an Intel Xeon E5-2660 CPU when flux fixup is off. Compared with a prior implementation on an NVIDIA Tesla M2050 GPU, a speedup of up to 1.23× is obtained. In addition, the differences between the MIC and GPU implementations are discussed as well.
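The Sweep3D code itself is not reproduced here; the C++/OpenMP sketch below only illustrates the wavefront parallelism that a discrete ordinates sweep exposes, where cells on the same diagonal plane i + j + k = d are independent. The grid size and the cell update are placeholders, and the real angular loops, vectorization, and flux fixup are omitted.

```cpp
// Conceptual wavefront sweep: cells on a diagonal plane i+j+k = d have no
// sweep dependence on each other, so threads update them in parallel.
#include <vector>

int main() {
    const int N = 64;                              // assumed cubic grid
    std::vector<double> flux(N * N * N, 0.0);
    auto idx = [N](int i, int j, int k) { return (i * N + j) * N + k; };

    for (int d = 0; d <= 3 * (N - 1); ++d) {       // advance one wavefront
#pragma omp parallel for collapse(2) schedule(static)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                int k = d - i - j;
                if (k < 0 || k >= N) continue;     // cell not on this plane
                double up_i = (i > 0) ? flux[idx(i - 1, j, k)] : 0.0;
                double up_j = (j > 0) ? flux[idx(i, j - 1, k)] : 0.0;
                double up_k = (k > 0) ? flux[idx(i, j, k - 1)] : 0.0;
                // Placeholder standing in for the real transport kernel.
                flux[idx(i, j, k)] = 0.25 * (1.0 + up_i + up_j + up_k);
            }
    }
    return 0;
}
```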
Future Generation Computer Systems | 2017
Yang Zhang; Zuocheng Xing; Cang Liu; Chuan Tang; Qinglin Wang
As the need for high-performance computing continues to grow, it becomes more and more urgent to design massive multi-core processors with high throughput and efficiency. However, as the number of cores keeps increasing, the capacity of on-chip memory is always insufficient. In a multi-core processor such as a GPGPU (General-Purpose Graphics Processing Unit), dozens or hundreds of SMs (Streaming Multiprocessors) coordinate to achieve high throughput with only several MB of on-chip memory. Furthermore, within one SM, thousands of threads are organized as thread blocks that process instructions in a SIMT (Single Instruction Multiple Threads) manner. Because all the threads share the same on-chip memory, the mismatch between the large core count and the small on-chip memory capacity can easily impair performance due to excessive thread contention for cache resources. An efficient thread scheduling method is a promising way to alleviate these problems and to boost performance. From the hardware perspective, instructions are executed by warps, each made up of a fixed number of threads. We therefore propose a novel warp scheduling scheme to maintain data locality and to relieve cache pollution and thrashing. First, to make full use of temporal locality, we put the disordered warps into a supervised warp queue and issue the warps from oldest to youngest. To exploit spatial locality and to hide computation unit stalls, we put forward a new insertion method called LPI (Locality Protected Insertion) that reorders warps in the supervised warp queue so that long-latency warps are better hidden behind short-latency warps, such as those performing ALU operations and on-chip accesses. Over a wide variety of applications, the new scheduling method achieves up to a 10.1% and an average 2.2% improvement over the baseline loose round-robin scheduling.
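As a conceptual software model of the described scheduling policy (not the hardware implementation), the C++ sketch below keeps warps in a supervised queue ordered oldest-first and uses an LPI-style reinsertion that places a long-latency warp behind a small window of short-latency warps. The data structure, field names, and window size are assumptions for illustration.

```cpp
// Software model of the scheduling idea: oldest-first issue from a
// supervised queue, plus reinsertion of a long-latency warp behind a
// window of short-latency warps so ALU/on-chip work can hide its latency.
#include <deque>
#include <iostream>

struct Warp {
    int id;
    int age;             // larger = older
    bool long_latency;   // next operation goes off-chip
};

class SupervisedQueue {
public:
    void push(const Warp& w) {
        // Keep the queue ordered oldest-first (front = oldest).
        auto it = q_.begin();
        while (it != q_.end() && it->age >= w.age) ++it;
        q_.insert(it, w);
    }
    // LPI-style reinsertion: place a long-latency warp behind up to
    // `window` short-latency warps instead of at its age-ordered slot.
    void reinsert_lpi(Warp w, int window = 2) {
        auto it = q_.begin();
        for (int skipped = 0; it != q_.end() && skipped < window; ++it)
            if (!it->long_latency) ++skipped;
        q_.insert(it, w);
    }
    bool empty() const { return q_.empty(); }
    Warp issue() { Warp w = q_.front(); q_.pop_front(); return w; }
private:
    std::deque<Warp> q_;
};

int main() {
    SupervisedQueue sq;
    sq.push({0, 5, true});
    sq.push({1, 4, false});
    sq.push({2, 3, false});
    Warp w = sq.issue();          // oldest warp (id 0) starts a load
    sq.reinsert_lpi(w);           // hide it behind two short-latency warps
    while (!sq.empty()) std::cout << sq.issue().id << " ";  // prints: 1 2 0
    std::cout << "\n";
}
```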
Journal of Electrical and Computer Engineering | 2015
Feng Wang; Xiantuo Tang; Zuocheng Xing
Network-on-Chip (NoC) is one of the critical communication architectures for future many-core systems. As technology continues to scale down, on-chip networks face a growing leakage power problem. As a leakage power mitigation technique, power-gating can be applied to the on-chip network to address this problem. However, network performance is severely degraded by disconnection in a conventional power-gated NoC. In this paper, we propose a novel partial power-gating approach to improve performance in a power-gated NoC. The approach mainly involves a direction-slicing scheme, an improved routing algorithm, and a deadlock recovery mechanism. In synthetic traffic simulations, the proposed design shows favorable power efficiency in the low-load range and achieves better performance than the conventional power-gated design. In application trace simulations, the design in the mesh/torus network consumes 15.2%/18.9% more power on average, while obtaining an average 45.0%/28.7% performance improvement over the conventional power-gated design. On balance, the proposed design with partial power-gating offers a better tradeoff between performance and power efficiency.
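The paper's routing algorithm and deadlock recovery mechanism are not spelled out in the abstract; the heavily simplified C++ sketch below only illustrates the general idea of per-direction (sliced) power-gating, where a router prefers a productive output direction whose port is already powered on. The port naming, preference rule, and fallback are assumptions, not the proposed design.

```cpp
// Highly simplified model: each output port of a mesh router is gated
// independently, and route computation prefers an awake productive port.
#include <array>
#include <iostream>

enum Dir { EAST, WEST, NORTH, SOUTH, LOCAL, NUM_DIRS };

struct Router {
    std::array<bool, NUM_DIRS> port_on{};  // true = powered on

    // Choose an output port for a packet that is (dx, dy) hops away.
    Dir route(int dx, int dy) const {
        Dir x = dx > 0 ? EAST : WEST;
        Dir y = dy > 0 ? NORTH : SOUTH;
        if (dx == 0 && dy == 0) return LOCAL;
        if (dx == 0) return y;
        if (dy == 0) return x;
        // Both directions are productive: prefer the one already awake.
        if (port_on[x]) return x;
        if (port_on[y]) return y;
        return x;  // both gated: fall back to X and trigger a wake-up
    }
};

int main() {
    Router r;
    r.port_on[NORTH] = true;               // only the north port is awake
    std::cout << r.route(2, 3) << "\n";    // prefers NORTH (prints 2)
}
```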
International Journal of Electronics | 2016
Feng Wang; Xiantuo Tang; Zuocheng Xing; Hengzhu Liu
Network-on-chip (NoC) is one of the critical communication architectures for the scaling of future many-core processors. The challenge for an on-chip network is to reduce design complexity, saving both area and power, while providing high performance in terms of low latency and high throughput. In particular, as network size increases, both design complexity and power consumption become bottlenecks that prevent proper network scaling. Moreover, as technology continuously scales down, leakage power takes up a larger fraction of total NoC power, so reducing it is increasingly important for a power-efficient NoC design. Power-gating, as a representative low-power technique, can be applied to an on-chip network to mitigate leakage power. In this paper, we propose a low-cost and low-power router architecture for the unidirectional torus network and adopt an improved corner buffer structure for unobtrusive power-gating with minimal impact on network performance. In addition, an explicit starvation avoidance mechanism is introduced to guarantee injection fairness while reducing its negative impact on network throughput. Simulation results with synthetic traffic show that our design improves network throughput by 11.3% on average and achieves significant power savings in the low- and medium-load regions. In SPLASH-2 workload simulations, our design saves on average 27.2% of total power compared to the baseline and reduces average latency by 42.8% compared to the power-gated baseline.
Pacific Rim Conference on Communications, Computers and Signal Processing | 2015
Qinglin Wang; Jie Liu; Xiantao Cui; Guitao Fu; Chunye Gong; Zuocheng Xing
The coupling of microwaves into apertures plays an important role in many electromagnetic physics and engineering fields. When the width of the apertures is very small, Finite Difference Time Domain (FDTD) simulation of the coupling is very time-consuming. As a many-core architecture, Intel's Many Integrated Core (MIC) architecture provides 512-bit vector units and more than 200 hardware threads. In this paper, we parallelize the FDTD simulation of microwave pulse coupling into narrow slots on the Intel MIC architecture. In the implementation, the OpenMP parallel programming model is used to exploit thread parallelism, while loop unrolling and SIMD intrinsic functions are used to accomplish vectorization. Compared with the serial version on an Intel Xeon E5-2670 CPU, the implementation on a 57-core MIC coprocessor obtains a speedup of 11.57×. The experimental results also demonstrate that the parallelization scales well. Additionally, we report how the binding of OpenMP threads to MIC hardware threads influences performance.
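As an illustration of the kind of loop being parallelized (not the paper's 3-D slot-coupling code), the C++/OpenMP sketch below updates a 1-D FDTD grid with parallel, vectorizable field-update loops. The grid size, coefficients, and source term are placeholders, and the MIC-specific intrinsics and loop unrolling are omitted.

```cpp
// Illustrative 1-D FDTD time stepping: thread-parallel, vectorizable
// E/H update loops standing in for the 3-D coupling kernel.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 16;
    const double ce = 0.5, ch = 0.5;        // assumed update coefficients
    std::vector<double> E(N, 0.0), H(N, 0.0);

    for (int step = 0; step < 1000; ++step) {
        E[N / 2] += std::sin(0.02 * step);  // simple soft source
        // Update H from the spatial difference of E.
#pragma omp parallel for simd
        for (int i = 0; i < N - 1; ++i)
            H[i] += ch * (E[i + 1] - E[i]);
        // Update E from the spatial difference of H.
#pragma omp parallel for simd
        for (int i = 1; i < N; ++i)
            E[i] += ce * (H[i] - H[i - 1]);
    }
    std::printf("E at probe: %g\n", E[N / 2 + 100]);
}
```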
Network-Based Information Systems | 2015
Qinglin Wang; Jie Liu; Chunye Gong; Yang Zhang; Zuocheng Xing
Fast numerical solutions of the Riesz fractional equation have a computational cost of O(NM log M), where M and N are the numbers of grid points and time steps, respectively. In this paper, we present a GPU-based fast solution for the Riesz space fractional equation. The solution, which builds on the FFT-based fast method and is implemented with the CUDA programming model, consists of parallel FFT, vector-vector addition, and vector-vector multiplication on the GPU. The experimental results show that the GPU-based fast solution agrees well with the exact solution. Compared with the known parallel fast solution on an 8-core Intel E5-2670 CPU, the overall speedup reaches 2.12× on an NVIDIA GTX650 GPU and 10.93× on an NVIDIA K20C GPU.
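The GPU code itself is not shown; the host-side C++ sketch below illustrates the O(M log M) building block behind FFT-based fast methods of this kind: a Toeplitz matrix-vector product computed via circulant embedding, an FFT, a pointwise multiplication, and an inverse FFT. On the GPU this pattern maps to a parallel FFT plus element-wise kernels, as the abstract describes; the tiny recursive FFT and the example matrix here are purely illustrative assumptions.

```cpp
// Host-side sketch: Toeplitz matrix-vector product via circulant
// embedding and FFT, the O(M log M) core of FFT-based fast methods.
#include <cmath>
#include <complex>
#include <iostream>
#include <vector>

using cd = std::complex<double>;

void fft(std::vector<cd>& a, bool inverse) {
    size_t n = a.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    fft(even, inverse);
    fft(odd, inverse);
    double sign = inverse ? 1.0 : -1.0;
    for (size_t k = 0; k < n / 2; ++k) {
        cd w = std::polar(1.0, sign * 2.0 * M_PI * k / n) * odd[k];
        a[k] = even[k] + w;
        a[k + n / 2] = even[k] - w;
    }
}

// y = T x for a symmetric Toeplitz matrix T given by its first column c.
std::vector<double> toeplitz_matvec(const std::vector<double>& c,
                                    const std::vector<double>& x) {
    size_t m = c.size(), n = 2 * m;            // assume m is a power of two
    std::vector<cd> ce(n, 0.0), xe(n, 0.0);
    for (size_t i = 0; i < m; ++i) ce[i] = c[i];
    for (size_t i = 1; i < m; ++i) ce[n - i] = c[i];   // circulant embedding
    for (size_t i = 0; i < m; ++i) xe[i] = x[i];
    fft(ce, false);
    fft(xe, false);
    for (size_t i = 0; i < n; ++i) ce[i] *= xe[i];     // pointwise multiply
    fft(ce, true);
    std::vector<double> y(m);
    for (size_t i = 0; i < m; ++i) y[i] = ce[i].real() / n;  // inverse scaling
    return y;
}

int main() {
    std::vector<double> c = {2.0, -1.0, 0.0, 0.0};     // Laplacian-like column
    std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
    for (double v : toeplitz_matvec(c, x)) std::cout << v << " ";
    std::cout << "\n";                                 // 0 0 0 5, up to rounding
}
```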
International Conference on Information Technology in Medicine and Education | 2015
Qinglin Wang; Jie Liu; Peizhen Xie; Chunye Gong; Yuan Li; Zuocheng Xing
Monte Carlo (MC) simulation plays an important role in dose calculation for radiotherapy treatment planning. Since the accuracy of MC simulation depends on the number of simulated particle histories, it is very time-consuming. The Intel Many Integrated Core (MIC) architecture, which consists of more than 50 cores and supports many parallel programming models, provides an efficient alternative for accelerating MC dose calculation. This paper implements the OpenMP-based MC Dose Planning Method (DPM) for radiotherapy treatment problems on the Intel MIC architecture. The implementation has been verified on the target MIC coprocessor with 57 cores. The results demonstrate that the OpenMP-based DPM implementation is highly accurate and achieves a maximum speedup of 10.53× compared with the original DPM implementation on an Intel Xeon E5-2670 CPU. Additionally, the speedup and efficiency of the implementation on different numbers of MIC cores are also reported.
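Only as a toy illustration of the parallel structure (independent particle histories distributed across OpenMP threads, accumulating into a shared dose grid), the C++ sketch below replaces DPM's transport physics with a trivial random walk; the geometry, distributions, and deposition rule are placeholders, not the DPM algorithm.

```cpp
// Toy sketch: independent particle histories across OpenMP threads,
// each depositing energy into a shared 1-D dose grid via atomic updates.
#include <cstdio>
#include <omp.h>
#include <random>
#include <vector>

int main() {
    const int cells = 128;
    const long long histories = 1 << 20;
    std::vector<double> dose(cells, 0.0);

#pragma omp parallel
    {
        std::mt19937_64 rng(7919u + 104729u * omp_get_thread_num());
        std::exponential_distribution<double> step(1.0);   // placeholder free path
        std::uniform_real_distribution<double> frac(0.1, 1.0);

#pragma omp for
        for (long long h = 0; h < histories; ++h) {
            double x = 0.0, energy = 1.0;
            while (energy > 1e-3) {
                x += step(rng);                         // move the particle
                int cell = static_cast<int>(x);
                if (cell >= cells) break;               // left the phantom
                double deposit = energy * frac(rng);    // deposit a fraction
#pragma omp atomic
                dose[cell] += deposit;
                energy -= deposit;
            }
        }
    }
    std::printf("dose in first cell: %g\n", dose[0]);
}
```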
International Conference on Electronics, Communications, and Computers | 2015
Feng Wang; Xiantuo Tang; Zuocheng Xing; Hengzhu Liu
Power consumption, design complexity, and area cost are limiting constraints in the design of interconnects for scalable many-core systems. To tackle the power and area concerns, we propose a lightweight unidirectional-channel network-on-chip in a 2D mesh topology (UniMESH), which simplifies the router architecture, uses only half the channel links while guaranteeing a fully connected topology, and adopts a novel routing algorithm and deadlock recovery mechanism. As a result, it reduces both design complexity and area cost and eliminates some unwanted power consumption. Evaluations show that, compared with a conventional 2D mesh design in SPLASH application simulations, the proposed lightweight UniMESH reduces router area by 57.4% and saves 39.3% of total power, while adding only a small amount of extra latency.