Xiantuo Tang

National University of Defense Technology

Publications

Featured research published by Xiantuo Tang.

IEEE International Conference on Progress in Informatics and Computing | 2014

Accelerating embarrassingly parallel algorithm on Intel MIC

Qinglin Wang; Jie Liu; Xiantuo Tang; Feng Wang; Guitao Fu; Zuocheng Xing

The Embarrassingly Parallel (EP) algorithm, which is typical of many Monte Carlo applications, provides an estimate of the upper achievable limit of double-precision performance on parallel supercomputers. Intel recently released the Many Integrated Core (MIC) architecture as a many-core coprocessor. A MIC coprocessor offers more than 50 cores, each of which can run four hardware threads and execute 512-bit vector instructions. In this paper, we describe how the EP algorithm can be accelerated effectively on platforms containing MIC using the offload execution model. The results show that an efficient implementation of the EP algorithm on MIC takes full advantage of MIC's computational resources and achieves a speedup of 3.06 over an Intel Xeon E5-2670 CPU. Building on the MIC implementation and an effective task distribution model, the implementation of the EP algorithm on a CPU-MIC heterogeneous platform reaches up to 2134.86 Mop/s, a 4.04× speedup over the Intel Xeon E5-2670 CPU.
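The abstract does not give the kernel itself. As a hedged illustration, the sketch below follows the structure of the NAS Parallel Benchmarks EP kernel as it is commonly described: pairs of uniform deviates in (-1, 1) are accepted if they fall inside the unit circle, converted to Gaussian pairs with the Marsaglia polar method, and tallied into square annuli. Python's `random.Random` replaces the benchmark's linear congruential generator, and the proportional `split_pairs` helper is a hypothetical stand-in for the paper's task distribution model, whose details are not stated here.

```python
import math
import random


def ep_batch(num_pairs, seed):
    """EP-style kernel: Marsaglia polar method plus annulus tallies.

    Assumption: random.Random stands in for the NPB linear congruential
    generator, so the numbers will not match the official benchmark.
    """
    rng = random.Random(seed)
    counts = [0] * 10      # accepted pairs binned by max(|X|, |Y|)
    sx = sy = 0.0          # running sums of the Gaussian deviates
    for _ in range(num_pairs):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:                        # keep points inside the unit circle
            f = math.sqrt(-2.0 * math.log(t) / t)
            gx, gy = x * f, y * f                 # two independent N(0, 1) deviates
            sx += gx
            sy += gy
            ring = int(max(abs(gx), abs(gy)))
            if ring < len(counts):
                counts[ring] += 1
    return counts, sx, sy


def split_pairs(total, rates):
    """Hypothetical static split: each device gets work proportional to its
    relative throughput; the rounding remainder goes to the last device."""
    total_rate = sum(rates)
    sizes = [int(total * r / total_rate) for r in rates]
    sizes[-1] += total - sum(sizes)
    return sizes


def ep(total_pairs, rates):
    """Each device processes an independent batch; the only communication is
    the final reduction, which is what makes the problem embarrassingly parallel."""
    sizes = split_pairs(total_pairs, rates)
    results = [ep_batch(n, seed=i) for i, n in enumerate(sizes)]
    counts = [sum(r[0][k] for r in results) for k in range(10)]
    return counts, sum(r[1] for r in results), sum(r[2] for r in results)
```

For example, `ep(100_000, [1.0, 3.06])` weights a CPU and a MIC device by the 3.06× single-device speedup reported above. If the host CPU in the heterogeneous run matched the E5-2670 baseline, an ideal proportional split would give at most 1 + 3.06 = 4.06×, so the reported 4.04× would sit close to that bound; this comparison rests on that assumption.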

Journal of Electrical and Computer Engineering | 2015

Applying partial power-gating to direction-sliced network-on-chip

Feng Wang; Xiantuo Tang; Zuocheng Xing

Network-on-Chip (NoC) is one of the critical communication architectures for future many-core systems. As technology continues to scale down, on-chip networks face an increasing leakage power crisis. Power-gating, a leakage power mitigation technique, can be applied to on-chip networks to address this crisis. However, in a conventional power-gated NoC, network performance is severely affected by disconnection. In this paper, we propose a novel partial power-gating approach to improve the performance of power-gated NoCs. The approach mainly involves a direction-slicing scheme, an improved routing algorithm, and a deadlock recovery mechanism. In synthetic traffic simulation, the proposed design shows favorable power efficiency in the low-load range and achieves better performance than the conventional power-gated design. In application trace simulation, the design consumes on average 15.2%/18.9% more power in the mesh/torus network, but achieves an average performance improvement of 45.0%/28.7% compared with the conventional power-gated design. On balance, the proposed design with partial power-gating offers a better tradeoff between performance and power efficiency.

International Journal of Electronics | 2016

Low-cost and low-power unidirectional torus network-on-chip with corner buffer power-gating

Feng Wang; Xiantuo Tang; Zuocheng Xing; Hengzhu Liu

Network-on-chip (NoC) is one of the critical communication architectures for scaling future many-core processors. The challenge for on-chip networks is to reduce design complexity, saving both area and power, while still delivering high performance in terms of low latency and high throughput. In particular, as network size increases, design complexity and power consumption both become bottlenecks that prevent proper network scaling. Moreover, as technology continues to scale down, leakage power accounts for a larger fraction of total NoC power, so reducing it is increasingly important for a power-efficient NoC design. Power-gating, a representative low-power technique, can be applied to an on-chip network to mitigate leakage power. In this paper, we propose a low-cost and low-power router architecture for the unidirectional torus network, and adopt an improved corner buffer structure for non-intrusive power-gating, which has minimal impact on network performance. In addition, an explicit starvation avoidance mechanism is introduced to guarantee injection fairness while reducing its negative impact on network throughput. Simulation results with synthetic traffic show that our design improves network throughput by 11.3% on average and achieves significant power savings in the low- and medium-load regions. In SPLASH-2 workload simulation, our design saves on average 27.2% of total power compared with the baseline, and reduces average latency by 42.8% compared with the baseline with power-gating.

International Conference on Electronics, Communications, and Computers | 2015

UniMESH: The light-weight unidirectional channel Network-on-Chip in 2D mesh topology

Feng Wang; Xiantuo Tang; Zuocheng Xing; Hengzhu Liu

Power consumption, design complexity, and area cost are limiting constraints in the design of interconnects for scalable many-core systems. To address power and area concerns, we propose a lightweight unidirectional channel network-on-chip in 2D mesh topology (UniMESH), which simplifies the router architecture, uses only half as many channel links while still guaranteeing a fully connected topology, and adopts a novel routing algorithm and deadlock recovery mechanism. As a result, it reduces both design complexity and area cost, and cuts some unnecessary power consumption. In SPLASH application simulations, the proposed lightweight UniMESH reduces router area by 57.4% and total power consumption by 39.3% compared with a conventional 2D mesh design, while adding only a small amount of extra latency.

Microelectronics Journal | 2015

Applying partial power-gating to bit-sliced network-on-chip

Feng Wang; Xiantuo Tang; Zuocheng Xing

In many-core systems, network-on-chip (NoC) serves as an efficient and scalable architecture for connecting numerous on-chip resources, but it faces increasing leakage power as technology continues to scale down. Power-gating, a representative low-power technique, can be used to mitigate this leakage power, but the disconnection problem of the conventional power-gated NoC may severely degrade network performance. In this paper, we propose a novel partial power-gating approach to avoid the performance loss caused by disconnection. First, we use an asymmetrical bit-slicing scheme to split each router into two slices. After the router datapath is bit-sliced, the wide slices can be switched off through partial power-gating to save leakage power, while all narrow slices are kept always active to avoid disconnection. Next, because the router datapath is sliced, we redefine the packet format for slicing and transferring packets, and present two essential conversion modules that perform packet slicing and reassembly. In synthetic traffic simulation, our design achieves considerable power savings at low load and performs better than the conventional power-gated design. Application simulation shows that our design saves on average 27.5% of total power compared with the baseline design, and reduces packet latency by 45.0% on average compared with the conventional power-gated design. On balance, the bit-sliced NoC with partial power-gating offers a better tradeoff between performance and power efficiency.

Graphical abstract: We propose a novel partial power-gating approach to avoid performance loss, and use an asymmetrical bit-slicing scheme to split the router into two slices, as shown in Fig. 1(a). The wide slices can be switched off through partial power-gating to save leakage power, while all narrow slices are kept always active to avoid disconnection, as shown in Fig. 1(b).

Fig. 1. Bit-sliced NoC with partial power-gating: (a) router; (b) topology.

Highlights:
- The asymmetrical bit-slicing scheme is used to slice the router datapath.
- Power-gating is applied only to part of the channel bits of each sliced router.
- The packet format is redefined to support packet slicing.
- Two conversion modules are added to perform packet slicing and reassembly.
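The abstract does not give the slice widths or the redefined packet format, so the sketch below only illustrates the bit-level idea under assumed widths: a hypothetical 128-bit flit is split asymmetrically into a 32-bit narrow slice (always active) and a 96-bit wide slice (power-gateable). When the wide slices are gated off, a slicing module cuts each full flit into narrow flits, and a reassembly module rebuilds it at the receiver. `NARROW_BITS`, `WIDE_BITS` and all function names are illustrative assumptions, not the paper's design.

```python
NARROW_BITS = 32                  # hypothetical width of the always-active narrow slice
WIDE_BITS = 96                    # hypothetical width of the power-gateable wide slice
FLIT_BITS = NARROW_BITS + WIDE_BITS


def split_router_datapath(flit):
    """Both slices on: one full flit travels as a (narrow, wide) pair."""
    narrow = flit & ((1 << NARROW_BITS) - 1)
    wide = flit >> NARROW_BITS
    return narrow, wide


def slice_flit(flit):
    """Slicing module sketch: cut a full-width flit into narrow flits,
    least-significant chunk first, so it can cross the narrow slice alone."""
    assert 0 <= flit < (1 << FLIT_BITS)
    mask = (1 << NARROW_BITS) - 1
    n = -(-FLIT_BITS // NARROW_BITS)      # ceiling division
    return [(flit >> (i * NARROW_BITS)) & mask for i in range(n)]


def reassemble(narrow_flits):
    """Reassembly module sketch: rebuild the full-width flit at the receiver."""
    flit = 0
    for i, chunk in enumerate(narrow_flits):
        flit |= chunk << (i * NARROW_BITS)
    return flit
```

With these assumed widths, a gated-off wide slice turns every full flit into four narrow flits, which shows why serialization latency rises only while the wide slices are off and the network stays connected throughout.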

CCF National Conference on Computer Engineering and Technology | 2015

A ML-Based High-Accuracy Estimation of Sampling and Carrier Frequency Offsets for OFDM Systems

Cang Liu; Luechao Yuan; Zuocheng Xing; Xiantuo Tang; Guitao Fu

This paper addresses the problem of acquiring the sampling frequency offset (SFO) and carrier frequency offset (CFO), which severely degrade the performance of orthogonal frequency division multiplexing (OFDM) systems. Using two identical frequency-domain (FD) long training symbols in the preamble, we propose a novel maximum-likelihood (ML) estimation method that acquires the SFO and CFO simultaneously, extending the estimation methods of Kim and Wang. The main contribution of this paper is the use of a first-order Legendre series expansion to obtain the SFO and CFO values in closed form. To evaluate the proposed estimation scheme, we built an OFDM system model based on IEEE 802.11a. The results show that the proposed scheme achieves the best performance among the compared existing schemes.
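The closed-form ML estimator with the first-order Legendre expansion is not reproduced here. As a simpler point of comparison, the sketch below uses the common linear-phase model: between two identical training symbols spaced `spacing` samples apart, subcarrier k rotates by roughly 2π(spacing/N)(ε + kζ), where ε is the CFO normalized to the subcarrier spacing and ζ is the relative SFO. CFO therefore appears as a phase offset shared by all subcarriers and SFO as a phase slope across k, and both are fitted here by ordinary least squares. This is a plainly swapped-in baseline; the model and the function names are assumptions for illustration.

```python
import cmath
import math


def phase_differences(y1, y2):
    """Per-subcarrier phase rotation between two received copies of the same
    frequency-domain training symbol (y2 observed after y1)."""
    return [cmath.phase(b * a.conjugate()) for a, b in zip(y1, y2)]


def estimate_cfo_sfo(subcarriers, phases, n_fft, spacing):
    """Least-squares fit of theta_k = 2*pi*(spacing/n_fft)*(eps + k*zeta).

    Returns (eps, zeta). Assumes phases are unwrapped, i.e. the total
    rotation stays within (-pi, pi] on every subcarrier.
    """
    m = len(subcarriers)
    k_mean = sum(subcarriers) / m
    p_mean = sum(phases) / m
    sxx = sum((k - k_mean) ** 2 for k in subcarriers)
    sxy = sum((k - k_mean) * (p - p_mean) for k, p in zip(subcarriers, phases))
    slope = sxy / sxx                       # phase slope across k -> SFO
    intercept = p_mean - slope * k_mean     # common phase -> CFO
    scale = 2.0 * math.pi * spacing / n_fft
    return intercept / scale, slope / scale
```

In the IEEE 802.11a preamble the two 64-sample long training symbols are sent back to back, so for that case `n_fft = spacing = 64` and the used subcarriers are k = ±1 … ±26.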

CCF National Conference on Computer Engineering and Technology | 2013

Backhaul-Route Pre-Configuration Mechanism for Delay Optimization in NoCs

Xiantuo Tang; Feng Wang; Zuocheng Xing; Qinglin Wang

The paper proposes a backhaul-route pre-configuration mechanism (BRPCM) for the round-trip communication pattern, which is suited to the traversal of backhaul packets. Based on previous communication patterns, BRPCM pre-configures a reverse crossbar connection while the preceding flits traverse a router, creating a backhaul route within that single router. Combined with appropriate route reuse and termination mechanisms, subsequent packets that satisfy the required conditions are expected to reuse the backhaul route and be forwarded directly to the crossbar without the switch allocation (SA) stage, reducing the average packet traversal latency. Our evaluation with traces from the Splash-2 benchmark shows that BRPCM improves average performance by up to 53.5%, 40.1%, and 16.4% compared with the BASE, BASE_LR, and BASE_LR_SPC routers, respectively. Evaluated with synthetic workload traffic, BRPCM improves performance by up to 51.5%, 36.3%, and 10.2% compared with the BASE, BASE_LR, and BASE_LR_SPC routers under Uniform-random, Bit-reverse, Shuffle, and Transpose traffic at low load.
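A toy model can make the pre-configuration idea concrete. In the hypothetical sketch below, router ports are named by direction; a request that enters on `in_port` and leaves on `out_port` pre-configures the reverse connection `out_port -> in_port`, because the reply arrives back on the port the request left by. A reply that matches the stored route bypasses switch allocation, and the route is then terminated. The "reply only" reuse condition, the single-use termination, and the class and method names are simplifying assumptions; the paper's actual reuse and termination conditions are not given in the abstract.

```python
class BackhaulRouter:
    """Toy sketch of backhaul-route pre-configuration (hypothetical model).

    Ports are named by direction, e.g. "N", "S", "E", "W", "L" (local).
    """

    def __init__(self):
        self.backhaul = {}        # input port -> pre-configured output port
        self.sa_requests = 0      # packets that went through switch allocation
        self.bypasses = 0         # packets that reused a backhaul route

    def forward(self, in_port, out_port, is_reply=False):
        if is_reply and self.backhaul.get(in_port) == out_port:
            # Reuse the pre-configured crossbar connection: skip SA,
            # then terminate the route (single-use simplification).
            del self.backhaul[in_port]
            self.bypasses += 1
            return "bypass"
        self.sa_requests += 1
        if not is_reply:
            # The reply will come back in on out_port and must leave via
            # in_port; the newest request through out_port overwrites older ones.
            self.backhaul[out_port] = in_port
        return "switch_allocation"
```

For instance, a request routed from `"W"` to `"E"` goes through SA, and its reply arriving on `"E"` headed for `"W"` takes the bypass; a second reply on the same path falls back to SA because the route was terminated.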

Parallel Computing in Electrical Engineering | 2011

Design and Evaluation of Traffic Filter for Token Protocol

Guitao Fu; Zuocheng Xing; Tianlei Zhao; Xiantuo Tang

Cache coherence protocols play an important role in maintaining data coherence in shared-memory multiprocessors. The token protocol provides a flexible framework for designing new coherence protocols. It offers two attributes: low-latency cache misses and no reliance on totally ordered interconnects. However, messages in the token protocol are always broadcast, which limits the scalability of token-based protocols. In this paper, a traffic filter is proposed to reduce the network traffic of the token protocol. It records information about recently used blocks. When a miss occurs, the requested block is looked up in the traffic filter, and if it is present, broadcasting can be avoided. With the traffic filter, GETS requests are serviced by the owner node and GETX requests are sent to all sharers. Thus only nodes holding tokens are accessed and broadcasting is avoided, which reduces network traffic. Experimental results show that, overall, TF256 and TF1024 reduce interconnect traffic by an average of 34.3% and 27.9%, respectively, and endpoint traffic by an average of 32.6% and 26.7%, respectively. Our experiments also show that TF256 performs better than TF1024 for some applications.
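The filter lookup described above can be sketched as a small LRU table. This is a minimal sketch under stated assumptions: TF256 and TF1024 are taken to denote filters with 256 and 1024 entries (the abstract does not say so explicitly), each entry stores an owner and a set of sharers, and a GETX goes to the sharers plus the owner, since the owner also holds tokens. The class and method names are hypothetical.

```python
from collections import OrderedDict


class TrafficFilter:
    """LRU table of recently used blocks (sketch; capacity=256 is assumed
    to correspond to TF256)."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.entries = OrderedDict()        # block address -> (owner, set of sharers)

    def record(self, block, owner, sharers):
        """Remember which nodes hold tokens for a recently used block."""
        self.entries[block] = (owner, set(sharers))
        self.entries.move_to_end(block)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry

    def destinations(self, block, request, all_nodes):
        """Return the set of nodes a miss request is sent to."""
        if block not in self.entries:
            return set(all_nodes)               # filter miss: fall back to broadcast
        self.entries.move_to_end(block)
        owner, sharers = self.entries[block]
        if request == "GETS":
            return {owner}                      # read miss: serviced by the owner
        return sharers | {owner}                # GETX: every token holder
```

A smaller table evicts entries sooner, so more misses fall back to broadcast; this sketch does not explain why TF256 outperforms TF1024 for some applications.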

Advances in Difference Equations | 2016

An efficient parallel algorithm for Caputo fractional reaction-diffusion equation with implicit finite-difference method

Qinglin Wang; Jie Liu; Chunye Gong; Xiantuo Tang; Guitao Fu; Zuocheng Xing

Archive | 2012

On-chip cache structure used for variable memory access mode of general-purpose stream processor

Zuocheng Xing; Guitao Fu; Xiaobao Chen; Anguo Ma; Ping Huang; Xiantuo Tang; Rui He; Qinglin Wang; Xiaobo Yan; Fangyuan Li; Jianxiong Qiu; Fang Cai; Yinpi Min; Jiaxiang Mei; Xiaodong Meng; Qi Zhao; Hongyan Wang
