Xingsheng Tang | Researchain

Publication

Featured researches published by Xingsheng Tang.

computer and information technology | 2010

An Efficient Power-Aware Optimization for Task Scheduling on NoC-based Many-core System

Wei Hu; Xingsheng Tang; Bin Xie; Tianzhou Chen; Dazhou Wang

With the development of the semiconductor industry, more processors can be integrated onto a single chip. Network-on-Chip (NoC) is an efficient on-chip interconnect for many-core systems. However, improving performance while lowering power consumption remains a challenge. The core issue is the mapping of applications onto the NoC. A common method is to identify processes that communicate heavily with each other and map them to neighboring cores, which shortens the communication distance and avoids unnecessary energy cost. This work proposes an online scheduling method that optimizes task scheduling for low communication energy consumption. The communication status of applications is first analyzed at run time. The algorithm then computes the mapping dynamically and performs real-time scheduling online. Simulation-based experimental results show that the proposed algorithm achieves more than 30% communication energy savings with low complexity.
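The idea of placing heavily communicating tasks on neighboring cores can be illustrated with a minimal sketch. It assumes a cost model that is not taken from the paper: communication cost is the sum of traffic volume times Manhattan hop count on a 2D mesh. The task names, traffic volumes, and the exhaustive search are all hypothetical and only practical for a tiny mesh.

```python
from itertools import permutations

def hops(a, b):
    """Manhattan distance between two mesh tiles given as (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_cost(mapping, traffic):
    """Assumed cost model: sum of volume * hop count over all task pairs."""
    return sum(vol * hops(mapping[src], mapping[dst])
               for (src, dst), vol in traffic.items())

# Hypothetical traffic volumes observed at run time: (src, dst) -> volume.
traffic = {("t0", "t1"): 100, ("t2", "t3"): 80, ("t0", "t3"): 5}

tasks = ["t0", "t1", "t2", "t3"]
tiles = [(0, 0), (1, 0), (0, 1), (1, 1)]  # 2x2 mesh

# A communication-oblivious placement: the heavy pairs sit on diagonals.
naive = {"t0": (0, 0), "t1": (1, 1), "t2": (1, 0), "t3": (0, 1)}

# Brute-force search over all placements (illustration only).
best = min((dict(zip(tasks, p)) for p in permutations(tiles)),
           key=lambda m: comm_cost(m, traffic))

naive_cost = comm_cost(naive, traffic)
best_cost = comm_cost(best, traffic)
print(naive_cost, best_cost)
```

An online scheduler would replace the brute-force search with a low-complexity heuristic, since the number of placements grows factorially with the number of tiles.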

The Journal of Supercomputing | 2013

An energy-aware online task mapping algorithm in NoC-based system

Bin Xie; Tianzhou Chen; Wei Hu; Xingsheng Tang; Dazhou Wang

With the development of semiconductor technology, more processors can be integrated onto a single chip. Network-on-Chip (NoC) is an efficient communication solution for many-core systems. However, improving performance while lowering energy consumption remains a challenge. One critical issue is mapping applications onto the NoC. This work proposes an online mapping method that optimizes task mapping to reduce communication energy consumption. The communication status of applications is first analyzed at runtime. The algorithm then computes the mapping placement dynamically and performs real-time mapping online. Simulation-based experimental results show that the algorithm proposed in this article achieves more than 20% communication energy savings compared with first-fit mapping and nearest-neighbor mapping. The migration cost caused by the remapping process is also considered and can be calculated at runtime to estimate the effect of remapping.

international conference on green computing and communications | 2011

PeRex: A Power Efficient FPGA-based Architecture for Regular Expression Matching

Yuan Wen; Xingsheng Tang; Lihan Ju; Tianzhou Chen

Regular expressions are widely used in string pattern matching. In many practical applications, string pattern matching is the most compute-intensive task and takes the majority of processing time; therefore, much work has been done on hardware implementations of regular expression matching to improve system efficiency. However, traditional design approaches pay more attention to implementation methods and their efficiency than to power consumption. In this paper we present a power-efficient regular expression matching architecture (PeRex). By making full use of both the rising and trailing edges of the system clock, the architecture is able to match two characters in a single system cycle. The architecture can therefore maintain high performance and throughput at a lower clock frequency, which decreases dynamic power consumption remarkably. As analyzed with XPower, a tool offered by Xilinx Inc., our approach saves dynamic power consumption by 1.7 times compared with traditional approaches on a Virtex-V XC5VLX30 FPGA device.
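A software analogy may help show why consuming one character on each clock edge halves the number of cycles per input. This is only a sketch: the DFA, the pattern `a b* c`, and the function names are hypothetical, and the actual PeRex design is a hardware architecture whose internals are not described in the abstract. Since dynamic power in CMOS scales roughly linearly with clock frequency, the same throughput at a lower clock is what makes the saving possible.

```python
# Hypothetical DFA for the pattern a b* c (full match).
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 1, "c": 2},
    2: {},
}
ACCEPT = {2}
DEAD = 3  # sink state for any undefined transition

def step(state, ch):
    """Advance the DFA by one character; unknown moves go to DEAD."""
    return TRANSITIONS.get(state, {}).get(ch, DEAD)

def match_two_per_cycle(text):
    """Return (matched, cycles), consuming one character per clock edge."""
    state, cycles = 0, 0
    for i in range(0, len(text), 2):
        state = step(state, text[i])          # first edge of the cycle
        if i + 1 < len(text):
            state = step(state, text[i + 1])  # second edge of the cycle
        cycles += 1
    return state in ACCEPT, cycles

print(match_two_per_cycle("abbc"))
print(match_two_per_cycle("acb"))
```

A matcher that consumes one character per cycle would need `len(text)` cycles for the same input, so it would have to run at about twice the clock frequency to reach the same throughput.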

Journal of Systems Architecture | 2014

Improving branch divergence performance on GPGPU with a new PDOM stack and multi-level warp scheduling

Licheng Yu; Xingsheng Tang; Minghui Wu; Tianzhou Chen

General-purpose graphics processing units (GPGPUs) play an important role in massively parallel computing. A GPGPU core typically holds thousands of threads, with hardware threads organized into warps. With the single instruction multiple thread (SIMT) pipeline, a GPGPU can achieve high performance. However, threads in the same warp that take different branches violate the SIMD style and cause branch divergence. To support this, a hardware stack is used to execute all branches sequentially; hence branch divergence leads to performance degradation. This article represents the PDOM (post dominator) stack as a binary tree in which each leaf corresponds to a branch target. We propose a new PDOM stack called PDOM-ASI, which can schedule all the tree leaves. The new stack can hide more long operation latencies with more schedulable warps without the problem of warp over-subdivision. In addition, a multi-level warp scheduling policy is proposed, which lets part of the warps run ahead and creates more opportunities to hide latencies. The simulation results show that our policies achieve a 10.5% performance improvement over baseline policies with only 1.33% hardware area overhead.
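The baseline behavior described above, where a hardware stack serializes the paths of a divergent branch, can be sketched as follows. This models only a conventional reconvergence stack under simplifying assumptions; it is not PDOM-ASI, and the function name, program counters, and 4-thread warp are hypothetical.

```python
def push_divergence(stack, active, taken, taken_pc, fall_pc, reconv_pc):
    """Push stack entries (pc, mask) for a branch executed by a warp.

    Masks are integers with one bit per thread in the warp.
    """
    t = active & taken    # threads that take the branch
    n = active & ~taken   # threads that fall through
    if t == 0 or n == 0:
        # Uniform branch: no divergence, the whole warp follows one path.
        stack.append((taken_pc if t else fall_pc, active))
        return
    stack.append((reconv_pc, active))  # executed last, all threads rejoin
    stack.append((fall_pc, n))         # executed second
    stack.append((taken_pc, t))        # top of stack, executed first

stack = []
push_divergence(stack, active=0b1111, taken=0b0011,
                taken_pc=10, fall_pc=20, reconv_pc=30)

# Only the top entry is schedulable at any time, so the paths run serially.
order = [stack.pop() for _ in range(len(stack))]
print(order)
```

In this model each divergent path runs with only half of the lanes active, and the second path cannot start while the first one waits on a long-latency operation, which is the inefficiency that scheduling more than one leaf aims to reduce.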

computer and information technology | 2010

Network Main Memory Architecture for NoC-Based Chips

Xingsheng Tang; Binbin Wu; Tianzhou Chen; Wei Hu; Jiexiang Kang; Zhenwei Zheng

Network on Chip (NoC) is considered the best candidate for future on-chip communication; however, as the number of on-chip processors increases, their simultaneous memory accesses can cause a serious main memory bottleneck. In this study, we propose the concept of Network Main Memory (NMM). NMM has a distributed network architecture for main memory and multiple communication channels to NoC chips, which can overcome the main memory bottleneck. Compared with traditional memory, the bandwidth of NMM can be fully utilized owing to the network architecture, and memory bandwidth is easy to increase. Our experimental results on a simulator show that NMM can provide better traffic for NoCs. In addition, the management of NMM and the software model for NoC chips and NMM are also discussed.

ieee international conference on high performance computing data and analytics | 2012

Improve GPGPU Latency Hiding with a Hybrid Recovery Stack and a Window Based Warp Scheduling Policy

Tianzhou Chen; Xingsheng Tang; Licheng Yu; Jianliang Ma; Minghui Wu

Branch divergence usually has a very serious impact on SIMD pipeline efficiency. The Dynamic Warp Subdivision (DWS) method, however, exploits branch divergence to hide memory latency by interleaving issue among all branch paths of a warp, but it may suffer from a serious over-subdivision problem. We therefore propose a hybrid stack mechanism that enables the PDOM stack to issue any ready sub-warp without losing the logical structure of the PDOM stack. To maximize the potential of the hybrid stack, we propose a window-based scheduling policy that reinforces memory latency hiding. Experimental results show that combining our window-based scheduling policy with the hybrid stack hardware improves performance by 10% compared with the baseline configuration using the PDOM loose round-robin method, and by 6.8% over DWS-PC with our window-based scheduling policy, on our 7 selected benchmark programs.

international conference on algorithms and architectures for parallel processing | 2010

Single thread program parallelism with dataflow abstracting thread

Tianzhou Chen; Xingsheng Tang; Jianliang Ma; Lihan Ju; Guanjun Jiang; Qingsong Shi

A chip multiprocessor (CMP) offers more benefits than a uni-core processor, but it does not suit legacy code well because legacy code was written for uni-core processors. This paper presents a novel thread-level parallelism technique called Dataflow Abstracting Thread (DFAT). DFAT builds a United Dependence Graph (UDG) for the program and decouples a single thread into many threads that can run in parallel on a CMP. DFAT analyzes the program's data, control, and anti-dependences to obtain a dependence graph; the dependences are then combined and annotated with attributes to form the UDG. The UDG determines the execution order of instructions, and instructions are assigned to different threads one by one according to this order. An algorithm decides how to assign the instructions. DFAT considers both communication overhead and thread balance after the original thread division. Thread communication in DFAT is implemented with a producer-consumer model. DFAT can automatically abstract multiple threads from a single thread and can be implemented in a compiler. In our evaluation, we decouple a single thread into at most 8 threads with DFAT, and the results show that decoupling into 4-6 threads yields the best benefit.
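The assignment step, in which instructions are handed to threads one by one in dependence order while weighing communication overhead against thread balance, might look roughly like the sketch below. The greedy heuristic, the `comm_cost` weight, and the instruction names are assumptions made for illustration; the abstract does not specify the actual DFAT algorithm. The sketch requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

def assign_threads(deps, n_threads, comm_cost=2):
    """Greedily assign instructions to threads in dependence order.

    deps maps each instruction to the set of instructions it depends on.
    The cost of placing an instruction on a thread is that thread's
    current load plus comm_cost per dependence that crosses threads.
    """
    load = [0] * n_threads
    owner = {}
    for ins in TopologicalSorter(deps).static_order():
        def cost(t):
            cross = sum(1 for p in deps.get(ins, ()) if owner[p] != t)
            return load[t] + comm_cost * cross
        best = min(range(n_threads), key=cost)
        owner[ins] = best
        load[best] += 1
    return owner

# Hypothetical dependence graph: two independent chains that join at i5.
deps = {
    "i1": set(),
    "i2": set(),
    "i3": {"i1"},
    "i4": {"i2"},
    "i5": {"i3", "i4"},
}
owner = assign_threads(deps, n_threads=2)
print(owner)
```

With this weighting, each independent chain stays on its own thread and only the joining instruction incurs cross-thread communication, which in DFAT would go through the producer-consumer channel.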

Archive | 2011