Xiaochun Ye | Researchain

Publication

Featured researches published by Xiaochun Ye.

international parallel and distributed processing symposium | 2010

High performance comparison-based sorting algorithm on many-core GPUs

Xiaochun Ye; Dongrui Fan; Wei Lin; Nan Yuan; Paolo Ienne

Sorting is a kernel algorithm for a wide range of applications. We present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and it achieves up to 30% higher performance than the previous optimized comparison-based GPU sorting algorithm on input sequences with millions of elements.
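The two phases named in this abstract, a bitonic sort followed by a merge sort, can be illustrated with a small CPU-side sketch. This is not the GPU-Warpsort implementation: the function names `bitonic_sort` and `sort_then_merge`, the `chunk` parameter, and the use of `heapq.merge` are assumptions made for illustration, and the input length is assumed to be a multiple of a power-of-two chunk size.

```python
import heapq


def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # compare-exchange distance within this stage
            # Each pair (i, i ^ j) is disjoint from every other pair, so on a
            # GPU these compare-exchanges could run as lockstep warp lanes.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


def sort_then_merge(values, chunk=8):
    """Bitonic-sort fixed-size chunks, then merge sorted chunks pairwise.

    Assumes len(values) is a multiple of `chunk` and `chunk` is a power of two.
    """
    chunks = [bitonic_sort(values[i:i + chunk])
              for i in range(0, len(values), chunk)]
    while len(chunks) > 1:
        # Each pair of sequences is merged independently of the others.
        merged = [list(heapq.merge(chunks[i], chunks[i + 1]))
                  for i in range(0, len(chunks) - 1, 2)]
        if len(chunks) % 2:
            merged.append(chunks[-1])
        chunks = merged
    return chunks[0] if chunks else []
```

The inner loop over `i` is written sequentially here, but no iteration reads a value that another iteration of the same stage writes, which is the property that lets a lockstep warp drop the barrier between lanes.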

Journal of Computer Science and Technology | 2009

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dongrui Fan; Nan Yuan; Junchao Zhang; Yongbin Zhou; Wei Lin; Fenglong Song; Xiaochun Ye; He Huang; Lei Yu; Guoping Long; Hao Zhang; Lei Liu

Moore’s law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents and hardware-supported synchronization mechanisms, to provide full potential for the high efficiency of the on-chip resource utilization. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperating design methodology bridges the high-end computing with mass programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.

international symposium on microarchitecture | 2012

Godson-T: An Efficient Many-Core Processor Exploring Thread-Level Parallelism

Dongrui Fan; Hao Zhang; Da Wang; Xiaochun Ye; Fenglong Song; Guojie Li; Ninghui Sun

Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, such as a region-based cache coherence protocol, data transfer agents, and hardware-supported synchronization mechanisms. Finally, it also features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable.

computer and information technology | 2009

A Low-Complexity Synchronization Based Cache Coherence Solution for Many Cores

Wei Lin; Dongrui Fan; He Huang; Nan Yuan; Xiaochun Ye

Computer architectures have made a dramatic turn away from improving single-processor performance towards improving parallel performance by integrating many cores on one chip. However, providing directory-based coherence protocols for these platforms is too complex and expensive. As a substitute, we propose a synchronization-based cache coherence solution, which uses different cache policies according to three flexible software-guided scopes (exclusion, producer and consumer scopes) to solve the data-race and coherence problem. Furthermore, this protocol implements word dirty bits (only one bit per word is needed in each L1 cache line) and a special hardware synchronization manager (HSM) to support multiple writers and the write-validate policy, and to eliminate the remote un-cache spinning at the “flag”. We evaluate Godson-T, a platform supporting this protocol, against an idealized interleaved dual-tag directory-based protocol on the Splash2 and two bioinformatics benchmarks. The performance of the synchronization-based protocol is degraded by only 3.2% on average, but the real chip area is reduced by 32% even if the overhead of the HSM is included.
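The idea of one dirty bit per word can be shown with a generic sketch of per-word write-back merging. This is not the protocol from the paper: the class `CachedLine`, the constant `WORDS_PER_LINE`, and the write-back rule are hypothetical and only illustrate why word-granularity dirty bits let two writers update different words of the same line without overwriting each other.

```python
WORDS_PER_LINE = 8  # assumed line size, for illustration only


class CachedLine:
    """A private copy of one memory line with one dirty bit per word."""

    def __init__(self, memory_line):
        self.words = list(memory_line)
        self.dirty = 0  # bit w set means word w was written locally

    def write(self, word_index, value):
        self.words[word_index] = value
        self.dirty |= 1 << word_index

    def write_back(self, memory_line):
        # Only words marked dirty are copied out, so words written by other
        # cores in the meantime are left untouched.
        for w in range(WORDS_PER_LINE):
            if self.dirty & (1 << w):
                memory_line[w] = self.words[w]
        self.dirty = 0


memory = [0] * WORDS_PER_LINE
core_a = CachedLine(memory)
core_b = CachedLine(memory)
core_a.write(0, 11)
core_b.write(5, 55)
core_a.write_back(memory)
core_b.write_back(memory)
```

With a single dirty bit per line, the second write-back would copy the whole stale line and lose the first writer's update; with per-word bits both updates survive.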

international symposium on low power electronics and design | 2013

SimICT: a fast and flexible framework for performance and power evaluation of large-scale architecture

Xiaochun Ye; Dongrui Fan; Ninghui Sun; Shibin Tang; Mingzhe Zhang; Hao Zhang

Simulation is an important method to evaluate future computer systems. However, the increasing complexity of the target systems has made the development of simulators very difficult. Furthermore, detailed simulation of large-scale parallel architectures is so slow that full evaluation of real applications becomes a great challenge. This paper presents SimICT, a fast and flexible simulation framework that aims at performance and power evaluation for large-scale architectures. SimICT uses component-based design to improve its flexibility in building target systems. It also introduces an automatic parallel mechanism with relaxed synchronization to speed up the simulation. Finally, it provides a graphical configuration interface to make the framework easier to use. Based on this framework, various existing models, such as performance and power modeling tools, can be integrated to produce a holistic simulation platform.

international conference on parallel and distributed systems | 2012

Auto-Tuning GEMV on Many-Core GPU

Weizhi Xu; Zhiyong Liu; Jun Wu; Xiaochun Ye; Shuai Jiao; Da Wang; Fenglong Song; Dongrui Fan

GPUs provide powerful computing ability, especially for data-parallel algorithms. However, the complexity of the GPU system makes the optimization of even a simple algorithm difficult. Different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are not efficient, especially for small matrices or fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on Fermi GPUs. Instead of using only one thread, we use a warp to compute an element of vector y. We also propose a novel register blocking method to accelerate GEMV on GPUs further. The proposed optimization methods for GEMV are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods can achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 or MAGMA, and the new register blocking method can also perform better than CUBLAS 4.0 or MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on a GPU.
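The statement that a warp, rather than a single thread, computes each element of y can be mimicked on the CPU. The sketch below is only a simulation under assumptions: the function name `gemv_warp_per_row`, the strided assignment of columns to lanes, and the tree reduction are illustrative choices, not the kernels described in the paper.

```python
WARP_SIZE = 32  # number of lanes assumed per warp


def gemv_warp_per_row(matrix, x):
    """Compute y = A x, with one simulated warp cooperating on each row."""
    y = []
    for row in matrix:                      # one warp per output element
        partial = [0.0] * WARP_SIZE
        for lane in range(WARP_SIZE):
            # Lane `lane` handles columns lane, lane + 32, lane + 64, ...
            # so neighbouring lanes touch neighbouring columns.
            for col in range(lane, len(row), WARP_SIZE):
                partial[lane] += row[col] * x[col]
        # Tree reduction of the 32 partial sums into lane 0.
        stride = WARP_SIZE // 2
        while stride >= 1:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y.append(partial[0])
    return y
```

For a fat matrix there are few rows but many columns, so a one-thread-per-row mapping would leave most of the device idle, while this mapping spreads each long row across a full warp.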

Journal of Computer Science and Technology | 2017

An Efficient Network-on-Chip Router for Dataflow Architecture

Xiaowei Shen; Xiaochun Ye; Xu Tan; Da Wang; Lunkai Zhang; Wenming Li; Zhimin Zhang; Dongrui Fan; Ninghui Sun

Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data is frequently transferred among processing elements through the network-on-chip (NoC). Thus the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architecture and we find they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, high injection rate, and performance sensitive to delay. Based on the three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination; thus it can transfer data with multiple destinations in a single transfer. Moreover, the router adopts an output buffer to maximize throughput and adopts non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.

computer and information technology | 2009

A Fast Linear-Space Sequence Alignment Algorithm with Dynamic Parallelization Framework

Xiaochun Ye; Dongrui Fan; Wei Lin

Exact pairwise sequence alignment algorithms using dynamic programming require quadratic space and time, and this makes these algorithms impractical for large-scale sequences. In this paper, we propose and evaluate a new Anti-Diagonal based Parallel Linear-Space Algorithm (AD-PLSA). It records similarity matrix scores and start points on special anti-diagonals instead of special rows or columns. This algorithm is able to further reduce the total amount of re-computation. It also produces balanced sub-problems of approximately even size, which is of great benefit to the parallelization. In addition, we establish a dynamic parallelization framework for efficient acceleration on a symmetric multiprocessor (SMP) platform. The experimental results show linear speedup for long real DNA sequences. Compared with the typical row-col algorithm, our method is able to save more than 30% of the re-computation, and run twice as fast in the backward phase.
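The anti-diagonal view of the similarity matrix can be sketched as follows. This is not AD-PLSA: it computes only a global alignment score, keeps the two previous anti-diagonals instead of recording special ones, and does not recover the alignment itself. The function name `alignment_score` and the scoring values (match 1, mismatch -1, gap -1) are assumptions for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score computed one anti-diagonal at a time.

    Cell (i, j) lies on anti-diagonal d = i + j and depends only on cells of
    anti-diagonals d - 1 and d - 2, so all cells of one anti-diagonal are
    independent and only a linear amount of state is kept.
    """
    m, n = len(a), len(b)
    prev2 = [0] * (m + 1)   # anti-diagonal d - 2, indexed by row i
    prev1 = [0] * (m + 1)   # anti-diagonal d - 1, indexed by row i
    cur = [0] * (m + 1)
    for d in range(m + n + 1):
        cur = [0] * (m + 1)
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j * gap
            elif j == 0:
                cur[i] = i * gap
            else:
                pair = match if a[i - 1] == b[j - 1] else mismatch
                cur[i] = max(prev2[i - 1] + pair,   # from (i - 1, j - 1)
                             prev1[i - 1] + gap,    # from (i - 1, j)
                             prev1[i] + gap)        # from (i, j - 1)
        prev2, prev1 = prev1, cur
    return cur[m]
```

Because every cell on an anti-diagonal can be computed at the same time, the inner loop over `i` is the natural unit of parallel work in this formulation.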

parallel and distributed computing: applications and technologies | 2008

Efficient Parallelization of a Protein Sequence Comparison Algorithm on Manycore Architecture

Xiaochun Ye; Van Hoa Nguyen; Dominique Lavenier; Dongrui Fan

This paper introduces the Godson-T manycore architecture and demonstrates the efficiency of its synchronization mechanism through a computation-intensive bioinformatics application: the comparison of protein banks. The parallel part of the protein sequence comparison algorithm achieves nearly linear speed-up thanks to fine tuning of the synchronization mechanism provided by the Godson-T chip.

Journal of Computer Science and Technology | 2018

A Non-Stop Double Buffering Mechanism for Dataflow Architecture

Xu Tan; Xiaowei Shen; Xiaochun Ye; Da Wang; Dongrui Fan; Lunkai Zhang; Wenming Li; Zhimin Zhang; Zhimin Tang

Double buffering is an effective mechanism to hide the latency of data transfers between on-chip and off-chip memory. However, in dataflow architecture, the swapping of two buffers during the execution of many tiles decreases the performance because of repetitive filling and draining of the dataflow accelerator. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. The proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of processing elements through optimizing control logic in dataflow architecture. Moreover, we propose a work-flow program to cooperate with the non-stop double buffering mechanism. After optimizations both on control logic and on work-flow program, the filling and draining of the array needs to be done only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataflow architecture achieves a 16.2% average efficiency improvement over that without the optimization.