Nan Yuan

Chinese Academy of Sciences

Publication

Featured research published by Nan Yuan.

International Parallel and Distributed Processing Symposium | 2010

High performance comparison-based sorting algorithm on many-core GPUs

Xiaochun Ye; Dongrui Fan; Wei Lin; Nan Yuan; Paolo Ienne

Sorting is a kernel algorithm for a wide range of applications. We present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and that it achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
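The bitonic stage mentioned in the abstract above can be illustrated with a minimal sequential sketch. This is not the GPU-Warpsort implementation: the function name `bitonic_sort` is hypothetical, the input length is assumed to be a power of two, and each value of `i` in the inner loop merely stands in for one GPU thread. Within a stage every compare-exchange pair is disjoint, which is why threads running in lock step would need no barrier between pairs.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # compare-exchange distance for this stage
            for i in range(n):   # each i stands in for one thread
                partner = i ^ j
                if partner > i:  # handle each disjoint pair exactly once
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

Every element performs the same compare-exchange operation in every stage, which is the kind of homogeneous work the abstract refers to when it mentions avoiding branch divergence.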

Journal of Computer Science and Technology | 2009

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dongrui Fan; Nan Yuan; Junchao Zhang; Yongbin Zhou; Wei Lin; Fenglong Song; Xiaochun Ye; He Huang; Lei Yu; Guoping Long; Hao Zhang; Lei Liu

Moore’s law will grant computer architects ever more transistors for the foreseeable future; the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to address this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together enable highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This cooperative hardware/software design methodology bridges the gap between high-end computing and the mass of programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.

European Conference on Parallel Processing | 2009

High Performance Matrix Multiplication on Many Cores

Nan Yuan; Yongbin Zhou; Guangming Tan; Junchao Zhang; Dongrui Fan

Moore's Law suggests that the number of processing cores on a single chip increases exponentially. Future performance increases will mainly be extracted from thread-level parallelism exploited by multi/many-core processors (MCP). Therefore, it is necessary to find out how to build the MCP hardware and how to program the parallelism on such an MCP. In this work, we intend to identify the key architecture mechanisms and software optimizations that guarantee high performance for multithreaded programs. To illustrate this, we customize a dense matrix multiplication algorithm on the Godson-T MCP as a case study to demonstrate the efficient synergy and interaction between hardware and software. Experiments conducted on the cycle-accurate simulator show that the optimized matrix multiplication obtains 97.1% (124.3 GFLOPS) of the peak performance of Godson-T.
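The two figures quoted in the abstract above imply a peak of roughly 128 GFLOPS for the simulated Godson-T configuration. The short calculation below derives that estimate; the peak itself is not stated in the source, and the variable names are illustrative only.

```python
# Figures quoted in the abstract above.
achieved_gflops = 124.3
fraction_of_peak = 0.971

# Implied peak = achieved / fraction (a derived estimate, not a quoted figure).
implied_peak_gflops = achieved_gflops / fraction_of_peak

if __name__ == "__main__":
    print(f"implied peak: {implied_peak_gflops:.1f} GFLOPS")
```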

International Conference on Software Engineering | 2009

Study on Fine-Grained Synchronization in Many-Core Architecture

Lei Yu; Zhiyong Liu; Dongrui Fan; Fenglong Song; Junchao Zhang; Nan Yuan

Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization brings significant overhead and is therefore unsuitable, whereas the overhead of fine-grained synchronization remains small. For many-core architectures that support fine-grained synchronization with on-chip storage, we propose fine-grained synchronization algorithms for two scientific computing applications: 2-D wavefront and LU decomposition. First, an efficient data allocation method is proposed according to the memory access mode. Then, thread partitioning and synchronization are discussed. Finally, we evaluate the two algorithms on the Godson-T many-core architecture. The experimental results show that the relative speedup is almost linear and that the execution time is only 53.2% of that with coarse-grained synchronization. After the global barriers are eliminated, LU decomposition achieves a 13.1% performance improvement. Moreover, the experiments show that the fine-grained mechanism improves processor performance and scales well.
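The contrast between barrier-style and fine-grained synchronization in a 2-D wavefront can be sketched in software as follows. This is an illustration under stated assumptions, not the paper's algorithm: one thread per row, a `threading.Event` per cell standing in for the hardware-supported on-chip synchronization, and a toy recurrence (each interior cell is the sum of its upper and left neighbours). The name `wavefront` is hypothetical.

```python
import threading


def wavefront(rows, cols):
    """Fill a grid where cell (i, j) depends on (i-1, j) and (i, j-1)."""
    grid = [[0] * cols for _ in range(rows)]
    # One flag per cell: set once that cell's value has been produced.
    ready = [[threading.Event() for _ in range(cols)] for _ in range(rows)]

    def worker(i):
        for j in range(cols):
            if i > 0:
                # Fine-grained: wait only for the single cell above,
                # not for a global barrier across all threads.
                ready[i - 1][j].wait()
            if i == 0 or j == 0:
                grid[i][j] = 1
            else:
                grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
            ready[i][j].set()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(rows)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid


if __name__ == "__main__":
    for row in wavefront(4, 4):
        print(row)
```

Because each row waits only on the cell directly above, a row can start as soon as its first dependency is ready instead of idling until a whole anti-diagonal has finished.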

Computer and Information Technology | 2009

A Low-Complexity Synchronization Based Cache Coherence Solution for Many Cores

Wei Lin; Dongrui Fan; He Huang; Nan Yuan; Xiaochun Ye

Computer architectures are making a dramatic turn away from improving single-processor performance towards improving parallel performance by integrating many cores on one chip. However, providing directory-based coherence protocols for these platforms is too complex and expensive. As a substitute, we propose a synchronization-based cache coherence solution, which uses different cache policies according to three flexible software-guided scopes (exclusion, producer, and consumer scopes) to solve the data-race and coherence problems. Furthermore, this protocol implements word dirty bits (only one bit per word is needed in each L1 cache line) and a special hardware synchronization manager (HSM) to support multiple writers and the write-validate policy, and to eliminate remote uncached spinning on the “flag”. We evaluate Godson-T, a platform supporting this protocol, against an idealized interleaved dual-tag directory-based protocol on the Splash2 and two bioinformatics benchmarks. The performance of the synchronization-based protocol degrades by only 3.2% on average, while the real chip area is reduced by 32% even when the overhead of the HSM is included.
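The role of per-word dirty bits in supporting multiple writers can be illustrated with a small sketch. It assumes, hypothetically, a four-word cache line and a bit mask with one bit per word; `write_back` is an illustrative name, and the paper's actual hardware mechanism is not described in enough detail here to reproduce.

```python
def write_back(memory_line, cached_line, dirty_mask):
    """Merge a cached line into memory, copying only words marked dirty.

    Bit w of dirty_mask is set when word w was written by this core.
    """
    return [
        cached if (dirty_mask >> word) & 1 else stored
        for word, (stored, cached) in enumerate(zip(memory_line, cached_line))
    ]


if __name__ == "__main__":
    memory = [0, 0, 0, 0]
    # Core A wrote word 0 and core B wrote word 3 of the same line.
    memory = write_back(memory, [11, 0, 0, 0], 0b0001)
    memory = write_back(memory, [0, 0, 0, 44], 0b1000)
    print(memory)
```

Since each write-back touches only the words its core modified, two cores can update different words of the same line without one overwriting the other's data.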

European Conference on Parallel Processing | 2008

A Performance Model of Dense Matrix Operations on Many-Core Architectures

Guoping Long; Dongrui Fan; Junchao Zhang; Fenglong Song; Nan Yuan; Wei Lin

Current many-core architectures (MCA) have a much larger arithmetic-to-memory-bandwidth ratio than traditional processors (vector, superscalar, multi-core, etc.). As a result, bandwidth has become an important performance bottleneck of MCA. Previous work has demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU, and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$, while the output is the maximum core number $P_{max}$. We show that $P_{max}=\Theta(B\sqrt{C})$. $P_{max}$ indicates that, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores does not exceed $P_{max}$. The model is validated by comparing the theoretical performance with experimental data from previous work.

International Symposium on Parallel and Distributed Processing and Applications | 2009

Data Management: The Spirit to Pursuit Peak Performance on Many-Core Processor

Yongbin Zhou; Junchao Zhang; Shuai Zhang; Nan Yuan; Dongrui Fan

International Symposium on Parallel and Distributed Processing and Applications | 2009

Evaluation Method of Synchronization for Shared-Memory On-Chip Many-Core Processor

Fenglong Song; Zhiyong Liu; Dongrui Fan; He Huang; Nan Yuan; Lei Yu; Junchao Zhang

International Symposium on Parallel and Distributed Processing and Applications | 2009

A Synchronization-Based Alternative to Directory Protocol

He Huang; Lei Liu; Nan Yuan; Wei Lin; Fenglong Song; Junchao Zhang; Dongrui Fan

Efficient support for cache coherence is extremely important in the design and implementation of many-core processors. In this paper, we propose a synchronization-based coherence (SBC) protocol to efficiently support cache coherence for shared-memory many-core architectures. The unique feature of our scheme is that it does not use a directory at all. Inspired by the scope consistency memory model, our protocol maintains coherence at synchronization points. Within a critical section, processor cores record write-sets (the lines that have been written in the critical section) with a Bloom filter. When a core releases the lock, the write-set is transferred to a synchronization manager. When another core acquires the same lock, it gets the write-set from the synchronization manager and invalidates stale data in its local cache. Experimental results show that SBC improves execution time by 5% on average across a suite of scientific applications. At the same time, SBC is more cost-effective than a directory-based protocol, which requires a large amount of hardware resources and a huge design verification effort.

Computer and Information Technology | 2009

Design of New Hash Mapping Functions

Fenglong Song; Zhiyong Liu; Dongrui Fan; Junchao Zhang; Lei Yu; Nan Yuan; Wei Lin

Conflicts can severely degrade computer performance: bank conflicts reduce the bandwidth of interleaved multibank memory systems, and conflict misses reduce the effective on-chip cache capacity, which in turn causes further conflict misses. Conflicts can be avoided by a suitable address mapping scheme that maps the most frequently occurring access patterns conflict-free. In this paper, we present a new XOR-based mapping scheme, called XORM, which targets the multi-bank shared cache of on-chip many-core architectures. The XORM mapping scheme can map arbitrary address bits to set indices and computes each set index bit as the XOR of a transpositional subset of the bits in the address. We then analyze the characteristics an address mapping scheme needs in order to avoid conflicts in many-core architectures. Next, we give a case study on designing optimal hash functions based on the XORM scheme for a skewed-associative cache. Finally, we introduce another case study, in which we illustrate how to design an XORM mapping scheme with lower implementation cost, complexity, and computing latency for the shared cache of an on-chip many-core architecture. The evaluation results show the effectiveness of the XORM mapping scheme.
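The idea of computing each index bit as the XOR of a subset of address bits can be sketched as follows. The masks, the four-bank configuration, and the function name `xor_index` are assumptions chosen for illustration; they are not the XORM functions designed in the paper.

```python
def xor_index(address, bit_masks):
    """Compute a bank/set index; bit p is the parity of address & bit_masks[p]."""
    index = 0
    for position, mask in enumerate(bit_masks):
        parity = bin(address & mask).count("1") & 1  # XOR of the selected bits
        index |= parity << position
    return index


if __name__ == "__main__":
    masks = (0b0101, 0b1010)       # hypothetical masks for a 4-bank cache
    stride_4 = (0, 4, 8, 12)       # access pattern with a stride of 4
    print("modulo :", [a & 0b11 for a in stride_4])
    print("xor    :", [xor_index(a, masks) for a in stride_4])
```

With plain modulo indexing, all four strided addresses fall into bank 0, whereas the XOR mapping with these masks spreads them across four different banks.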