Dongrui Fan | Researchain

Publication

Featured research published by Dongrui Fan.

International Parallel and Distributed Processing Symposium | 2010

High performance comparison-based sorting algorithm on many-core GPUs

Xiaochun Ye; Dongrui Fan; Wei Lin; Nan Yuan; Paolo Ienne

Sorting is a kernel algorithm for a wide range of applications. We present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and it achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
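The two-phase structure described in this abstract (a bitonic sort followed by a merge sort) can be illustrated with a minimal sequential sketch. This is not the authors' GPU-Warpsort: it runs on the CPU, ignores warps and memory coalescing, and the names `bitonic_sort`, `tile_sort_then_merge`, and the `tile` parameter are hypothetical. It assumes the input length is a multiple of a power-of-two tile size.

```python
from heapq import merge


def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for this direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def tile_sort_then_merge(values, tile=8):
    """Bitonic-sort fixed-size tiles, then merge sorted runs pairwise."""
    assert len(values) % tile == 0, "illustrative sketch: no padding logic"
    runs = [bitonic_sort(values[i:i + tile])
            for i in range(0, len(values), tile)]
    while len(runs) > 1:
        merged = [list(merge(runs[i], runs[i + 1]))
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:      # carry an unpaired run to the next round
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []


data = [9, 3, 7, 1, 8, 2, 6, 4, 15, 11, 13, 10, 14, 12, 0, 5]
result = tile_sort_then_merge(data, tile=4)
```

Every compare-and-swap in the inner loop of `bitonic_sort` follows the same fixed pattern regardless of the data, which is the property that makes bitonic networks a natural fit for threads executing in lockstep.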

Journal of Computer Science and Technology | 2009

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dongrui Fan; Nan Yuan; Junchao Zhang; Yongbin Zhou; Wei Lin; Fenglong Song; Xiaochun Ye; He Huang; Lei Yu; Guoping Long; Hao Zhang; Lei Liu

Moore’s law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which enable highly efficient utilization of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperative design methodology bridges high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.

International Symposium on Microarchitecture | 2015

Enabling coordinated register allocation and thread-level parallelism optimization for GPUs

Xiaolong Xie; Yun Liang; Xiuhong Li; Yudong Wu; Guangyu Sun; Tao Wang; Dongrui Fan

The key to high performance on GPUs lies in massive threading, which enables thread switching and hides long latencies. GPUs are equipped with a large register file to enable fast context switches. However, thread throttling techniques that are designed to mitigate cache contention lead to under-utilization of registers. Register allocation is a significant factor for performance, as it not only determines the single-thread performance but also indirectly affects the TLP. In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT) to explore the optimization space of register allocation and TLP management on GPUs. CRAT employs both compile-time (CRAT-static) and run-time (CRAT-dyn) techniques to exhaust the design space. CRAT-static works statically to explore the trade-off between TLP and register allocation, and CRAT-dyn exploits dynamic register allocation for further improvement. Experiments indicate that CRAT-static achieves an average 1.25X speedup over an existing TLP management technique. On four register-limited applications, CRAT-dyn further improves the performance speedup of CRAT-static from 1.51X to 1.70X.

Workshop on Parallel and Distributed Simulation | 2010

P-GAS: Parallelizing a Cycle-Accurate Event-Driven Many-Core Processor Simulator Using Parallel Discrete Event Simulation

Huiwei Lv; Yuan Cheng; Lu Bai; Mingyu Chen; Dongrui Fan; Ninghui Sun

Multi-core processors are commonly available now, but most traditional computer architecture simulators still use single-threaded execution. In this paper, we use parallel discrete event simulation (PDES) to speed up a cycle-accurate event-driven many-core processor simulator. Evaluation against the sequential version shows that the parallelized one achieves an average speedup of 10.9× (up to 13.6×) running SPLASH-2 kernels on a 16-core host machine, with cycle counter differences of less than 0.1%. Moreover, super-linear speedups are achieved between running 1 thread and 8 threads due to reduced overhead of insert-event-to-queue time and increased cache size in parallel processing. We conclude that PDES could be an attractive option for achieving fast cycle-accurate many-core processor simulations.
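For readers unfamiliar with event-driven simulation, the sequential baseline that PDES parallelizes can be sketched as a loop over a time-ordered event queue. This is a generic illustration, not the P-GAS simulator: the `simulate` function, the handler table, and the "fetch"/"execute" event names are all hypothetical.

```python
import heapq
from itertools import count


def simulate(initial, handlers, end_time):
    """Process (time, name) events in time order until end_time."""
    order = count()  # tie-breaker so equal timestamps pop in insertion order
    queue = [(t, next(order), name) for t, name in initial]
    heapq.heapify(queue)
    trace = []
    while queue:
        now, _, name = heapq.heappop(queue)
        if now > end_time:
            break
        trace.append((now, name))
        # A handler returns (delay, event_name) pairs to schedule.
        for delay, next_name in handlers[name](now):
            heapq.heappush(queue, (now + delay, next(order), next_name))
    return trace


# Hypothetical two-stage model: a fetch triggers an execute, and vice versa.
handlers = {
    "fetch": lambda now: [(1, "execute")],
    "execute": lambda now: [(2, "fetch")],
}
trace = simulate([(0, "fetch")], handlers, end_time=6)
```

Each `heappush` costs time that grows with the queue length, which is one way to read the abstract's remark that splitting events across threads reduces insert-event-to-queue overhead.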

International Conference on Parallel and Distributed Systems | 2008

A Quantitative Study of the On-Chip Network and Memory Hierarchy Design for Many-Core Processor

Xu Wang; Ge Gan; Joseph B. Manzano; Dongrui Fan; Shuxu Guo

In this paper, we study the on-chip network and memory hierarchy design of Godson-T, a homogeneous many-core processor. Godson-T has 64 cores (with private L1 cache) and 16 global L2 cache banks. All these on-chip units are connected by a 2D 8 × 8 mesh network. Our study reveals that: (a) Global on-chip L2 cache can effectively alleviate the memory pressure caused by the data-thirsty on-chip computing engines. However, its potential is still limited by both the off-chip and the on-chip bandwidth, especially when increasing the number of active threads. (b) On-chip traffic congestion is largely caused by the intensive memory access requests issued from the on-chip cores. Therefore, the design of the on-chip network must consider the available performance of the datapath that connects the processor to the main memory. (c) In theory, different applications have different communication patterns (Berkeley's view). However, an application's runtime communication pattern is only determined by the design of the underlying memory hierarchy and on-chip interconnection. These conclusions are generally applicable to a wide variety of many-core processors with similar design.

International Conference on Parallel Architectures and Compilation Techniques | 2014

SpongeDirectory: flexible sparse directories utilizing multi-level memristors

Lunkai Zhang; Dmitri B. Strukov; Hebatallah Saadeldeen; Dongrui Fan; Mingzhe Zhang; Diana Franklin

Cache-coherent shared memory is critical for programmability in many-core systems. Several directory-based schemes have been proposed, but dynamic, non-uniform sharing makes efficient directory storage challenging, with each scheme giving up storage space, performance, or energy. We introduce SpongeDirectory, a sparse directory structure that exploits multi-level memristor technology. SpongeDirectory expands directory storage in-place when needed by increasing the number of bits stored on a single memristor device, trading latency and energy for storage. We explore several SpongeDirectory configurations, finding that a provisioning rate of 0.5× with memristors optimized for low energy consumption is the most competitive. This optimal SpongeDirectory configuration has performance comparable to a conventional sparse directory, requires 18× less storage space, and consumes 8× less energy.

International Symposium on Microarchitecture | 2012

Godson-T: An Efficient Many-Core Processor Exploring Thread-Level Parallelism

Dongrui Fan; Hao Zhang; Da Wang; Xiaochun Ye; Fenglong Song; Guojie Li; Ninghui Sun

Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, such as a region-based cache coherence protocol, data transfer agents, and hardware-supported synchronization mechanisms. Finally, it also features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable.

IEEE Transactions on Parallel and Distributed Systems | 2016

An Evolutionary Technique for Performance-Energy-Temperature Optimized Scheduling of Parallel Tasks on Multi-Core Processors

Hafiz Fahad Sheikh; Ishfaq Ahmad; Dongrui Fan

This paper proposes a multi-objective evolutionary algorithm (MOEA)-based task scheduling approach for determining Pareto optimal solutions with simultaneous optimization of performance (P), energy (E), and temperature (T). Our algorithm includes problem-specific solution encoding, determining the initial population of the solution space, and the genetic operators that collectively work on generating efficient solutions in fast turnaround time. Multiple schedules offer a diverse range of values for makespan, energy consumed, and peak temperature and thus present an efficient way of identifying trade-offs among the desired objectives, for a given application and machine pair. We also present a methodology for selecting one solution from the Pareto front given the user's preference. The proposed algorithm for scheduling tasks to cores achieves three-way optimization with fast turnaround time. The proposed algorithm is advantageous because it reduces both energy and temperature together rather than in isolation. We evaluate the proposed algorithm using implementation and simulation, and compare it with integer linear programming as well as with other scheduling algorithms that are energy- or thermal-aware. The time complexity of the proposed scheme is considerably better than that of the compared algorithms.
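The notions of a Pareto front and of picking one schedule by user preference can be made concrete with a small sketch. It does not reproduce the paper's MOEA or its selection methodology: the candidate schedules are made-up (makespan, energy, peak temperature) tuples, all three objectives are assumed to be minimized, and the normalized weighted sum in `pick_by_preference` is only one hypothetical way to encode a preference.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_by_preference(front, weights):
    """Choose the front member with the lowest normalized weighted sum."""
    lows = [min(col) for col in zip(*front)]
    highs = [max(col) for col in zip(*front)]

    def score(p):
        total = 0.0
        for value, lo, hi, w in zip(p, lows, highs, weights):
            # Scale each objective to [0, 1] so units do not skew the sum.
            total += w * ((value - lo) / (hi - lo) if hi > lo else 0.0)
        return total

    return min(front, key=score)


# Hypothetical schedules: (makespan, energy, peak temperature).
schedules = [(10, 50, 70), (12, 40, 65), (11, 55, 72), (15, 38, 60), (12, 45, 66)]
front = pareto_front(schedules)
fastest = pick_by_preference(front, weights=(1, 0, 0))
coolest = pick_by_preference(front, weights=(0, 0, 1))
```

Here two of the five candidates are dominated and drop out, leaving three schedules that each represent a different trade-off among the objectives.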

Journal of Lightwave Technology | 2014

A Novel Two-Layer Passive Optical Interconnection Network for On-Chip Communication

Ke Chen; Huaxi Gu; Yintang Yang; Dongrui Fan

The passive optical interconnection network (OIN) plays a key role in optical Network-on-Chip (ONoC) architecture. Existing passive OINs based on wavelength division multiplexing (WDM) are popularly employed. However, the scalability of these passive OINs is limited by the number of wavelengths and the large insertion loss induced by the waveguide crossings. In this paper, we propose a novel Passive OIN based on two-layer architecture, POINT, for ONoC architecture. POINT leverages space division multiplexing (SDM) to assist WDM in eliminating blocking. The inter-layer communication in POINT relies on the inter-layer coupler, which helps reduce crossing losses. POINT features a modular and scalable design, in which the proposed SDM-based cell (SBC) is used as the basic building block to construct POINT with efficient wavelength assignment. Furthermore, SBCs of different sizes provide different options for constructing POINT. Comparisons with existing passive OINs confirm that POINT can provide an optimal choice with the balance between the number of wavelengths, area overhead, and insertion loss for the same size.

Journal of Lightwave Technology | 2014

A Hierarchical Optical Network-On-Chip Using Central-Controlled Subnet and Wavelength Assignment

Zheng Chen; Huaxi Gu; Yintang Yang; Dongrui Fan

Optical network-on-chip (ONoC) is a promising alternative to serve as the fundamental architecture for future many-core systems. However, several problems of ONoC, such as power consumption, arbitration overhead, and device cost, pose many limitations to the architecture design. In this paper, a novel hierarchical ONoC structure named CWNoC is proposed, which is a 256-core architecture composed of multiple central-controlled subnets. It reduces the network complexity by dividing the whole network into several subnets and lowers the arbitration overhead by adopting centralized arbitration logic in each subnet. An efficient wavelength assignment method, making full use of broadband microring resonators, is also employed in CWNoC, which simplifies the optical layer and reduces the possibility of contention. The simulation results show that CWNoC achieves better latency and power consumption. For example, when low and medium load is applied, the latency reduction can be as much as 40 ns compared with WANoC, while the total power consumption is reduced by 70%.
