Network


Latest external collaborations at the country level. Click on a dot to dive into the details.

Hotspot


Dive into the research topics where Fenglong Song is active.

Publication


Featured research published by Fenglong Song.


Journal of Computer Science and Technology | 2009

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dongrui Fan; Nan Yuan; Junchao Zhang; Yongbin Zhou; Wei Lin; Fenglong Song; Xiaochun Ye; He Huang; Lei Yu; Guoping Long; Hao Zhang; Lei Liu

Moore’s law will grant computer architects ever more transistors for the foreseeable future; the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data-transfer agents, and hardware-supported synchronization mechanisms to exploit the full potential of on-chip resources. On the other hand, it features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software co-design methodology bridges high-end computing and mainstream programmers. Experimental evaluations conducted on a cycle-accurate simulator of Godson-T show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
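The abstract mentions a Pthreads-like programming model with hardware-supported synchronization but shows no code. As a rough analogy only, a barrier-synchronized parallel reduction in that style might look like the following sketch; Python's `threading` module stands in for the hardware primitives, and the core count and problem size are illustrative assumptions:

```python
import threading

NUM_CORES = 4          # stand-in for Godson-T cores (hypothetical count)
N = 8                  # toy problem size
data = list(range(N))
partial = [0] * NUM_CORES
barrier = threading.Barrier(NUM_CORES)   # software analogue of a hardware barrier
result = [0]

def worker(tid):
    # Phase 1: each "core" reduces its own slice of the data.
    chunk = N // NUM_CORES
    partial[tid] = sum(data[tid * chunk:(tid + 1) * chunk])
    # Barrier: wait until every core has finished phase 1.
    barrier.wait()
    # Phase 2: one core combines the partial sums.
    if tid == 0:
        result[0] = sum(partial)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# result[0] == sum(range(8)) == 28
```

The two-phase structure (local work, barrier, combine) is the generic pattern such a programming model supports; the paper's actual API is not shown in the source.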


software engineering, artificial intelligence, networking and parallel/distributed computing | 2012

Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU

Weizhi Xu; Hao Zhang; Shuai Jiao; Da Wang; Fenglong Song; Zhiyong Liu

Tuning the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of its irregularity. In this paper, we propose a cache-blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, each stored in CSR format. With this blocking method, the corresponding part of the vector x can be reused in the GPU cache, greatly reducing the time spent accessing global memory for x. Experimental results on a GeForce GTX 480 show that the SpMV kernel with cache blocking is up to 5x faster than the unblocked CSR kernel.
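The blocking idea above, partitioning the matrix into column blocks so a slice of x stays cache-resident, can be sketched on the CPU as follows. This is a minimal illustration of the technique, not the paper's GPU kernel; the block width and matrices are assumptions:

```python
BW = 2  # block width: how many entries of x one "cache-resident" block covers

def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays (values, cols, row pointers)."""
    vals, cols, ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))
    return vals, cols, ptr

def spmv_blocked(dense, x):
    n = len(dense)
    y = [0.0] * n
    # Process one column block at a time, so the x[j0:j0+BW] slice is reused
    # across all rows before moving on -- the cache-blocking idea.
    for j0 in range(0, len(x), BW):
        block = [row[j0:j0 + BW] for row in dense]
        vals, cols, ptr = to_csr(block)       # each sub-block kept in CSR format
        for i in range(n):
            for k in range(ptr[i], ptr[i + 1]):
                y[i] += vals[k] * x[j0 + cols[k]]
    return y

A = [[1, 0, 2, 0],
     [0, 3, 0, 4],
     [5, 0, 0, 6]]
x = [1.0, 2.0, 3.0, 4.0]
print(spmv_blocked(A, x))  # same result as an unblocked SpMV: [7.0, 22.0, 29.0]
```

On a GPU the payoff comes from the x slice staying in cache across many rows; the sketch only shows the traversal order that makes that reuse possible.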


international symposium on microarchitecture | 2012

Godson-T: An Efficient Many-Core Processor Exploring Thread-Level Parallelism

Dongrui Fan; Hao Zhang; Da Wang; Xiaochun Ye; Fenglong Song; Guojie Li; Ninghui Sun

Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It provides several features to achieve high on-chip resource utilization, such as a region-based cache coherence protocol, data-transfer agents, and hardware-supported synchronization mechanisms. It also features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable.


international conference on software engineering | 2009

Study on Fine-Grained Synchronization in Many-Core Architecture

Lei Yu; Zhiyong Liu; Dongrui Fan; Fenglong Song; Junchao Zhang; Nan Yuan

Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization brings significant overhead and is therefore unsuitable, whereas the overhead of fine-grained synchronization remains small. For many-core architectures that support fine-grained synchronization with on-chip storage, we propose fine-grained synchronization algorithms for two scientific computation applications: 2-D wavefront and LU decomposition. First, an efficient data-allocation method is proposed according to the memory-access mode. Then, the thread-partitioning and synchronization schemes are discussed. Finally, we evaluate the two algorithms on the Godson-T many-core architecture. Experimental results show that the relative speedup is almost linear and the execution time is only 53.2% of that of coarse-grained synchronization. After the global barriers are eliminated, LU decomposition achieves a 13.1% performance improvement. Moreover, the experiments demonstrate that the fine-grained mechanism improves processor performance and scales well.
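The 2-D wavefront pattern replaces per-diagonal global barriers with per-cell signaling: cell (i, j) can start as soon as its north and west neighbours are done. A minimal sketch of that fine-grained scheme, assuming one thread per row and a toy stencil (the grid size and update rule are illustrative, not from the paper):

```python
import threading

N = 4
grid = [[0] * N for _ in range(N)]
# One event per cell instead of a global barrier per diagonal: this is the
# fine-grained alternative to coarse-grained synchronization.
done = [[threading.Event() for _ in range(N)] for _ in range(N)]

def row_worker(i):
    for j in range(N):
        if i > 0:
            done[i - 1][j].wait()      # wait only on the north neighbour
        west = grid[i][j - 1] if j > 0 else 0   # same thread wrote this cell
        north = grid[i - 1][j] if i > 0 else 0
        grid[i][j] = west + north + 1  # toy stencil computation
        done[i][j].set()               # signal dependents immediately

threads = [threading.Thread(target=row_worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each cell signals the moment it finishes, row i + 1 can begin long before row i completes, which is where the near-linear speedup over barrier-based synchronization comes from.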


european conference on parallel processing | 2008

A Performance Model of Dense Matrix Operations on Many-Core Architectures

Guoping Long; Dongrui Fan; Junchao Zhang; Fenglong Song; Nan Yuan; Wei Lin

Current many-core architectures (MCA) have a much larger arithmetic-to-memory-bandwidth ratio than traditional processors (vector, superscalar, multi-core, etc.). As a result, bandwidth has become an important performance bottleneck of MCA. Previous works have demonstrated promising performance of MCA for dense matrix operations, but there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth B and the on-chip SRAM capacity C, while the output is the maximum core number P_max. We show that P_max = Θ(B√C); that is, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores does not exceed P_max. The model is validated by comparing the theoretical performance with experimental data from previous works.


international conference on parallel and distributed systems | 2012

Auto-Tuning GEMV on Many-Core GPU

Weizhi Xu; Zhiyong Liu; Jun Wu; Xiaochun Ye; Shuai Jiao; Da Wang; Fenglong Song; Dongrui Fan

GPUs provide powerful computing ability, especially for data-parallel algorithms. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different parallel algorithms or optimization methods often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are inefficient, especially for small or fat matrices (a fat matrix has few rows and many columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using a single thread, we use a warp to compute each element of the vector y. We also propose a novel register-blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared with CUBLAS 4.0 and MAGMA, and that the register-blocking method also outperforms CUBLAS 4.0 and MAGMA for large square matrices. We further propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.


Journal of Parallel and Distributed Computing | 2013

Scalability study of molecular dynamics simulation on Godson-T many-core architecture

Liu Peng; Guangming Tan; Rajiv K. Kalia; Aiichiro Nakano; Priya Vashishta; Dongrui Fan; Hao Zhang; Fenglong Song

Molecular dynamics (MD) simulation has broad applications, and an increasing amount of computing power is needed to satisfy large-scale, real-world simulations. The advent of the many-core paradigm brings unprecedented computing power, but harvesting it remains a great challenge due to MD's irregular memory-access pattern. To address this challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on Godson-T-like many-core architectures. First, a preprocessing approach leveraging an adaptive divide-and-conquer framework is designed to exploit locality through a memory hierarchy with software-controlled memory. Then three incremental optimization strategies are proposed to enhance on-chip parallelism for the Godson-T many-core processor: a novel data layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide latency to shared memory. Experiments on a Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by a field-programmable gate array emulator. The performance per watt of MD on Godson-T is also much higher than on a 16-core Intel Core i7 symmetric multiprocessor (SMP), and 26 times higher than on an 8-core, 64-thread Sun T2 processor. Detailed analysis shows that the optimizations utilizing architectural features to maximize data locality and enhance data reuse benefit scalability most. Furthermore, a hierarchical parallelization scheme is designed to map the MD algorithm onto a Godson-T many-core cluster, and a simple performance model is derived which suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found essential for these optimizations, which could guide future hardware development.


international symposium on parallel and distributed processing and applications | 2009

A Synchronization-Based Alternative to Directory Protocol

He Huang; Lei Liu; Nan Yuan; Wei Lin; Fenglong Song; Junchao Zhang; Dongrui Fan

Efficient support for cache coherence is extremely important when designing and implementing many-core processors. In this paper, we propose a synchronization-based coherence (SBC) protocol to efficiently support cache coherence for shared-memory many-core architectures. The unique feature of our scheme is that it uses no directory at all. Inspired by the scope-consistency memory model, our protocol maintains coherence at synchronization points. Within a critical section, processor cores record write-sets (which lines have been written in the critical section) using a Bloom filter. When a core releases the lock, the write-set is transferred to a synchronization manager; when another core acquires the same lock, it obtains the write-set from the synchronization manager and invalidates stale data in its local cache. Experimental results show that SBC outperforms a directory-based protocol by an average of 5% in execution time across a suite of scientific applications. At the same time, SBC is more cost-effective than a directory-based protocol, which requires a large amount of hardware resources and a huge design-verification effort.


computer and information technology | 2009

Design of New Hash Mapping Functions

Fenglong Song; Zhiyong Liu; Dongrui Fan; Junchao Zhang; Lei Yu; Nan Yuan; Wei Lin
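The synchronization-based coherence (SBC) scheme records write-sets with a Bloom filter during a critical section and invalidates possibly-stale lines at the next lock acquire. A toy sketch of that mechanism follows; the hash functions, filter size, and cache-line addresses are illustrative assumptions, not taken from the paper:

```python
M = 64  # Bloom filter size in bits (illustrative)

def bloom_add(filt, line):
    """Set the filter bits for one written cache line (two toy hash functions)."""
    for seed in (0, 1):
        filt |= 1 << (hash((seed, line)) % M)
    return filt

def bloom_maybe_contains(filt, line):
    """True if the line MAY be in the write-set (false positives possible)."""
    return all(filt & (1 << (hash((seed, line)) % M)) for seed in (0, 1))

# Core A writes lines 0x10 and 0x20 inside its critical section.
write_set = 0
for line in (0x10, 0x20):
    write_set = bloom_add(write_set, line)

# On lock release, write_set goes to the synchronization manager; on acquire,
# core B fetches it and invalidates any cached line the filter may contain.
cache_b = {0x10: "stale", 0x30: "clean"}
for line in list(cache_b):
    if bloom_maybe_contains(write_set, line):
        del cache_b[line]   # invalidate possibly-stale data
# 0x10 is always invalidated; 0x30 survives unless a (rare) false positive collides.
```

The filter trades a few spurious invalidations for the elimination of per-line directory state, which is the cost-effectiveness argument the abstract makes.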

Collaboration


Dive into Fenglong Song's collaborations.

Top Co-Authors

- Dongrui Fan, Chinese Academy of Sciences
- Zhiyong Liu, Chinese Academy of Sciences
- Hao Zhang, Chinese Academy of Sciences
- Xiaochun Ye, Chinese Academy of Sciences
- Da Wang, Chinese Academy of Sciences
- Junchao Zhang, Chinese Academy of Sciences
- Weizhi Xu, Chinese Academy of Sciences
- Lei Yu, Chinese Academy of Sciences
- Nan Yuan, Chinese Academy of Sciences
- Shibin Tang, Chinese Academy of Sciences