Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Guoping Long is active.

Publication


Featured research published by Guoping Long.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

StreamScan: fast scan algorithms for GPUs without global barrier synchronization

Shengen Yan; Guoping Long; Yunquan Zhang

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state of the art in GPU-based scan implementations consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (where N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implementing scan on GPUs with only one computation phase. The main idea is to restrict synchronization to adjacent workgroups only, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance: thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework that searches the parameter space automatically to generate highly optimized code for both AMD and NVIDIA GPUs. We implemented our technique in OpenCL. Compared with previous fast scan implementations, experimental results not only show promising speedups but also reveal dramatically different optimization tradeoffs between NVIDIA and AMD GPU platforms.
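The single-phase idea is easiest to see serialized: each workgroup scans its own tile and then waits only for its immediate predecessor's running total, not a global barrier. A minimal C sketch of this chained pattern, with the workgroup loop serialized for clarity (names and sizes are illustrative; a real GPU implementation runs the outer loop as concurrent workgroups synchronizing through flagged values in global memory):

```c
#include <stdio.h>

#define N 16      /* total elements (illustrative size) */
#define TILE 4    /* elements per "workgroup" */

/* Chained inclusive scan: each tile is scanned locally, then offset by
 * the running sum published by the previous tile -- the only cross-tile
 * dependence, which is what lets a StreamScan-style kernel drop global
 * barriers. Each element is read once and written once: 2N accesses. */
static void chained_scan(const int *in, int *out, int n) {
    int carry = 0;                         /* predecessor's inclusive total */
    for (int t = 0; t < n / TILE; ++t) {   /* one "workgroup" per tile */
        int base = t * TILE;
        int running = carry;
        for (int i = 0; i < TILE; ++i) {   /* local scan of the tile */
            running += in[base + i];
            out[base + i] = running;
        }
        carry = running;                   /* publish total to next tile */
    }
}

int main(void) {
    int in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = 1; /* scan of all-ones = 1..N */
    chained_scan(in, out, N);
    for (int i = 0; i < N; ++i) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```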


Journal of Computer Science and Technology | 2009

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dongrui Fan; Nan Yuan; Junchao Zhang; Yongbin Zhou; Wei Lin; Fenglong Song; Xiaochun Ye; He Huang; Lei Yu; Guoping Long; Hao Zhang; Lei Liu

Moore’s law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms to unlock highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software co-design methodology bridges high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
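The paper does not spell out the API, but a "Pthreads-like programming model" implies fork/join code along these lines. This is a plain POSIX-threads sketch of the style of program such a runtime would map onto hardware threads, not the Godson-T API itself (all names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define CHUNK 1000

static int data[NTHREADS * CHUNK];
static long partial[NTHREADS];

/* Each worker reduces one strip of the array -- the kind of join-heavy
 * pattern the hardware synchronization mechanisms are meant to speed up. */
static void *worker(void *arg) {
    long id = (long)arg, sum = 0;
    for (int i = (int)id * CHUNK; i < (int)(id + 1) * CHUNK; ++i)
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS * CHUNK; ++i) data[i] = 1;
    for (long t = 0; t < NTHREADS; ++t)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    long total = 0;
    for (long t = 0; t < NTHREADS; ++t) {
        pthread_join(tid[t], NULL);   /* join point / synchronization */
        total += partial[t];
    }
    printf("total = %ld\n", total);   /* expect 4000 */
    return 0;
}
```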


Journal of Computer Science and Technology | 2013

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Yan Li; Yunquan Zhang; Yi-Qun Liu; Guoping Long; Haipeng Jia

Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of computing a Fourier transform. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library runs 1.5x to 28x faster than FFTW with 4 threads and achieves a 20.01x average speedup over CUFFT 4.0 on a Tesla C2050.
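At the core of any such auto-tuner is a search over candidate kernel parameters, timing each and keeping the best. A generic C skeleton of that loop (the two knobs, the candidate list, and the cost function are illustrative assumptions, not MPFFT's actual search space):

```c
#include <stdio.h>

/* One candidate kernel configuration: a radix choice and a workgroup
 * size. Real FFT auto-tuners search a larger space -- radix plans,
 * tile sizes, memory layouts; two knobs keep the sketch short. */
typedef struct { int radix; int wg_size; } Config;

/* Stand-in for "build the kernel with this config and time one run".
 * A real tuner would compile and launch an OpenCL kernel; a toy cost
 * model keeps the skeleton runnable anywhere. */
static double benchmark(Config c, int n) {
    return (double)n / c.radix + 4096.0 / c.wg_size;   /* fake cost */
}

int main(void) {
    const Config candidates[] = {
        {2, 64}, {2, 128}, {4, 64}, {4, 128}, {8, 256},
    };
    const int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    const int n = 1 << 20;                 /* FFT size being tuned for */

    Config best = candidates[0];
    double best_cost = benchmark(best, n);
    for (int i = 1; i < ncand; ++i) {      /* exhaustive search, keep best */
        double cost = benchmark(candidates[i], n);
        if (cost < best_cost) { best_cost = cost; best = candidates[i]; }
    }
    printf("best config: radix=%d, workgroup=%d (cost %.1f)\n",
           best.radix, best.wg_size, best_cost);
    return 0;
}
```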


International Conference on Parallel Processing | 2012

GPURoofline: a model for guiding performance optimizations on GPUs

Haipeng Jia; Yunquan Zhang; Guoping Long; Jianliang Xu; Shengen Yan; Yan Li

Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates this already difficult problem. This paper presents GPURoofline, an empirical model for guiding performance optimizations on GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high performance GPU kernels. The model addresses this problem by exploring potential performance bottlenecks and evaluating whether specific optimization techniques bring any performance improvement. To demonstrate the usage of the model, we optimize four representative kernels with different computation densities, namely matrix transpose, Laplace transform, integral, and face detection, on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of GPURoofline, these kernels achieve speedups of 3.74x to 14.8x over their naive implementations on both NVIDIA and AMD GPU platforms.
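The roofline idea underlying the model reduces to one formula: attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity. A minimal C sketch (the hardware numbers and kernel intensities are made-up placeholders, not measurements from the paper):

```c
#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak_gflops, bandwidth * intensity).
 * A kernel left of the ridge point is bandwidth-bound; right of it,
 * compute-bound -- which tells you which optimizations can pay off. */
static double roofline(double peak_gflops, double bw_gbs, double intensity) {
    double mem_bound = bw_gbs * intensity;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    const double peak = 1000.0;   /* placeholder peak compute, GFLOP/s */
    const double bw   = 150.0;    /* placeholder memory bandwidth, GB/s */

    /* Ridge point: intensity at which the two ceilings meet. */
    printf("ridge point: %.2f flops/byte\n", peak / bw);

    const double kernels[] = { 0.25, 2.0, 16.0 };   /* flops per byte */
    for (int i = 0; i < 3; ++i)
        printf("intensity %5.2f -> attainable %6.1f GFLOP/s\n",
               kernels[i], roofline(peak, bw, kernels[i]));
    return 0;
}
```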


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

Architectural support for cilk computations on many-core architectures

Guoping Long; Dongrui Fan; Junchao Zhang

Future generations of high performance processors have the potential to integrate tens to hundreds of processing cores in a single chip. In recent years, many-core architectures [1] have been proposed as a promising platform for exploiting massive parallelism. Although previous works have demonstrated encouraging performance potential, there is still limited consensus on how to program many-core architectures. In this work, we propose architectural support for Cilk computations [3, 4] on the Godson-T V3 architecture [2]. Our design has two hardware components. One component comprises the cache control mechanisms necessary to support ScC-based multi-threaded programming; we choose the ScC model in order to design more scalable cache consistency protocols. In addition, we propose the coherence vector, a hardware structure with related programming interfaces, to further improve the programmability of ScC. With this architectural support, we port the sequential-consistency-based multi-threaded Cilk runtime system (Cilk-5.4.6) smoothly onto our many-core architecture model, which implements ScC. The other component is architectural support for DAG consistency, which is important to relieve application programmers from worrying about cache consistency issues themselves. We make two contributions: (1) we propose a set of architectural mechanisms to support the Cilk programming model, and show that it is possible to balance two conflicting goals, programmability and scalable cache consistency protocols, which is important for MCA; (2) experimental results reveal two fundamental reasons that limit the performance scalability of MCA: unbalanced on-chip network bandwidth usage and limited memory bandwidth.
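For context, the Cilk model the hardware targets expresses parallelism with spawn/sync, and the DAG those keywords induce is exactly what DAG consistency defines memory semantics over. The classic example, written here in OpenCilk's C syntax (requires an OpenCilk-enabled compiler, e.g. clang with -fopencilk; this is standard Cilk, not code from the paper):

```c
#include <cilk/cilk.h>
#include <stdio.h>

/* Each spawned call may run as a separate task; cilk_sync is the join
 * point. Spawns and syncs form the computation DAG over which DAG
 * consistency defines what memory values a read may observe. */
long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);   /* child may execute in parallel */
    long y = fib(n - 2);              /* parent continues concurrently */
    cilk_sync;                        /* wait for the spawned child */
    return x + y;
}

int main(void) {
    printf("fib(30) = %ld\n", fib(30));   /* expect 832040 */
    return 0;
}
```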


European Conference on Parallel Processing | 2008

A Performance Model of Dense Matrix Operations on Many-Core Architectures

Guoping Long; Dongrui Fan; Junchao Zhang; Fenglong Song; Nan Yuan; Wei Lin

Current many-core architectures (MCA) have a much larger arithmetic-to-memory-bandwidth ratio than traditional processors (vector, superscalar, multi-core, etc.). As a result, bandwidth has become an important performance bottleneck of MCA. Previous works have demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$, while the output is the maximum core number $P_{max}$. We show that $P_{max} = \Theta(B\sqrt{C})$. $P_{max}$ indicates that when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores does not exceed $P_{max}$. The model is validated by a comparison between the theoretical performance and experimental data from previous works.
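A back-of-the-envelope derivation of why the bound scales as $B\sqrt{C}$, using the standard communication-cost argument for blocked matrix multiply (a sketch consistent with the abstract, not the paper's exact proof):

```latex
% Tiles of size b x b, three resident in SRAM of capacity C:
%   3b^2 <= C  =>  b = \Theta(\sqrt{C}).
% A tile multiply performs \Theta(b^3) flops per \Theta(b^2) words of
% off-chip traffic, i.e. \Theta(\sqrt{C}) flops per word moved. With
% total bandwidth B and a fixed per-core flop rate f, keeping P cores
% busy requires
\[
  P \cdot \frac{f}{\Theta(\sqrt{C})} \;\le\; B
  \quad\Longrightarrow\quad
  P_{\max} \;=\; \Theta\bigl(B\,\sqrt{C}\bigr).
\]
```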


International Conference on Parallel Processing | 2011

CRSD: application specific auto-tuning of SpMV for diagonal sparse matrices

Xiangzheng Sun; Yunquan Zhang; Ting Wang; Guoping Long; Xianyi Zhang; Yan Li

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in scientific applications. Its performance depends heavily on the nonzero distribution of the sparse matrix. In this paper, we propose a new storage format for diagonal sparse matrices, called Compressed Row Segment with Diagonal-pattern (CRSD). We design diagonal patterns to represent the diagonal distribution. Because the diagonal distributions are similar among matrices from one application, some diagonal patterns remain unchanged. First, we sample one matrix to obtain the unchanged diagonal patterns. Next, optimal SpMV codelets are generated automatically for those diagonal patterns. Finally, we combine the generated codelets into the optimal SpMV implementation. In addition, the information collected during the auto-tuning process is utilized in the parallel implementation to achieve load balance. Experimental results demonstrate speedups of up to 2.37x (1.70x on average) over DIA and 4.60x (2.10x on average) over CSR under the same number of threads on two mainstream multi-core platforms.
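For reference, the DIA baseline that CRSD is compared against stores each occupied diagonal as a dense array plus a column offset, and SpMV then walks diagonals instead of rows. A minimal C sketch of diagonal-format SpMV (a generic DIA kernel on a toy tridiagonal matrix, not CRSD's pattern-specialized codelets):

```c
#include <stdio.h>

#define N 5        /* matrix dimension */
#define NDIAG 3    /* number of stored diagonals */

/* DIA storage: vals[d][i] is row i's entry on the diagonal with column
 * offset offs[d] (0 = main, -1 = sub-, +1 = super-diagonal); entries
 * falling outside the matrix are stored as 0. CRSD refines this by
 * grouping diagonals into application-specific patterns and generating
 * a specialized kernel ("codelet") for each pattern. */
static const int    offs[NDIAG] = { -1, 0, 1 };
static const double vals[NDIAG][N] = {
    { 0, 1, 1, 1, 1 },   /* sub-diagonal   */
    { 2, 2, 2, 2, 2 },   /* main diagonal  */
    { 3, 3, 3, 3, 0 },   /* super-diagonal */
};

static void dia_spmv(const double *x, double *y) {
    for (int i = 0; i < N; ++i) y[i] = 0.0;
    for (int d = 0; d < NDIAG; ++d)           /* one pass per diagonal */
        for (int i = 0; i < N; ++i) {
            int j = i + offs[d];              /* column index */
            if (j >= 0 && j < N)
                y[i] += vals[d][i] * x[j];
        }
}

int main(void) {
    double x[N] = { 1, 1, 1, 1, 1 }, y[N];
    dia_spmv(x, y);
    for (int i = 0; i < N; ++i) printf("%.0f ", y[i]);  /* 5 6 6 6 3 */
    printf("\n");
    return 0;
}
```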


European Conference on Parallel Processing | 2009

Characterizing and Understanding the Bandwidth Behavior of Workloads on Multi-core Processors

Guoping Long; Dongrui Fan; Junchao Zhang

An important issue for current multi-core processors is off-chip bandwidth sharing. Sharing helps improve resource utilization, but more importantly, it may cause performance degradation due to contention. However, there is not enough research on characterizing workloads from the bandwidth perspective, and the understanding of the impact of the bandwidth constraint on performance is still limited. In this paper, we propose the phase execution model and evaluate the arithmetic-to-memory ratio (AMR) of each phase to characterize the bandwidth requirements of arbitrary programs. We apply the model to a set of SPEC benchmark programs and obtain two results. First, we propose a new taxonomy of workloads based on their bandwidth requirements. Second, we find that prefetching techniques improve the system throughput of multi-core processors only when there is enough spare memory bandwidth.
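The characterization boils down to comparing each phase's arithmetic-to-memory ratio against the machine's balance point. A toy C sketch of that classification (the phases, their flop/byte counts, and the balance value are invented placeholders):

```c
#include <stdio.h>

/* A program phase characterized by its arithmetic-to-memory ratio:
 * flops executed per byte of off-chip traffic. */
typedef struct { const char *name; double flops; double bytes; } Phase;

int main(void) {
    /* Machine balance: peak flop rate divided by peak bandwidth.
     * A phase with AMR below this saturates the memory bus. */
    const double balance = 8.0;   /* flops per byte, placeholder */

    const Phase phases[] = {
        { "init",    1.0e9,  4.0e9 },   /* streaming: AMR = 0.25 */
        { "solve",  64.0e9,  2.0e9 },   /* compute-heavy: AMR = 32 */
        { "reduce",  2.0e9,  1.0e9 },   /* AMR = 2 */
    };
    for (int i = 0; i < 3; ++i) {
        double amr = phases[i].flops / phases[i].bytes;
        printf("%-6s AMR=%6.2f -> %s\n", phases[i].name, amr,
               amr < balance ? "bandwidth-bound" : "compute-bound");
    }
    return 0;
}
```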


International Conference on Multimedia Retrieval | 2016

GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring

Wenjing Ma; Liangliang Cao; Lei Yu; Guoping Long; Yucheng Li



Web Information Systems Engineering | 2016

Bridging Semantic Gap Between App Names: Collective Matrix Factorization for Similar Mobile App Recommendation

Ning Bu; Shuzi Niu; Lei Yu; Wenjing Ma; Guoping Long


Collaboration


Dive into Guoping Long's collaborations.

Top Co-Authors

Dongrui Fan (Chinese Academy of Sciences)
Junchao Zhang (Chinese Academy of Sciences)
Yunquan Zhang (Chinese Academy of Sciences)
Nan Yuan (Chinese Academy of Sciences)
Wenjing Ma (Chinese Academy of Sciences)
Lei Yu (Chinese Academy of Sciences)
Shengen Yan (Chinese Academy of Sciences)
Yucheng Li (Chinese Academy of Sciences)
Changying Du (Chinese Academy of Sciences)
Haipeng Jia (Chinese Academy of Sciences)