Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jianbin Fang is active.

Publication


Featured research published by Jianbin Fang.


Computing | 2017

LU factorization on heterogeneous systems: an energy-efficient approach towards high performance

Cheng Chen; Jianbin Fang; Tao Tang; Canqun Yang

Dense lower–upper (LU) factorization (hereafter LU) is a critical kernel widely used to solve dense linear algebra problems. Hybrid LU algorithms have been carefully designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on the CPU cores and suffer from large volumes of data transfer over the PCIe bus, which reduces the overall energy efficiency of the entire system. In this paper, we provide a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of heavy computation and avoiding excessive PCIe transfers. To maintain performance, we pipeline the CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation competes with the highly optimized Intel MKL implementation in performance while overcoming its limitations in energy efficiency.
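
The core performance idea is overlapping data movement with computation. The paper targets Xeon Phi coprocessors on Tianhe-2, but the same overlap pattern can be sketched with CUDA streams; the sketch below is a minimal illustration under our own naming (scale_kernel, CHUNKS), not the paper's code.

    // Pipelining sketch: the H2D copy, kernel, and D2H copy of each chunk are
    // issued on their own stream, so the transfers of one chunk overlap with
    // computation on another; the same idea is used to hide PCIe traffic.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale_kernel(float* d, int n) {     // stand-in for real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int CHUNKS = 4, N = 1 << 20;              // 4 pipeline stages
        float *h, *d;
        cudaMallocHost(&h, CHUNKS * N * sizeof(float)); // pinned memory enables async copies
        cudaMalloc(&d, CHUNKS * N * sizeof(float));
        cudaStream_t s[CHUNKS];
        for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);
        for (int c = 0; c < CHUNKS; ++c) {
            float *hp = h + c * N, *dp = d + c * N;
            cudaMemcpyAsync(dp, hp, N * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            scale_kernel<<<(N + 255) / 256, 256, 0, s[c]>>>(dp, N);
            cudaMemcpyAsync(hp, dp, N * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();                        // drain the pipeline
        printf("done: h[0] = %f\n", h[0]);
        for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }

In an LU pipeline, the per-chunk kernel would be a panel or trailing-matrix update, with MPI communication overlapped analogously on the host side.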


Parallel Computing | 2018

Benchmarking the GPU memory at the warp level

Minquan Fang; Jianbin Fang; Weimin Zhang; Haifang Zhou; Jianxing Liao; Yuangang Wang

Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. Nonetheless, GPUs feature a hierarchical memory system, and optimizing code for it requires an in-depth understanding from programmers. To this end, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior work focuses on the latency seen by a single thread to disclose undocumented information. Such per-thread measurements cannot reflect how a program actually executes, because the smallest executable unit of parallelism on a GPU comprises 32 threads (a warp of threads). This motivates us to benchmark the GPU memory system at the warp level. In this paper, we benchmark the GPU memory system to quantify its capability for parallel accessing and broadcasting. The warp-level measurements are performed on shared memory, constant memory, global memory, and texture memory. Further, we discuss how to replace local memory with registers, how to avoid bank conflicts in shared memory, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for the different types of memory and build an optimization framework on GPU memories. Taking maximum noise fraction rotation in the dimension reduction of hyperspectral images as a case study, we demonstrate that our framework is applicable and effective. Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate the design of parallel algorithms and the modeling and optimization of GPU programs. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system.
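
To give a flavor of what warp-level benchmarking means, here is a minimal CUDA micro-benchmark sketch (ours, not the paper's code) that launches exactly one warp and times dependent shared-memory loads with the per-SM clock:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void warp_shared_latency(unsigned* cycles, int iters) {
        __shared__ int buf[32];
        int lane = threadIdx.x;          // exactly one warp (32 threads) is launched
        buf[lane] = (lane + 1) % 32;     // ring pattern for pointer chasing
        __syncwarp();
        int idx = lane;
        unsigned start = clock();
        for (int i = 0; i < iters; ++i)  // every lane issues one load per iteration;
            idx = buf[idx];              // the loads are dependent, exposing latency
        unsigned stop = clock();
        if (lane == 0) *cycles = (stop - start) / iters;
        if (idx == -1) *cycles = 0;      // fake use so the chase is not optimized away
    }

    int main() {
        unsigned *d, h;
        cudaMalloc(&d, sizeof(unsigned));
        warp_shared_latency<<<1, 32>>>(d, 1024);
        cudaMemcpy(&h, d, sizeof(unsigned), cudaMemcpyDeviceToHost);
        printf("approx. shared memory load latency: %u cycles per access\n", h);
        cudaFree(d);
        return 0;
    }

Running the same chase with a single active thread instead of a full warp is exactly the per-thread measurement the paper argues is unrepresentative.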


International Parallel and Distributed Processing Symposium | 2017

Efficient and Portable ALS Matrix Factorization for Recommender Systems

Jing Chen; Jianbin Fang; Weifeng Liu; Tao Tang; Xuhao Chen; Canqun Yang

Alternating least squares (ALS) has proved to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable ALS solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware. To achieve high performance, we apply a thread-batching technique and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to map it efficiently onto the underlying hardware. The experimental results show that our implementation runs 5.5× faster on a 16-core CPU and 21.2× faster on an NVIDIA K20c GPU than the baseline implementation. Our implementation also outperforms cuMF on various datasets.
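
For context, the standard ALS update (the abstract does not spell it out) fixes the item factors Y and solves an independent ridge-regression problem per user u over the set \Omega_u of items that u has rated:

    x_u \leftarrow \left( Y_{\Omega_u}^{\top} Y_{\Omega_u} + \lambda I \right)^{-1} Y_{\Omega_u}^{\top} r_{\Omega_u}

The item-side update is symmetric. Since every x_u is independent of the others, users can be processed in parallel, which is what makes batching threads across users effective on hierarchically organized hardware.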


Concurrency and Computation: Practice and Experience | 2017

Efficient and High-quality Sparse Graph Coloring on the GPU

Xuhao Chen; Pingfan Li; Jianbin Fang; Tao Tang; Zhiying Wang; Canqun Yang

Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed that leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work-efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach uses a speculative greedy scheme, which inherently yields better quality than methods based on finding a maximal independent set. To achieve high performance on GPUs, we refine the algorithm to leverage efficient operators and alleviate conflicts. We also incorporate common optimization techniques to further improve performance. Our method is evaluated with both synthetic and real-world sparse graphs on an NVIDIA GPU. Experimental results show that our implementation achieves an average 4.1× (up to 8.9×) speedup over the serial implementation. It also outperforms the existing GPU implementation from the NVIDIA CUSPARSE library (2.2× average speedup), while yielding much better coloring quality than CUSPARSE.
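
As a concrete illustration of speculative greedy coloring (our sketch, not the paper's code), each pass tentatively gives every active vertex the smallest color unused by its neighbors; racy reads of neighbor colors are tolerated by design, because a conflict-detection pass re-queues any vertex that tied with a lower-numbered neighbor:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void assign_colors(const int* row, const int* col, int* color,
                                  const char* active, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || !active[v]) return;
        unsigned forbidden = 0;                  // bitmask of colors 0..31 (sketch-only limit)
        for (int e = row[v]; e < row[v + 1]; ++e) {
            int c = color[col[e]];
            if (c >= 0 && c < 32) forbidden |= 1u << c;
        }
        color[v] = __ffs(~forbidden) - 1;        // smallest color absent from the mask
    }

    __global__ void detect_conflicts(const int* row, const int* col, const int* color,
                                     char* active, int* again, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || !active[v]) return;
        active[v] = 0;
        for (int e = row[v]; e < row[v + 1]; ++e) {
            int u = col[e];
            if (color[u] == color[v] && u < v) { // lower-id endpoint keeps its color
                active[v] = 1; *again = 1; break;
            }
        }
    }

    int main() {
        int h_row[] = {0, 2, 4, 6, 8};           // toy 4-cycle 0-1-2-3-0 in CSR form
        int h_col[] = {1, 3, 0, 2, 1, 3, 0, 2};
        const int n = 4;
        int *row, *col, *color, *again;
        char* active;
        cudaMalloc(&row, sizeof(h_row));  cudaMalloc(&col, sizeof(h_col));
        cudaMalloc(&color, n * sizeof(int)); cudaMalloc(&active, n);
        cudaMalloc(&again, sizeof(int));
        cudaMemcpy(row, h_row, sizeof(h_row), cudaMemcpyHostToDevice);
        cudaMemcpy(col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
        cudaMemset(color, 0xff, n * sizeof(int)); // every color starts at -1 (uncolored)
        cudaMemset(active, 1, n);
        int h_again = 1;
        while (h_again) {                         // iterate until no conflicts remain
            cudaMemset(again, 0, sizeof(int));
            assign_colors<<<(n + 255) / 256, 256>>>(row, col, color, active, n);
            detect_conflicts<<<(n + 255) / 256, 256>>>(row, col, color, active, again, n);
            cudaMemcpy(&h_again, again, sizeof(int), cudaMemcpyDeviceToHost);
        }
        int h_color[n];
        cudaMemcpy(h_color, color, sizeof(h_color), cudaMemcpyDeviceToHost);
        for (int v = 0; v < n; ++v) printf("vertex %d -> color %d\n", v, h_color[v]);
        return 0;
    }

Progress is guaranteed because the smallest-numbered active vertex always finalizes in each round; a production implementation replaces the 32-color bitmask with a forbidden-color array and adds worklist and conflict-alleviation optimizations of the kind the abstract mentions.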


International Parallel and Distributed Processing Symposium | 2016

High Performance Parallel Graph Coloring on GPGPUs

Pingfan Li; Xuhao Chen; Zhe Quan; Jianbin Fang; Huayou Su; Tao Tang; Canqun Yang

Graph coloring has been broadly used to discover concurrency in parallel computing, where vertices with the same color represent subtasks that can be processed simultaneously. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed that leverage the massive hardware resources of modern multicore CPUs and GPGPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a high-performance parallel graph coloring implementation on GPGPUs with good coloring quality. Our approach employs the speculative greedy algorithm, which usually yields better quality than methods based on maximal independent sets. To achieve high performance on GPGPUs, we adapt the algorithm to improve work efficiency and reduce overhead, and incorporate several optimization techniques that reduce memory access latency and atomic operation overhead. Our method is evaluated with both synthetic and real-world graphs on an NVIDIA GPU. Experimental results show that our proposed implementations outperform the sequential implementation (3.0× speedup) and the existing GPU implementation from the NVIDIA CUSPARSE library (1.5× speedup), while yielding coloring quality close to that of the sequential implementation.


Programming Models and Applications for Multicores and Manycores | 2017

High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs

Pingfan Li; Xuhao Chen; Jie Shen; Jianbin Fang; Tao Tang; Canqun Yang

Detecting strongly connected components (SCCs) is broadly used in many real-world applications. To speed up SCC detection for large-scale graphs, parallel algorithms have been proposed that leverage modern GPUs. Existing GPU implementations achieve speedups on synthetic graph instances but show limited performance when applied to large-scale real-world datasets. In this paper, we present a parallel SCC detection implementation on GPUs that achieves high performance on both synthetic and real-world graphs. We use a hybrid method that divides the algorithm into two phases. Our method dynamically changes parallelism strategies to maximize performance in each algorithm phase. We then orchestrate the graph traversal kernel with a customized strategy for each phase and employ algorithm extensions to handle the serialization caused by irregular graph properties. Our design is carefully implemented to take advantage of the GPU hardware. Evaluation with diverse graphs on an NVIDIA K20c GPU shows that our proposed implementation achieves an average speedup of 5.0× over the serial Tarjan's algorithm. It also outperforms the existing OpenMP implementation with a speedup of 1.4×.
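
The abstract does not name the two phases, but GPU SCC detectors are commonly built on the forward-backward (FW-BW) decomposition: the SCC containing a chosen pivot is the intersection of the pivot's forward and backward reachable sets. A minimal host-side sketch of that idea (illustrative only; a GPU would run the reachability as level-parallel BFS rather than sequential DFS):

    #include <cstdio>
    #include <vector>
    using Graph = std::vector<std::vector<int>>;

    // Mark every vertex reachable from s through vertices still "alive".
    void reach(const Graph& g, int s, const std::vector<char>& alive, std::vector<char>& vis) {
        std::vector<int> stack{s};
        vis[s] = 1;
        while (!stack.empty()) {
            int v = stack.back(); stack.pop_back();
            for (int u : g[v])
                if (alive[u] && !vis[u]) { vis[u] = 1; stack.push_back(u); }
        }
    }

    void fwbw(const Graph& g, const Graph& gt, std::vector<char>& alive) {
        int n = (int)g.size(), pivot = -1;
        for (int v = 0; v < n; ++v) if (alive[v]) { pivot = v; break; }
        if (pivot < 0) return;                      // nothing left to decompose
        std::vector<char> fw(n, 0), bw(n, 0);
        reach(g, pivot, alive, fw);                 // forward reachable set
        reach(gt, pivot, alive, bw);                // backward = forward in the transpose
        printf("SCC:");
        for (int v = 0; v < n; ++v)                 // the intersection is one SCC
            if (alive[v] && fw[v] && bw[v]) { printf(" %d", v); alive[v] = 0; }
        printf("\n");
        fwbw(g, gt, alive);                         // continue on the remainder
    }

    int main() {
        Graph g = {{1}, {2}, {0, 3}, {4}, {3}};     // toy graph: SCCs {0,1,2} and {3,4}
        Graph gt(g.size());
        for (int v = 0; v < (int)g.size(); ++v)
            for (int u : g[v]) gt[u].push_back(v);  // build the transposed graph
        std::vector<char> alive(g.size(), 1);
        fwbw(g, gt, alive);
        return 0;
    }

The full FW-BW algorithm additionally recurses on the three disjoint remainders (forward-only, backward-only, and the rest) in parallel, since no SCC crosses them; that partitioning is one natural place for phase-specific parallelism strategies like those the abstract describes.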


Computing Frontiers | 2017

High Performance Coordinate Descent Matrix Factorization for Recommender Systems

Xi Yang; Jianbin Fang; Jing Chen; Chengkun Wu; Tao Tang; Kai Lu

Coordinate descent (CD) has proved to be an effective technique for matrix factorization (MF) in recommender systems. To speed up factorization, various parallel CDMF implementations have been proposed to leverage modern multi-core CPUs and many-core GPUs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable CDMF solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware and of the data variance in the rating matrix. We therefore apply thread-batching and load-balancing techniques to achieve high performance. On the other hand, we implement the CDMF solver in OpenCL so that it can run on various platforms. Based on the architectural specifics, we customize code variants to map them efficiently onto the underlying hardware. The experimental results show that our implementation runs 2× faster on dual-socket Intel Xeon CPUs and 22× faster on an NVIDIA K20c GPU than the baseline implementations. Taking the CDMF solver as a benchmark, we observe that it runs 2.4× faster on the GPU than on the CPUs, while achieving competitive performance on the Intel MIC against the CPUs.
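
For context, the scalar coordinate-descent update used by CDMF-style solvers (e.g., CCD++; the abstract does not spell it out) has a closed form. With the item factors h_i held fixed, the t-th coordinate of user u's latent vector w_u is updated as

    w_{ut} \leftarrow \frac{ \sum_{i \in \Omega_u} \left( r_{ui} - w_u^{\top} h_i + w_{ut} h_{it} \right) h_{it} }{ \lambda + \sum_{i \in \Omega_u} h_{it}^{2} }

where \Omega_u is the set of items rated by user u. At a fixed coordinate t, the updates of different users are independent, which is what thread batching exploits, while the per-user sums vary with |\Omega_u|, which is why load balancing over the rating matrix matters.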


The Journal of Supercomputing | 2017

Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization

Cheng Chen; Yunfei Du; Ke Zuo; Jianbin Fang; Canqun Yang

Massively heterogeneous architectures are widely adopted in the design of modern peta-scale and future exa-scale systems. In such heterogeneous clusters, due to the increasing number of components involved, it is essential to enable fault tolerance to improve the reliability of the whole system. However, existing programming models for heterogeneous clusters (e.g., MPI …


International Parallel and Distributed Processing Symposium | 2016

Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform

Zhaokui Li; Jianbin Fang; Tao Tang; Xuhao Chen; Cheng Chen; Canqun Yang


Parallel Processing Letters | 2016

Evaluating Multiple Streams on Heterogeneous Platforms

Jianbin Fang; Peng Zhang; Zhaokui Li; Tao Tang; Xuhao Chen; Cheng Chen; Canqun Yang


Collaboration


Dive into Jianbin Fang's collaborations.

Top Co-Authors

Canqun Yang
National University of Defense Technology

Tao Tang
National University of Defense Technology

Xuhao Chen
National University of Defense Technology

Cheng Chen
National University of Defense Technology

Peng Zhang
National University of Defense Technology

Jing Chen
National University of Defense Technology

Pingfan Li
National University of Defense Technology

Zhaokui Li
National University of Defense Technology

Chun Huang
National University of Defense Technology