Shuai Jiao
Chinese Academy of Sciences
Publication
Featured research published by Shuai Jiao.
software engineering, artificial intelligence, networking and parallel/distributed computing | 2012
Weizhi Xu; Hao Zhang; Shuai Jiao; Da Wang; Fenglong Song; Zhiyong Liu
Tuning the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of the irregularity of the computation. In this paper, we propose a cache blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, which are stored in CSR format. With this blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experimental results on a GeForce GTX 480 show that the SpMV kernel with cache blocking is up to 5x faster than the unblocked CSR kernel.
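The cache blocking scheme in this abstract can be pictured with a minimal CUDA kernel. The sketch below is an illustration under my own assumptions, not the paper's code: the matrix is split into column blocks stored in CSR, and the slice of x touched by one column block is staged in shared memory (the paper relies on the Fermi hardware cache instead); names such as spmv_col_block and X_TILE are made up for the example.

// Minimal sketch (not the paper's kernel): SpMV over one column block of the
// partitioned matrix, whose column indices fall in [col_start, col_start + X_TILE).
// The slice of x touched by this block is staged in shared memory so that
// every row of the block reuses it on chip instead of re-reading global memory.
#define X_TILE 1024   // assumed width of the cached x window

__global__ void spmv_col_block(int nrows, int ncols,
                               const int *row_ptr, const int *col_idx,
                               const float *val, const float *x,
                               float *y, int col_start)
{
    __shared__ float x_tile[X_TILE];

    // Cooperatively load the x window for this column block.
    for (int i = threadIdx.x; i < X_TILE; i += blockDim.x)
        x_tile[i] = (col_start + i < ncols) ? x[col_start + i] : 0.0f;
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= nrows) return;

    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x_tile[col_idx[k] - col_start];

    // Each column block contributes a partial result to y[row].
    atomicAdd(&y[row], sum);
}
// Host side (assumed): launch once per column block, passing that block's
// CSR arrays and its starting column.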
computer and information technology | 2012
Haiquan Xiong; Zhiyong Liu; Weizhi Xu; Shuai Jiao
The semantic gap is one of the most important problems in virtualized computer systems. Solving it not only helps the development of security and virtual machine monitoring applications, but also benefits VMM resource management and VMM-based service implementation. In this paper, we first review the general architecture of virtualized computer systems, especially the interaction between the Guest OS and the VMM, so as to better understand the causes of the semantic gap. We then consider how to build a library that integrates the commonalities of different VMMs and Guest OSes and is general enough to facilitate the development of the applications mentioned above. To illustrate this more clearly, we select Libvmi as a case study, elaborating its design philosophy and implementation. Finally, we describe how to use Libvmi through two application examples and verify its correctness and effectiveness.
international conference on parallel and distributed systems | 2012
Weizhi Xu; Zhiyong Liu; Jun Wu; Xiaochun Ye; Shuai Jiao; Da Wang; Fenglong Song; Dongrui Fan
GPUs provide powerful computing ability, especially for data-parallel algorithms. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are not efficient, especially for small matrices or fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using only one thread, we use a warp to compute an element of vector y. We also propose a novel register blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 and MAGMA, and that the new register blocking method also outperforms CUBLAS 4.0 and MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.
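As a rough illustration of the "one warp per element of y" strategy described above, the kernel below lets each warp stride across one row of a row-major dense m x n matrix A and reduce its partial sums, so a whole warp produces one element of y = A*x. It is a sketch under my own assumptions (row-major layout, blockDim.x a multiple of 32, the name gemv_warp_per_row), it omits the authors' register blocking, and the warp shuffle it uses needs a Kepler-or-newer GPU; a Fermi-era kernel like the paper's would reduce through shared memory instead.

// Sketch: one warp computes one element of y = A * x for a row-major m x n A.
__global__ void gemv_warp_per_row(int m, int n,
                                  const float * __restrict__ A,
                                  const float * __restrict__ x,
                                  float *y)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x & 31;   // assumes blockDim.x % 32 == 0
    if (warp_id >= m) return;

    const float *row = A + (size_t)warp_id * n;
    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)      // the 32 lanes stride over one row
        sum += row[j] * x[j];

    // Warp-wide reduction; lane 0 ends up with the full dot product.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}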
international conference on parallel processing | 2012
Shuai Jiao; Paolo Ienne; Xiaochun Ye; Da Wang; Dongrui Fan; Ninghui Sun
This paper addresses workload partition strategies in the simulation of manycore architectures. The key observation behind this paper is that, compared to traditional multicores, manycores feature more non-uniform memory access and unpredictable network traffic; these features degrade the simulation speed and accuracy of Parallel Discrete Event Simulators (PDES) when static workload partition schemes are used. Based on this observation, we propose an adaptive workload partition method, Core/Router-Adaptive Workload Partition (CRAW/P). The method delivers higher speedup and accuracy than static partition schemes by partitioning the simulation of the on-chip network independently from that of the cores and by synchronizing them differently. Using a PDES simulator, we evaluate the performance of CRAW/P in simulating a 256-core general-purpose manycore processor. Running SPLASH-2 benchmark applications, the experimental results demonstrate that it delivers a speed improvement of 28% to 67% over static partition schemes and reduces timing errors to below 10% even in very relaxed simulation (quantum size of 64).
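The abstract gives only a high-level view of CRAW/P. The host-side sketch below (standard C++, no device code) shows one possible reading of the "adaptive workload partition" part: router models stay in their own partition, while core models are periodically re-assigned to simulation threads using recent per-core event counts as a load estimate. The greedy heuristic and every name here are assumptions for illustration, not the paper's implementation.

// Host-side sketch: re-partition simulated cores over host threads by
// recent event counts; routers keep a dedicated partition of their own.
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy longest-processing-time assignment of weighted items to bins.
std::vector<int> assign_to_threads(const std::vector<long>& load, int nthreads) {
    std::vector<int> order(load.size()), owner(load.size());
    std::vector<long> bin(nthreads, 0);
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });
    for (int i : order) {
        int t = (int)(std::min_element(bin.begin(), bin.end()) - bin.begin());
        owner[i] = t;
        bin[t] += load[i];
    }
    return owner;
}

int main() {
    // Recent event counts observed for 8 simulated cores (made-up numbers).
    std::vector<long> core_events = {120, 40, 300, 95, 10, 220, 75, 180};
    // Cores are spread over 3 host threads; the on-chip network keeps a
    // dedicated host thread and is synchronized with the cores separately.
    std::vector<int> core_owner = assign_to_threads(core_events, 3);
    for (int c = 0; c < (int)core_owner.size(); ++c)
        std::printf("core %d -> host thread %d\n", c, core_owner[c]);
    std::printf("routers -> dedicated host thread 3\n");
    return 0;
}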
Archive | 2010
Dongrui Fan; Shuai Jiao; Zhengmeng Lei; Weidong Xu; Hao Zhang
ieee international conference on high performance computing data and analytics | 2012
Shuai Jiao; Da Wang; Xiaochun Ye; Weizhi Xu; Hao Zhang; Ninghui Sun
World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering | 2012
Weizhi Xu; Zhiyong Liu; Dongrui Fan; Shuai Jiao; Xiaochun Ye; Fenglong Song; Chenggang Yan
Archive | 2012
Weizhi Xu; Shuai Jiao; Hao Zhang; Zhiyong Liu; Dongrui Fan; Zhengmeng Lei; Fenglong Song; Da Wang
Archive | 2012
Shuai Zhang; Shuai Jiao; Hao Zhang; Dongrui Fan; Haizhong Li
Chinese Journal of Computers | 2011
Shuai Jiao; Wei-Zhi Xu; Shibin Tang; Dongrui Fan; Ninghui Sun