Shuai Jiao
Chinese Academy of Sciences
Publication
Featured research published by Shuai Jiao.
software engineering, artificial intelligence, networking and parallel/distributed computing | 2012
Weizhi Xu; Hao Zhang; Shuai Jiao; Da Wang; Fenglong Song; Zhiyong Liu
Tuning the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of the irregularity of the computation. In this paper, we propose a cache blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, which are stored in CSR format. With this blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experimental results on a GeForce GTX 480 show that the SpMV kernel with cache blocking is up to 5x faster than the unblocked CSR kernel.
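The cache blocking scheme in this abstract can be pictured with a minimal CUDA kernel. The sketch below is an illustration under my own assumptions, not the paper's code: the matrix is split into column blocks stored in CSR, and the slice of x touched by one column block is staged in shared memory (the paper relies on the Fermi hardware cache instead); names such as spmv_col_block and X_TILE are made up for the example.

// Minimal sketch (not the paper's kernel): SpMV over one column block of the
// partitioned matrix, whose column indices fall in [col_start, col_start + X_TILE).
// The slice of x touched by this block is staged in shared memory so that
// every row of the block reuses it on chip instead of re-reading global memory.
#define X_TILE 1024   // assumed width of the cached x window

__global__ void spmv_col_block(int nrows, int ncols,
                               const int *row_ptr, const int *col_idx,
                               const float *val, const float *x,
                               float *y, int col_start)
{
    __shared__ float x_tile[X_TILE];

    // Cooperatively load the x window for this column block.
    for (int i = threadIdx.x; i < X_TILE; i += blockDim.x)
        x_tile[i] = (col_start + i < ncols) ? x[col_start + i] : 0.0f;
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= nrows) return;

    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x_tile[col_idx[k] - col_start];

    // Each column block contributes a partial result to y[row].
    atomicAdd(&y[row], sum);
}
// Host side (assumed): launch once per column block, passing that block's
// CSR arrays and its starting column.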
computer and information technology | 2012
Haiquan Xiong; Zhiyong Liu; Weizhi Xu; Shuai Jiao
The semantic gap is one of the most important problems in virtualized computer systems. Solving it not only helps the development of security and virtual machine monitoring applications, but also benefits VMM resource management and VMM-based service implementation. In this paper, we first review the general architecture of virtualized computer systems, especially the interaction between the Guest OS and the VMM, so as to better understand the causes of the semantic gap. We then consider how to build a library that integrates the commonalities of different VMMs and Guest OSes and is general enough to facilitate the development of the applications mentioned above. To illustrate this more clearly, we select Libvmi as a case study, elaborating its design philosophy and implementation. Finally, we describe how to use Libvmi through two application examples and verify its correctness and effectiveness.
international conference on parallel and distributed systems | 2012
Weizhi Xu; Zhiyong Liu; Jun Wu; Xiaochun Ye; Shuai Jiao; Da Wang; Fenglong Song; Dongrui Fan
GPUs provide powerful computing ability, especially for data-parallel algorithms. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are not efficient, especially for small matrices or fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using only one thread, we use a warp to compute an element of vector y. We also propose a novel register blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 and MAGMA, and that the new register blocking method also outperforms CUBLAS 4.0 and MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.
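As a rough illustration of the "one warp per element of y" strategy described above, the kernel below lets each warp stride across one row of a row-major dense m x n matrix A and reduce its partial sums, so a whole warp produces one element of y = A*x. It is a sketch under my own assumptions (row-major layout, blockDim.x a multiple of 32, the name gemv_warp_per_row), it omits the authors' register blocking, and the warp shuffle it uses needs a Kepler-or-newer GPU; a Fermi-era kernel like the paper's would reduce through shared memory instead.

// Sketch: one warp computes one element of y = A * x for a row-major m x n A.
__global__ void gemv_warp_per_row(int m, int n,
                                  const float * __restrict__ A,
                                  const float * __restrict__ x,
                                  float *y)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x & 31;   // assumes blockDim.x % 32 == 0
    if (warp_id >= m) return;

    const float *row = A + (size_t)warp_id * n;
    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)      // the 32 lanes stride over one row
        sum += row[j] * x[j];

    // Warp-wide reduction; lane 0 ends up with the full dot product.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}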
international conference on parallel processing | 2012
Shuai Jiao; Paolo Ienne; Xiaochun Ye; Da Wang; Dongrui Fan; Ninghui Sun
This paper addresses workload partition strategies in the simulation of manycore architectures. The key observation behind this paper is that, compared to traditional multicores, manycores feature more non-uniform memory access and unpredictable network traffic; these features degrade the simulation speed and accuracy of Parallel Discrete Event Simulators (PDES) when static workload partition schemes are used. Based on this observation, we propose an adaptive workload partition method, Core/Router-Adaptive Workload Partition (CRAW/P). The method delivers higher speedup and accuracy than static partition schemes by partitioning the simulation of the on-chip network independently from that of the cores and by synchronizing them differently. Using a PDES simulator, we evaluate the performance of CRAW/P in simulating a 256-core general-purpose manycore processor. Running SPLASH-2 benchmark applications, the experimental results demonstrate that it delivers a speed improvement of 28% to 67% over static partition schemes and reduces timing errors to below 10% even in very relaxed simulation (quantum size of 64).
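The abstract gives only a high-level view of CRAW/P. The host-side sketch below (standard C++, no device code) shows one possible reading of the "adaptive workload partition" part: router models stay in their own partition, while core models are periodically re-assigned to simulation threads using recent per-core event counts as a load estimate. The greedy heuristic and every name here are assumptions for illustration, not the paper's implementation.

// Host-side sketch: re-partition simulated cores over host threads by
// recent event counts; routers keep a dedicated partition of their own.
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy longest-processing-time assignment of weighted items to bins.
std::vector<int> assign_to_threads(const std::vector<long>& load, int nthreads) {
    std::vector<int> order(load.size()), owner(load.size());
    std::vector<long> bin(nthreads, 0);
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });
    for (int i : order) {
        int t = (int)(std::min_element(bin.begin(), bin.end()) - bin.begin());
        owner[i] = t;
        bin[t] += load[i];
    }
    return owner;
}

int main() {
    // Recent event counts observed for 8 simulated cores (made-up numbers).
    std::vector<long> core_events = {120, 40, 300, 95, 10, 220, 75, 180};
    // Cores are spread over 3 host threads; the on-chip network keeps a
    // dedicated host thread and is synchronized with the cores separately.
    std::vector<int> core_owner = assign_to_threads(core_events, 3);
    for (int c = 0; c < (int)core_owner.size(); ++c)
        std::printf("core %d -> host thread %d\n", c, core_owner[c]);
    std::printf("routers -> dedicated host thread 3\n");
    return 0;
}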
Archive | 2010
Dongrui Fan; Shuai Jiao; Zhengmeng Lei; Weidong Xu; Hao Zhang
ieee international conference on high performance computing data and analytics | 2012
Shuai Jiao; Da Wang; Xiaochun Ye; Weizhi Xu; Hao Zhang; Ninghui Sun
World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering | 2012
Weizhi Xu; Zhiyong Liu; Dongrui Fan; Shuai Jiao; Xiaochun Ye; Fenglong Song; Chenggang Yan
Archive | 2012
Weizhi Xu; Shuai Jiao; Hao Zhang; Zhiyong Liu; Dongrui Fan; Zhengmeng Lei; Fenglong Song; Da Wang
Archive | 2012
Shuai Zhang; Shuai Jiao; Hao Zhang; Dongrui Fan; Haizhong Li
Chinese Journal of Computers | 2011
Shuai Jiao; Wei-Zhi Xu; Shibin Tang; Dongrui Fan; Ninghui Sun