Publication


Featured research published by Weizhi Xu.


Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing | 2012

Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU

Weizhi Xu; Hao Zhang; Shuai Jiao; Da Wang; Fenglong Song; Zhiyong Liu

Tuning the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of its irregularity. In this paper, we propose a cache blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, which are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experimental results on a GeForce GTX 480 show that the SpMV kernel with the cache blocking method is up to 5x faster than the unblocked CSR kernel.
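
The abstract describes the scheme only in prose; the sketch below illustrates the idea. It is a minimal sketch, not the paper's code: it assumes the matrix has been pre-partitioned into column blocks, each stored as its own CSR structure with block-local column indices (the CsrBlock layout here is hypothetical), with one kernel launch per block so the matching slice of x stays resident in the Fermi L1/L2 cache.

```cuda
// Hypothetical sketch of column-blocked CSR SpMV: the matrix is split into
// vertical blocks so each launch touches only a slice of x, which can then
// be reused out of the GPU cache instead of refetched from global memory.
#include <cuda_runtime.h>

struct CsrBlock {            // one sub-block, stored in its own CSR arrays
    int          rows;       // number of rows in the block
    const int   *rowPtr;     // rows + 1 entries
    const int   *colIdx;     // column indices relative to the block's x slice
    const float *vals;
};

__global__ void spmv_csr_block(CsrBlock blk, const float *xSlice, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= blk.rows) return;

    float sum = 0.0f;
    for (int k = blk.rowPtr[row]; k < blk.rowPtr[row + 1]; ++k)
        sum += blk.vals[k] * xSlice[blk.colIdx[k]];  // hits cached slice of x

    y[row] += sum;           // accumulate partial results across column blocks
}
```

The host side would zero y once, then launch the kernel once per column block, passing x offset to that block's column range; y accumulates the partial sums block by block.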


Computer and Information Technology | 2012

Libvmi: A Library for Bridging the Semantic Gap between Guest OS and VMM

Haiquan Xiong; Zhiyong Liu; Weizhi Xu; Shuai Jiao

The semantic gap is one of the most important problems in virtualized computer systems. Solving it not only helps in developing security and virtual machine monitoring applications, but also benefits VMM resource management and VMM-based service implementation. In this paper, we first review the general architecture of virtual computer systems, especially the interaction principles between the Guest OS and the VMM, so as to better understand the causes of the semantic gap. Then we consider how to build a library that integrates the commonalities existing across different VMMs and Guest OSes; it should be general enough to facilitate the development of the applications mentioned above. To illustrate this more clearly, we select Libvmi as a case study, elaborating its design philosophy and implementation. Finally, we describe how to use Libvmi in two application examples and verify its correctness and effectiveness.


International Conference on Parallel and Distributed Systems | 2012

Auto-Tuning GEMV on Many-Core GPU

Weizhi Xu; Zhiyong Liu; Jun Wu; Xiaochun Ye; Shuai Jiao; Da Wang; Fenglong Song; Dongrui Fan

GPUs provide powerful computing ability, especially for data-parallel algorithms. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult. Different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are not efficient, especially for small matrices and fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using only one thread, we use a warp to compute each element of vector y. We also propose a novel register blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are comprehensively evaluated on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 and MAGMA, and the new register blocking method also outperforms CUBLAS 4.0 and MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.
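
A minimal sketch of the warp-per-element idea: 32 threads cooperatively traverse one row of a row-major matrix A and reduce their partial sums, so each element of y is produced by a whole warp instead of a single thread. The paper targets Fermi, which predates warp shuffles, so its reduction would go through shared memory; the shuffle version below is only a compact modern illustration, not the paper's tuned kernel, and it omits the register blocking and auto-tuning.

```cuda
// Warp-per-row GEMV sketch: y = A * x for a row-major m x n matrix A.
// Launch with blockDim.x a multiple of 32 and
// gridDim.x = ceil(m / (blockDim.x / 32)).
#include <cuda_runtime.h>

__global__ void gemv_warp_per_row(int m, int n,
                                  const float *A, const float *x, float *y)
{
    int warpsPerBlock = blockDim.x / 32;
    int row  = blockIdx.x * warpsPerBlock + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= m) return;

    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)      // coalesced row traversal
        sum += A[(size_t)row * n + col] * x[col];

    for (int off = 16; off > 0; off >>= 1)        // warp-level reduction
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (lane == 0) y[row] = sum;                  // lane 0 owns the result
}
```

This layout keeps the row accesses coalesced and gives a fat matrix (few rows, many columns) enough active warps to fill the machine, which is exactly the case where a one-thread-per-row kernel starves.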


Journal of Visual Communication and Image Representation | 2014

Fast and scalable lock methods for video coding on many-core architecture

Weizhi Xu; Hui Yu; Dianjie Lu; Fenglong Song; Da Wang; Xiaochun Ye; Songwei Pei; Dongrui Fan; Hongtao Xie

Highlights: we propose a centralized hardware lock method for many-core architecture; we propose a distributed hardware lock method for many-core architecture; we study and compare the performance of the two proposed lock methods and a software lock.

Many-core processors are good candidates for speeding up video coding because the many-core architecture can exploit the parallelism of these applications more efficiently. Lock methods are important on many-core architectures to ensure correct execution of the program and communication between threads on chip. The efficiency of the lock method is critical to the overall performance of an on-chip many-core processor. In this paper, we propose two types of hardware locks for on-chip many-core architecture: a centralized lock and a distributed lock. First, we design the architectures of the centralized lock and the distributed lock to implement the two hardware lock methods. Then, we evaluate the performance of the two hardware locks and a software lock with micro-benchmarks on Godson-T, a many-core processor simulator. The experimental results show that the locks with dedicated hardware support outperform the software lock, and the distributed hardware lock is more scalable than the centralized hardware lock.
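
Godson-T's hardware lock units cannot be programmed outside the simulator, so the following is only a host-side C++ analogue (hypothetical names, valid CUDA host code) that mirrors the structural contrast: a centralized design where every lock word shares one home location, versus a distributed design where each lock word gets its own cache line.

```cuda
// Software analogue of centralized vs distributed lock placement.
// This mirrors only the structure, not the hardware cost model.
#include <atomic>

// Centralized: all 16 lock words packed into one 64-byte home, so requests
// for *different* locks still collide on the same location.
struct alignas(64) CentralHome { std::atomic<int> word[16] = {}; };
static CentralHome central;

void central_lock(int id)   { while (central.word[id].exchange(1, std::memory_order_acquire)) {} }
void central_unlock(int id) { central.word[id].store(0, std::memory_order_release); }

// Distributed: each lock word sits on its own cache line ("home node"),
// so independent locks never contend with each other.
struct alignas(64) LockWord { std::atomic<int> v{0}; };
static LockWord distributed[16];

void dist_lock(int id)   { while (distributed[id].v.exchange(1, std::memory_order_acquire)) {} }
void dist_unlock(int id) { distributed[id].v.store(0, std::memory_order_release); }
```

Even when threads take different locks, the centralized layout serializes traffic through one home location; spreading the lock words removes that shared hot spot, which is the scalability gap the paper measures.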


European Conference on Parallel Processing | 2010

Efficient address mapping of shared cache for on-chip many-core architecture

Fenglong Song; Dongrui Fan; Zhiyong Liu; Junchao Zhang; Lei Yu; Weizhi Xu

The performance of the on-chip cache is critical for the processor. The multithreaded programming model usually employed by on-chip many-core architectures affects cache access patterns and, in turn, cache conflict miss behavior. However, this behavior is still not well understood, and little is known about the effectiveness of XOR mapping schemes for many-core systems. In this paper we focus on these problems. We propose an XOR-based address mapping scheme for on-chip many-core architecture to increase the performance of the cache system. We then evaluate the proposed scheme on various applications, including a bioinformatics application, matrix multiplication, and LU decomposition and FFT from the SPLASH-2 benchmarks. Experimental results show that the proposed scheme reduces conflict misses in the shared cache by about 53% on average and improves overall performance by about 6%. The results also show that the XOR scheme is more cost-effective than a victim cache scheme.
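
A sketch of the difference between a conventional modulo set index and an XOR-based one, with assumed parameters (512 sets, 64-byte lines); the abstract does not give the paper's exact bit selection, so the bit positions below are illustrative.

```cuda
// Conventional vs XOR-based set index for a 512-set, 64 B/line shared cache.
__host__ __device__ unsigned set_index_mod(unsigned long long addr)
{
    return (unsigned)(addr >> 6) & 511u;          // line address mod 512
}

__host__ __device__ unsigned set_index_xor(unsigned long long addr)
{
    unsigned lo = (unsigned)(addr >> 6)  & 511u;  // classic index bits 6..14
    unsigned hi = (unsigned)(addr >> 15) & 511u;  // next 9 bits of the tag
    return lo ^ hi;     // XOR spreads power-of-two strides across sets
}
```

For example, addresses 32 KB apart share bits 6..14, so under the modulo mapping they all land in the same set and conflict; XORing in the higher bits sends them to different sets, which is how the scheme cuts conflict misses for strided multithreaded access patterns.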


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

PartitionSim: A Parallel Simulator for Many-cores

Shuai Jiao; Da Wang; Xiaochun Ye; Weizhi Xu; Hao Zhang; Ninghui Sun


World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering | 2012

Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Weizhi Xu; Zhiyong Liu; Dongrui Fan; Shuai Jiao; Xiaochun Ye; Fenglong Song; Chenggang Yan


Patent | 2012

On-chip synchronization method for many-core processor and system thereof

Weizhi Xu; Shuai Jiao; Hao Zhang; Zhiyong Liu; Dongrui Fan; Zhengmeng Lei; Fenglong Song; Da Wang


Journal of Visual Communication and Image Representation | 2015

Corrigendum to "Fast and scalable lock methods for video coding on many-core architecture" [J. Vis. Commun. Image Represent. 25 (7) (2014) 1758-1762]

Weizhi Xu; Hui Yu; Dianjie Lu; Fenglong Song; Da Wang; Xiaochun Ye; Songwei Pei; Dongrui Fan; Hongtao Xie


International Symposium on Circuits and Systems | 2012

A SAT-based diagnosis pattern generation method for timing faults in scan chains

Da Wang; Lunkai Zhang; Weizhi Xu; Dongrui Fan; Fei Wang

Collaboration


Dive into Weizhi Xu's collaborations.

Top Co-Authors

Da Wang, Chinese Academy of Sciences
Dongrui Fan, Chinese Academy of Sciences
Fenglong Song, Chinese Academy of Sciences
Shuai Jiao, Chinese Academy of Sciences
Zhiyong Liu, Chinese Academy of Sciences
Xiaochun Ye, Chinese Academy of Sciences
Hao Zhang, Chinese Academy of Sciences
Dianjie Lu, Shandong Normal University
Hongtao Xie, Chinese Academy of Sciences
Hui Yu, Chinese Academy of Sciences