Publication


Featured research published by Xiaowen Feng.


International Conference on Parallel and Distributed Systems | 2011

Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs

Xiaowen Feng; Hai Jin; Ran Zheng; Kan Hu; Jingxiang Zeng; Zhiyuan Shao

Sparse Matrix-Vector multiplication (SpMV) is one of the most significant yet challenging kernels in computational science. It is a memory-bound application whose performance depends mostly on the input matrix and the underlying architecture. Many researchers have explored a variety of optimization techniques for SpMV, and one of the most promising directions is adapting the storage format to the underlying architecture. Alternative storage formats can greatly reduce memory pressure; however, computational resources are often left underutilized. Therefore, a new storage format, called Compressed Sparse Row with Segmented Interleave Combination (SIC), is proposed. Stemming from the Compressed Sparse Row (CSR) format, SIC employs an interleaved combination pattern that combines a certain number of CSR rows to form a new SIC row. Segmented processing is also introduced to further improve performance. Based on empirical data, an automatic SIC-based SpMV is developed that is suitable for arbitrary matrices. Experimental results show that the approach outperforms the NVIDIA CSR vector kernel, achieving up to a 12.6× speedup, and demonstrates performance comparable to the HYB (hybrid) format, with a speedup of up to 2.89×.
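For readers unfamiliar with the baseline format that SIC extends, the sketch below shows plain CSR storage and the corresponding SpMV loop. This is a minimal CPU reference in Python, not the authors' GPU kernel; SIC additionally regroups several CSR rows into interleaved SIC rows so that neighboring GPU threads access neighboring non-zeros.

```python
def dense_to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix A; one output element per row."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# Tiny example: a 3x3 sparse matrix times a vector.
A = [[4, 0, 0],
     [0, 0, 2],
     [1, 3, 0]]
vals, cols, ptr = dense_to_csr(A)
print(spmv_csr(vals, cols, ptr, [1.0, 2.0, 3.0]))  # -> [4.0, 6.0, 7.0]
```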


IET Image Processing | 2014

Weighting scheme for image retrieval based on bag-of-visual-words

Lei Zhu; Hai Jin; Ran Zheng; Xiaowen Feng

Inspired by the success of bag-of-words in text retrieval, bag-of-visual-words and its variants are widely used in content-based image retrieval to describe visual content. Various weighting schemes have been proposed to integrate different yet complementary visual words. However, most of these schemes use a fixed weight for every visual word extracted from the query image, which can discard discriminative information. This study presents a combination method that derives query-specific weights for the visual words of a query image. The method consists of two stages. First, in offline weight learning, a linear classifier is used to build a query-category mapping table, and max-margin learning is used to build a category-weight mapping table; the former maps a query image to its most likely image class, and the latter maps an image class to the weights of its visual words. Second, in online weight mapping, the weights of the visual words are determined efficiently by looking them up in the pre-learned mapping tables. Experimental results on the WANG database and Caltech 101 demonstrate that the proposed scheme effectively weights the visual words of a query image according to their discriminative information, and comparative experiments show that it achieves higher retrieval performance than other weighting schemes.
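At query time the proposed scheme reduces to two table lookups followed by a weighted histogram comparison. The sketch below illustrates that online stage under the assumption that the offline stage has already produced the two mapping tables; the table contents, names, and similarity function are illustrative, not taken from the paper.

```python
# Hypothetical pre-learned tables (produced by the offline stage).
query_to_category = {"query_img_17": "beach"}                       # classifier output
category_to_weights = {"beach": {"sand": 0.7, "sky": 0.2, "tree": 0.1}}

def weighted_histogram(visual_word_counts, category):
    """Scale a query's visual-word histogram by its category-specific weights."""
    weights = category_to_weights[category]
    return {w: c * weights.get(w, 0.0) for w, c in visual_word_counts.items()}

def score(query_hist, db_hist):
    """Simple dot-product similarity between weighted histograms."""
    return sum(v * db_hist.get(w, 0.0) for w, v in query_hist.items())

category = query_to_category["query_img_17"]
q = weighted_histogram({"sand": 5, "sky": 3, "tree": 1}, category)
print(score(q, {"sand": 4, "sky": 1, "tree": 6}))  # similarity to one database image
```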


Concurrency and Computation: Practice and Experience | 2014

A segment‐based sparse matrix–vector multiplication on CUDA

Xiaowen Feng; Hai Jin; Ran Zheng; Zhiyuan Shao; Lei Zhu

The main performance challenge for Sparse Matrix-Vector multiplication (SpMV) is memory bandwidth, which depends mostly on the input matrix and the underlying computing platform. Many researchers have explored a variety of optimization techniques for this problem, and one of the most promising directions is the design of storage formats for sparse matrices. However, many prior storage formats cannot fully exploit the underlying platform, resulting in unsatisfactory performance and a large memory footprint. Therefore, a novel storage format, called Segmented Hybrid ELL + Compressed Sparse Row (SHEC), is proposed to improve throughput and reduce the memory footprint on Graphics Processing Units (GPUs). The SHEC format employs an interleaved combination pattern that combines a certain number of compressed rows to form a new SHEC row, and segmentation is introduced to balance load and reduce the memory footprint. Based on empirical data, an automatic SHEC-based SpMV is developed to fit arbitrary matrices. Experimental results show that the SHEC approach outperforms the best results of the NVIDIA SpMV library and exhibits performance comparable to state-of-the-art storage formats on the standard dataset.
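SHEC builds on the hybrid ELL + CSR idea, in which each row stores up to a fixed number of non-zeros in a regular, padded ELL part and spills the rest into an irregular remainder. A minimal CPU sketch of that basic split and the resulting SpMV is given below; it is background for the format, not the authors' SHEC implementation.

```python
def hybrid_split(dense, k):
    """Split a dense matrix into an ELL part (width k) plus a COO remainder."""
    ell_vals, ell_cols, coo = [], [], []
    for i, row in enumerate(dense):
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        head, tail = nz[:k], nz[k:]
        # ELL: pad every row to exactly k entries (column -1 marks padding).
        ell_cols.append([j for j, _ in head] + [-1] * (k - len(head)))
        ell_vals.append([v for _, v in head] + [0] * (k - len(head)))
        # Overflow non-zeros go to a coordinate list.
        coo.extend((i, j, v) for j, v in tail)
    return ell_vals, ell_cols, coo

def hybrid_spmv(ell_vals, ell_cols, coo, x):
    """y = A @ x using the regular ELL part plus the irregular remainder."""
    y = [sum(v * x[j] for v, j in zip(vr, cr) if j >= 0)
         for vr, cr in zip(ell_vals, ell_cols)]
    for i, j, v in coo:
        y[i] += v * x[j]
    return y

A = [[1, 0, 2, 3], [0, 4, 0, 0], [5, 6, 7, 8]]
print(hybrid_spmv(*hybrid_split(A, 2), [1, 1, 1, 1]))  # -> [6, 4, 26]
```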


ChinaGrid Annual Conference | 2012

Parallelization Mechanisms of Neighbor-Joining for CUDA Enabled Devices

Ran Zheng; Qiongyao Zhang; Hai Jin; Zhiyuan Shao; Xiaowen Feng

Multiple Sequence Alignment (MSA) is a fundamental process in bioinformatics, and phylogenetic tree reconstruction is an essential operation within it. The Neighbor-Joining algorithm is a preferred approach for reconstructing phylogenetic trees because of its low time and space costs. With the rapid growth of biological sequence data, however, reconstructing a phylogenetic tree can take many hours or even days due to the complex computation required for multiple sequence alignment. In this paper, two mechanisms for parallelizing the Neighbor-Joining algorithm on CUDA are proposed to achieve higher performance with lower time and space costs. Data dependency is reduced by changing the execution mode, and a dynamic multi-granularity mechanism is used to handle unbalanced guide trees with lower resource occupation and higher efficiency. Compared with the baseline method, the parallelization mechanisms achieve an average speedup of 18.6 on datasets of thousands of sequences as well as on datasets with distant genetic relationships.
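For context, one iteration of the sequential Neighbor-Joining algorithm that the paper parallelizes: compute the Q criterion from the current distance matrix, join the closest pair, and shrink the matrix. The compact CPU sketch below is a textbook version, not the CUDA implementation.

```python
def nj_step(D, names):
    """One Neighbor-Joining iteration: join the pair minimizing the Q criterion."""
    n = len(D)
    totals = [sum(row) for row in D]
    # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: (n - 2) * D[p[0]][p[1]] - totals[p[0]] - totals[p[1]])
    # Distances from the new internal node to every remaining taxon.
    new_row = [0.5 * (D[i][k] + D[j][k] - D[i][j])
               for k in range(n) if k not in (i, j)]
    keep = [k for k in range(n) if k not in (i, j)]
    D2 = [[D[a][b] for b in keep] for a in keep]
    for row, d in zip(D2, new_row):
        row.append(d)
    D2.append(new_row + [0.0])
    names2 = [names[k] for k in keep] + [f"({names[i]},{names[j]})"]
    return D2, names2

D = [[0, 5, 9, 9], [5, 0, 10, 10], [9, 10, 0, 8], [9, 10, 8, 0]]
D2, names = nj_step(D, ["A", "B", "C", "D"])
print(names)  # the joined pair appears as one new internal node
```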


International Conference on Smart Computing | 2014

Near-duplicate detection using GPU-based simhash scheme

Xiaowen Feng; Hai Jin; Ran Zheng; Lei Zhu

With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Eliminating near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is finding near-duplicate records in large-scale collections efficiently, and there have already been several efforts to implement near-duplicate detection on different architectures. In this paper, a new implementation based on the simhash function is proposed to identify near-duplicate documents on CUDA-enabled devices. Two mechanisms, swapping and dynamic allocation, are designed to achieve higher performance. Experimental results show that the parallel implementation outperforms the serial CPU version, achieving up to an 18× speedup.
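As background, a minimal CPU sketch of the simhash fingerprinting the detector relies on: every token hashes to a 64-bit value, per-bit votes are accumulated, and two documents are treated as near-duplicates when the Hamming distance between their fingerprints is small. The code and example documents are illustrative, not the paper's CUDA implementation.

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit simhash fingerprint of a token sequence (repeats act as weights)."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

base = ("elimination of near duplicate documents reduces storage cost "
        "and improves the quality of search indexes in data mining ").split() * 4
near_dup = base + ["slightly", "edited"]
unrelated = ("sparse matrix vector multiplication on graphics processors "
             "with compressed row storage formats ").split() * 4

f_base, f_dup, f_other = simhash(base), simhash(near_dup), simhash(unrelated)
print(hamming(f_base, f_dup))    # near-duplicates: usually only a few bits apart
print(hamming(f_base, f_other))  # unrelated text: roughly half of the 64 bits differ
```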


Asia-Pacific Symposium on Internetware | 2013

Accelerate MapReduce on GPUs with multi-level reduction

Ran Zheng; Kai Liu; Hai Jin; Qin Zhang; Xiaowen Feng

With Graphics Processing Units (GPUs) becoming more and more popular in general-purpose computing, increasing attention has been paid to building frameworks that provide convenient interfaces for GPU programming. MapReduce greatly simplifies programming for data-parallel applications in cloud computing environments and is naturally suitable for GPUs. However, existing reduction-based MapReduce implementations on GPUs have a problem: their performance degrades dramatically when handling a large number of distinct keys, because the data cannot be stored entirely in the small shared memory. A new MapReduce framework on GPUs, called Jupiter, is proposed with a continuous reduction structure. Jupiter provides two improvements: a multi-level reduction scheme tailored to the GPU memory hierarchy, and a frequency-based cache policy for key-value pairs in shared memory. Shared memory is thereby utilized efficiently for data-parallel applications with either few or many distinct keys. Experiments show that Jupiter achieves up to a 3× speedup over the original reduction-based GPU MapReduce framework on applications with many distinct keys.
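To illustrate the general idea behind a multi-level reduction, independent of Jupiter's GPU-specific details, the sketch below first reduces the key-value pairs of each "block" into a small partial table and then merges the partial tables globally; on a GPU the first level would live in per-block shared memory and the second in global memory. All names and data are illustrative.

```python
from collections import Counter

def block_reduce(pairs):
    """Level 1: combine the key-value pairs emitted by one block's map stage."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value          # combine duplicate keys early
    return partial

def global_reduce(partials):
    """Level 2: merge the per-block partial results into the final table."""
    total = Counter()
    for p in partials:
        total.update(p)                # Counter.update adds counts
    return dict(total)

# Word count: each sub-list stands in for the output of one thread block.
blocks = [[("gpu", 1), ("map", 1), ("gpu", 1)],
          [("reduce", 1), ("gpu", 1)],
          [("map", 1), ("reduce", 1), ("reduce", 1)]]
print(global_reduce(block_reduce(b) for b in blocks))
# {'gpu': 3, 'map': 2, 'reduce': 3}
```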


International Conference on Cloud and Green Computing | 2012

Implementing Smith-Waterman Algorithm with Two-Dimensional Cache on GPUs

Xiaowen Feng; Hai Jin; Ran Zheng; Zhiyuan Shao; Lei Zhu

Finding regions of similarity between two data streams is a computationally intensive and memory-consuming problem; for biological sequences it is known as sequence alignment. The Smith-Waterman algorithm is an optimal method for finding local sequence alignments, but it requires a large amount of computation and memory and is constrained by memory access speed when accelerated on Graphics Processing Units (GPUs). A new method for implementing the Smith-Waterman algorithm with a two-dimensional cache is proposed, which accelerates the first stage of the algorithm and writes the corresponding results back to GPU global memory in a coalesced manner. The proposal is implemented on an NVIDIA GeForce GTX 295 GPU and compared with CUDASW++ 2.0. Experimental results show that the approach outperforms CUDASW++ 2.0 on datasets chosen from NCBI.
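For reference, the dynamic-programming recurrence that Smith-Waterman kernels evaluate is H(i,j) = max(0, H(i-1,j-1) + s(a_i, b_j), H(i-1,j) - gap, H(i,j-1) - gap). The compact CPU sketch below fills the score matrix and reports the best local-alignment score; the scoring parameters are illustrative and the sketch does not reproduce the paper's two-dimensional cache scheme.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Fill the Smith-Waterman score matrix; return the best local-alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] - gap,     # gap in b
                          H[i][j - 1] - gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # best local-alignment score
```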


International Conference on Digital Information Management | 2014

Parallel singular value decomposition on heterogeneous multi-core and multi-GPU platforms

Xiaowen Feng; Hai Jin; Ran Zheng; Lei Zhu

Singular value decomposition (SVD) is one of the most fundamental matrix computations in numerical linear algebra. The traditional solution is the QR-iteration-based SVD algorithm on the CPU, which is time-consuming. Graphics Processing Units (GPUs) are now suited to many general-purpose tasks and have emerged as low-price, high-performance accelerators. In this paper, the parallel-friendly divide-and-conquer approach is employed to accelerate the SVD algorithm on heterogeneous multi-core and multi-GPU systems. Two mechanisms are designed to make good use of the computational resources of the heterogeneous system: two-layer divide-and-conquer and coordination between CPU and GPU. The experimental results show that the algorithm is faster than Intel MKL on four CPU cores, reaches a 45× speedup over LAPACK with four NVIDIA GTX 460 GPUs, and achieves about a further 1.5× speedup when the number of GPU devices is doubled.
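As a toy illustration of why divide-and-conquer exposes work that can be dispatched to several devices: for a block-diagonal matrix, the SVD factors assemble directly from the independent SVDs of the blocks. The real divide-and-conquer SVD also couples the halves through a rank-one modification, which this CPU sketch (using NumPy) omits.

```python
import numpy as np

def block_diag(mats):
    """Assemble a block-diagonal matrix from a list of 2-D arrays."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 3)), rng.standard_normal((5, 5))]

# Independent sub-SVDs (these could run on different GPUs in parallel).
parts = [np.linalg.svd(B, full_matrices=False) for B in blocks]

# Assemble factors of the block-diagonal matrix from the sub-factors.
U = block_diag([p[0] for p in parts])
S = np.concatenate([p[1] for p in parts])
Vt = block_diag([p[2] for p in parts])

A = block_diag(blocks)
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True: the pieces reassemble A
```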


Computers & Electrical Engineering | 2014

A GPU-based parallel method for evolutionary tree construction

Ran Zheng; Qiongyao Zhang; Hai Jin; Zhiyuan Shao; Xiaowen Feng

Evolutionary trees are widely used to show the inferred evolutionary relationships among species or other entities. Neighbor-Joining is one solution for data-intensive and time-consuming evolutionary tree construction, with polynomial time complexity; however, its performance degrades as datasets grow. Graphics Processing Units (GPUs) offer new opportunities for such time-consuming applications. Building on their high efficiency, a GPU-based parallel Neighbor-Joining method is proposed, and two efficient parallel mechanisms, data segmentation with asynchronous processing and a minimal-chain model with bitonic sort, are put forward to speed up the processing. The experimental results show that an average speedup of 25.1 is achieved, and approximately 30 can be obtained on sequence datasets ranging from 16,000 to 25,000 sequences. Moreover, the proposed parallel mechanisms can be effectively exploited in other high-performance applications.
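One of the named mechanisms pairs a minimal-chain model with bitonic sort for locating minimum distances on the GPU. As background, the sketch below shows the bitonic sorting network itself in recursive CPU Python; its fixed, data-independent compare-exchange pattern is what makes it GPU-friendly, and a real implementation would run each compare-exchange stage as a parallel kernel.

```python
def bitonic_sort(data, ascending=True):
    """Recursive bitonic sort; requires len(data) to be a power of two."""
    n = len(data)
    if n <= 1:
        return data
    first = bitonic_sort(data[:n // 2], True)    # ascending half
    second = bitonic_sort(data[n // 2:], False)  # descending half -> bitonic sequence
    return bitonic_merge(first + second, ascending)

def bitonic_merge(data, ascending):
    """Merge a bitonic sequence with fixed-distance compare-and-swap steps."""
    n = len(data)
    if n <= 1:
        return data
    half = n // 2
    data = data[:]
    for i in range(half):
        if (data[i] > data[i + half]) == ascending:
            data[i], data[i + half] = data[i + half], data[i]
    return (bitonic_merge(data[:half], ascending) +
            bitonic_merge(data[half:], ascending))

values = [19, 3, 42, 7, 11, 27, 5, 30]
print(bitonic_sort(values))     # [3, 5, 7, 11, 19, 27, 30, 42]
print(bitonic_sort(values)[0])  # smallest element, e.g. a minimum-distance candidate
```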


The Journal of Supercomputing | 2014

Effective naive Bayes nearest neighbor based image classification on GPU

Lei Zhu; Hai Jin; Ran Zheng; Xiaowen Feng

Collaboration


Dive into Xiaowen Feng's collaborations.

Top Co-Authors

Ran Zheng, Huazhong University of Science and Technology
Hai Jin, Huazhong University of Science and Technology
Lei Zhu, Huazhong University of Science and Technology
Zhiyuan Shao, Huazhong University of Science and Technology
Qiongyao Zhang, Huazhong University of Science and Technology
Qin Zhang, Huazhong University of Science and Technology
Jingxiang Zeng, Huazhong University of Science and Technology
Kai Liu, Huazhong University of Science and Technology
Kan Hu, Huazhong University of Science and Technology
Weiqi Dai, Huazhong University of Science and Technology