Gu Liu
University of Science and Technology of China
Publication
Featured research published by Gu Liu.
frontier of computer science and technology | 2009
Gu Liu; Hong An; Wenting Han; Guang Xu; Ping Yao; Mu Xu; Xiurui Hao; Yaobin Wang
Recently, many studies have mapped cryptographic algorithms onto graphics processors (GPUs) and achieved large performance gains. This paper does not focus on the performance of a specific program tuned with every available algorithmic optimization, but on the intrinsic reasons, rooted in GPU architectural features, for this performance improvement. We therefore present a study of several block encryption algorithms (AES, TRI-DES, RC5, TWOFISH, and the chained block ciphers formed by combining them) running on a GPU using CUDA. We introduce our CUDA implementations and investigate the programs' behavioral characteristics and their impact on performance in four aspects. We find that the number of threads used by a CUDA program can fundamentally affect overall performance. Many block encryption algorithms can benefit from shared memory if its capacity is large enough to hold the lookup tables. Data stored in device memory should be organized deliberately to avoid performance degradation. In addition, communication between host and device may turn out to be the bottleneck of a program. Through these analyses we hope to find an effective way to optimize a CUDA program and to reveal desirable architectural features that would better support block encryption applications.
international conference on high performance computing and simulation | 2010
Ping Yao; Hong An; Mu Xu; Gu Liu; Xiaoqiang Li; Yaobin Wang; Wenting Han
GPUs have recently been used to accelerate data-parallel applications because they provide easier programmability and increased generality while maintaining tremendous memory bandwidth and computational power. Most of these applications use the CPU as a controller that decides when the GPU runs the compute-intensive tasks. This CPU-control-GPU-compute pattern wastes much of the CPU's computational power. In this paper, we present a new CPU-GPU cooperative pattern for bioinformatics applications that uses both the CPU and the GPU for computation. The pattern has two parts: 1) a load-balanced data structure that manages data to keep the GPU's computational efficiency high even when the length distribution of sequences in a sequence database is very uneven; 2) a multi-threaded code partition that schedules computation on the CPU and GPU cooperatively. Using this pattern, we develop CuHMMer, based on HMMER, one of the most important algorithms in bioinformatics. The experimental results demonstrate that CuHMMer achieves a 13x to 45x speedup over available CPU implementations and can also outperform traditional CUDA implementations that use the CPU-control-GPU-compute pattern.
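One simple way to see why uneven sequence lengths hurt efficiency is to batch sequences and count the work wasted when every sequence in a batch is padded to the longest one. The sketch below is an illustration only, not the data structure used in CuHMMer: the functions `length_balanced_batches` and `padding_waste` are hypothetical names, and sorting by length is an assumed strategy chosen for the example.

```python
def length_balanced_batches(sequences, batch_size):
    """Group sequences of similar length together (assumed strategy: sort by length)."""
    ordered = sorted(sequences, key=len, reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_waste(batch):
    """Count idle positions if every sequence is padded to the longest in its batch."""
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)


# Toy database with a very uneven length distribution.
database = ["A" * 1, "A" * 8, "A" * 2, "A" * 9]

# Batches taken in database order versus batches grouped by length.
naive = [database[i:i + 2] for i in range(0, len(database), 2)]
balanced = length_balanced_batches(database, 2)

naive_waste = sum(padding_waste(b) for b in naive)
balanced_waste = sum(padding_waste(b) for b in balanced)
```

In this toy case, grouping by length leaves far fewer padded positions than batching in database order, which is the kind of imbalance the abstract's first component is meant to address.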
international parallel and distributed processing symposium | 2012
Xiaoqiang Li; Wenting Han; Gu Liu; Hong An; Mu Xu; Wei Zhou; Qi Li
Given the exponential growth of bioinformatics databases and the rapidly growing popularity of GPUs for general-purpose computing, employing GPU techniques to accelerate the sequence search process is promising. Hmmsearch, from the HMMER bioinformatics software package, is a widely used tool for sensitive profile HMM (Hidden Markov Model) searches of biological sequence databases. In this paper, we implement a speculative version of hmmsearch on an NVIDIA Fermi GPU and apply various optimizations to it. We test the enhancements in our GPU implementation to demonstrate the effectiveness of the optimization strategies. The results show that our speculative hmmsearch implementation achieves up to a 6.5x speedup over the previous fast single-threaded SSE implementation.
acm sigplan symposium on principles and practice of parallel programming | 2012
Gu Liu; Hong An; Wenting Han; Xiaoqiang Li; Tao Sun; Wei Zhou; Xuechao Wei; Xulong Tang
In this paper, we present FlexBFS, a parallelism-aware implementation of breadth-first search on the GPU. Our implementation dynamically adjusts its computation resources according to feedback on the available parallelism. We also optimize our program in three ways: (1) simplified two-level queue management, (2) a combined kernel strategy, and (3) a specialization approach for high-degree vertices. Our experimental results show that it achieves a 3x to 20x speedup over the fastest serial version, and that it outperforms both the TBB-based multi-threaded CPU version and the most effective previous GPU version on all types of input graphs.
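The "available parallelism" in breadth-first search can be read off the size of the frontier at each level. The following is a minimal serial sketch of level-synchronous BFS that records those frontier sizes. It is not FlexBFS itself; the function name `bfs_levels` and the adjacency-dictionary graph format are assumptions made for illustration.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS.

    Returns (distance of each reached vertex, frontier size at each level).
    The frontier size is the amount of work available in parallel at that level.
    """
    dist = {source: 0}
    frontier = [source]
    frontier_sizes = []
    while frontier:
        frontier_sizes.append(len(frontier))
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:  # first visit fixes the BFS distance
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return dist, frontier_sizes


# Small example graph: 0 -> {1, 2} -> 3 -> 4
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
dist, sizes = bfs_levels(adj, 0)
```

The frontier sizes vary from level to level, so a fixed amount of computing resources is either underused or oversubscribed at some levels, which motivates adjusting resources dynamically.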
bio-inspired computing: theories and applications | 2010
Qian Zhang; Hong An; Gu Liu; Wenting Han; Ping Yao; Mu Xu; Xiaoqiang Li
Memory optimization is an important strategy for achieving high performance in sequence alignment implemented with CUDA on GPGPUs. The Smith-Waterman (SW) algorithm is the most sensitive algorithm widely used for local sequence alignment, but it is very time-consuming. Although several parallel methods have been used in earlier studies and have shown good performance, the advantages of the GPGPU memory hierarchy are still not fully exploited. This paper presents a new parallel method on GPGPU that uses on-chip memory more efficiently to optimize the parallel Smith-Waterman sequence alignment presented by Gregory M. Striemer. To minimize the cost of data transfers, on-chip shared memory is used to store intermediate results. Constant memory is also used effectively in our implementation of the parallel Smith-Waterman algorithm. Using these two kinds of on-chip memory reduces long-latency memory accesses and lowers the demand for global memory when aligning longer sequences. The experimental results show a 1.66x to 3.16x speedup over Striemer's parallel SW on GPGPU in terms of execution time, and a 19.70x average and 22.43x peak speedup over serial SW in terms of clock cycles on our computer platform.
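For reference, the serial Smith-Waterman recurrence that such GPU implementations parallelize can be sketched as below. This is a minimal illustration, not the paper's code: the linear gap penalty and the match/mismatch scores are assumed values, and the function name `smith_waterman_score` is hypothetical. Only two rows of the score matrix are kept at a time, which hints at why the intermediate results can fit in small on-chip memory.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (linear gap penalty, assumed scores)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment
                prev[j - 1] + s,  # align a[i-1] with b[j-1]
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbors, so cells on the same anti-diagonal are independent of each other, and separate query-database pairs are independent as well.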
intelligent information hiding and multimedia signal processing | 2009
Guang Xu; Hong An; Gu Liu; Ping Yao; Mu Xu; Wenting Han; Xiaoqiang Li; Xiurui Hao
Multimedia and some scientific applications have achieved good performance on stream processor architectures by employing the stream programming model. To find out how to accelerate symmetric cryptography on stream processors, we implement and analyze cryptographic algorithms on different stream processors in this paper. Four cipher algorithms (RC5, AES, TWOFISH, and 3DES) in ECB mode are implemented on three platforms: the SPI Storm SP16-G160 stream processor, the NVIDIA GeForce 9800GTX, and the Intel Core2 dual-core processor E7300. We describe the architectural differences between the two stream processors and the characteristics of their programming models. Comparing the throughput of these applications, the 9800GTX shows a 4-30x performance improvement over the E7300, while the SP16 achieves the highest power efficiency, with a 15-20x increase over the E7300 in Gops/Watt.
Archive | 2011
Wei Zhou; Hong An; Hongping Yang; Gu Liu; Mu Xu; Xiaoqiang Li
In short-term weather analysis, we use a clustering algorithm as a fundamental tool to analyze and display radar reflectivity data. Unlike ordinary parallel k-means clustering algorithms that use compute unified device architecture (CUDA), our clustering of radar reflectivity data must handle a large-scale dataset and the high-dimensional texture feature vectors that we use as the clustering space. Memory access latency therefore becomes a new bottleneck in our short-term weather analysis application, which requires real-time processing. We propose a novel parallel k-means method on graphics processing units that uses on-chip registers and shared memory to reduce the dependency on off-chip global memory. The experimental results show that our optimization approach achieves a 40× performance improvement compared to the serial code. It sharply reduces the algorithm's running time and allows it to meet the real-time requirements of short-term weather analysis applications.
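As a point of reference, the serial k-means loop that a GPU version would parallelize can be sketched as follows. This is a generic Lloyd-style iteration written for illustration, not the authors' implementation; the function name `kmeans`, the fixed iteration count, and the toy two-dimensional points are assumptions (the paper clusters high-dimensional texture feature vectors).

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point is independent, so this loop is data-parallel.
        for i, p in enumerate(points):
            labels[i] = min(
                range(len(centroids)),
                key=lambda k: sum((x - y) ** 2 for x, y in zip(p, centroids[k])),
            )
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
```

The assignment step reads every centroid for every point, so keeping centroids in fast on-chip storage rather than global memory matters more as the feature dimension grows.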
parallel and distributed computing: applications and technologies | 2009
Mu Xu; Hong An; Gu Liu; Yaobin Wang; Guang Xu; Ping Yao; Xiurui Hao; Wenting Han
The Cell Broadband Engine is a typical heterogeneous chip multiprocessor that offers potentially high performance for compute-intensive applications. Our research focuses on how to use Cell to speed up block cryptography applications. In this paper, we propose a mapping framework for block cryptography working in ECB mode and a corresponding optimization strategy. We take four algorithms (RC5, 3DES, AES, and Twofish) as benchmarks and implement them using the Cell programming language. To enhance performance, we present an optimization strategy and evaluate the effects of its methods, including compiler optimization, dual buffering, vectorization, and loop unrolling. The experiments indicate that all four algorithms obtain a 5-20 times speedup compared with traditional processors, which shows that our mapping framework and optimization strategy are effective for block cryptography algorithms.
frontier of computer science and technology | 2009
Yaobin Wang; Hong An; Jie Yan; Qi Li; Wenting Han; Li Wang; Gu Liu
Archive | 2010
Hong An; Ping Yao; Gu Liu; Guang Xu; Mu Xu; Xiaoqiang Li; Wenting Han; Qian Zhang; Hengyang Xu