Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Wenting Han is active.

Publication


Featured research published by Wenting Han.


frontier of computer science and technology | 2009

A Program Behavior Study of Block Cryptography Algorithms on GPGPU

Gu Liu; Hong An; Wenting Han; Guang Xu; Ping Yao; Mu Xu; Xiurui Hao; Yaobin Wang

Recently, many studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved great performance. This paper does not focus on the performance of a specific program tuned with every available algorithmic optimization, but on the intrinsic GPU architectural features behind such performance improvements. We therefore present a study of several block encryption algorithms (AES, TRI-DES, RC5, TWOFISH, and the chained block ciphers formed by their combinations) running on a GPU using CUDA. We introduce our CUDA implementations and investigate the programs' behavioral characteristics and their impact on performance in four aspects. We find that the number of threads used by a CUDA program fundamentally affects overall performance. Many block encryption algorithms benefit from shared memory if its capacity is large enough to hold the lookup tables. Data stored in device memory should be organized deliberately to avoid performance degradation. In addition, the communication between host and device may turn out to be the bottleneck of a program. Through these analyses we hope to identify an effective way to optimize a CUDA program, as well as to reveal architectural features that would better support block encryption applications.
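
A minimal sketch of the shared-memory point made in this abstract: staging a cipher's lookup table into shared memory so the threads of a block avoid repeated global-memory reads. This is not the paper's code; the table size, the one-thread-per-ECB-block mapping, and encrypt_word() are illustrative assumptions.

```cuda
// Hypothetical sketch: cooperative load of a 256-entry lookup table into
// shared memory, then one thread encrypts one ECB block from it.
__device__ unsigned int encrypt_word(unsigned int w, const unsigned int *table) {
    // Stand-in for one table-driven round of a real block cipher.
    return table[w & 0xFF] ^ (w >> 8);
}

__global__ void ecb_encrypt(const unsigned int *d_in, unsigned int *d_out,
                            const unsigned int *d_table, int n_blocks) {
    __shared__ unsigned int s_table[256];        // small enough for shared memory
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_table[i] = d_table[i];                 // cooperative load by the block
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_blocks)                          // one thread per ECB block
        d_out[idx] = encrypt_word(d_in[idx], s_table);
}
```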


international conference on high performance computing and simulation | 2010

CuHMMer: A load-balanced CPU-GPU cooperative bioinformatics application

Ping Yao; Hong An; Mu Xu; Gu Liu; Xiaoqiang Li; Yaobin Wang; Wenting Han

GPUs have recently been used to accelerate data-parallel applications because they offer easier programmability and increased generality while maintaining tremendous memory bandwidth and computational power. Most of these applications use the CPU as a controller that decides when the GPU runs the compute-intensive tasks. This CPU-control-GPU-compute pattern wastes much of the CPU's computational power. In this paper, we present a new CPU-GPU cooperative pattern for bioinformatics applications that uses both the CPU and the GPU for computation. This pattern has two parts: 1) a load-balanced data structure that manages data to keep the GPU's computational efficiency high even when the length distribution of sequences in a database is very uneven; and 2) a multi-threaded code partition that schedules computation on the CPU and GPU cooperatively. Using this pattern, we develop CuHMMer, based on HMMER, one of the most important algorithms in bioinformatics. The experimental results demonstrate that CuHMMer achieves a 13x to 45x speedup over available CPU implementations and also outperforms traditional CUDA implementations that use the CPU-control-GPU-compute pattern.
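
A hypothetical host-side sketch (not CuHMMer's actual data structure) of the load-balancing idea described above: bucket database sequences by length so GPU batches contain similar-sized work, and route the long outliers to CPU threads. The cutoff and the Seq type are assumptions.

```cuda
#include <algorithm>
#include <string>
#include <vector>

struct Seq { std::string data; };

// Split the database into a GPU batch of similar-length sequences and a CPU
// batch of long outliers, so GPU warps stay well utilized.
void partition_work(std::vector<Seq> &db, size_t gpu_len_cutoff,
                    std::vector<Seq*> &gpu_batch, std::vector<Seq*> &cpu_batch) {
    // Sorting by length keeps adjacent GPU threads on similar-sized sequences.
    std::sort(db.begin(), db.end(),
              [](const Seq &a, const Seq &b) { return a.data.size() < b.data.size(); });
    for (Seq &s : db) {
        if (s.data.size() <= gpu_len_cutoff)
            gpu_batch.push_back(&s);   // uniform work: good GPU efficiency
        else
            cpu_batch.push_back(&s);   // outliers handled by CPU threads
    }
}
```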


international parallel and distributed processing symposium | 2012

A Speculative HMMER Search Implementation on GPU

Xiaoqiang Li; Wenting Han; Gu Liu; Hong An; Mu Xu; Wei Zhou; Qi Li

Due to exponentially growing bioinformatics databases and the rapidly increasing popularity of GPUs for general-purpose computing, it is promising to employ GPU techniques to accelerate the sequence search process. hmmsearch, from the HMMER bioinformatics software package, is a widely used tool for sensitive profile HMM (Hidden Markov Model) searches of biological sequence databases. In this paper, we present a speculative hmmsearch implementation on an NVIDIA Fermi GPU and apply various optimizations to it. We evaluate the enhancements in our GPU implementation to demonstrate the effectiveness of the optimization strategies. Results show that our speculative hmmsearch implementation achieves up to a 6.5x speedup over the previous fast single-threaded SSE implementation.


acm sigplan symposium on principles and practice of parallel programming | 2012

FlexBFS: a parallelism-aware implementation of breadth-first search on GPU

Gu Liu; Hong An; Wenting Han; Xiaoqiang Li; Tao Sun; Wei Zhou; Xuechao Wei; Xulong Tang

In this paper, we present FlexBFS, a parallelism-aware implementation of breadth-first search on GPU. Our implementation dynamically adjusts its computation resources according to feedback on the available parallelism. We also optimize the program in three ways: (1) simplified two-level queue management, (2) a combined-kernel strategy, and (3) a high-degree-vertex specialization approach. Our experimental results show that FlexBFS achieves a 3x to 20x speedup over the fastest serial version, and outperforms both the TBB-based multi-threaded CPU version and the previously most effective GPU version on all types of input graphs.
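
For context, a minimal sketch of one level-synchronous BFS step of the kind a GPU BFS builds on; it is not FlexBFS itself and omits the two-level queue and kernel-combining optimizations. The CSR arrays (row_ptr, col_idx), the frontier queues, and depth[] (initialized to -1) are assumed names.

```cuda
// One BFS level: each thread expands one frontier vertex and pushes newly
// visited neighbors to the next-level queue via an atomic counter.
__global__ void bfs_level(const int *row_ptr, const int *col_idx,
                          const int *frontier, int frontier_size,
                          int *next_frontier, int *next_size,
                          int *depth, int level) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= frontier_size) return;

    int v = frontier[tid];
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int u = col_idx[e];
        // Claim an unvisited neighbor exactly once.
        if (atomicCAS(&depth[u], -1, level + 1) == -1) {
            int pos = atomicAdd(next_size, 1);
            next_frontier[pos] = u;          // enqueue for the next level
        }
    }
}
```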


bio-inspired computing: theories and applications | 2010

The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU

Qian Zhang; Hong An; Gu Liu; Wenting Han; Ping Yao; Mu Xu; Xiaoqiang Li

Memory optimization is an important strategy for achieving high performance in sequence alignment implemented with CUDA on GPGPU. The Smith-Waterman (SW) algorithm is the most sensitive algorithm widely used for local sequence alignment, but it is very time consuming. Although several parallel methods have been used in previous studies and have shown good performance, the advantages of the GPGPU memory hierarchy are still not fully exploited. This paper presents a new parallel method on GPGPU that uses on-chip memory more efficiently to optimize the parallel Smith-Waterman sequence alignment presented by Gregory M. Striemer. To minimize the cost of data transfers, on-chip shared memory is used to store intermediate results. Constant memory is also used effectively in our implementation of the parallel Smith-Waterman algorithm. Using these two kinds of on-chip memory reduces long-latency memory accesses and lowers the demand on global memory when aligning longer sequences. The experimental results show a 1.66x to 3.16x speedup over Striemer's parallel SW on GPGPU in terms of execution time, and a 19.70x average (22.43x peak) speedup over serial SW in terms of clock cycles on our platform.
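
A hypothetical sketch (not the paper's kernel) of the memory placement the abstract describes: the query sits in constant memory and each thread keeps its previous DP row in shared memory while scoring one database sequence. QLEN, THREADS, the toy match/mismatch scores, and the linear gap penalty are assumptions; the kernel assumes it is launched with blockDim.x == THREADS.

```cuda
#define QLEN    64
#define THREADS 64

__constant__ char c_query[QLEN];               // read-only query sequence

__global__ void sw_score(const char *d_db, const int *d_offsets,
                         const int *d_lengths, int *d_best, int n_seqs) {
    __shared__ int s_row[THREADS][QLEN + 1];   // previous DP row, per thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_seqs) return;

    int *row = s_row[threadIdx.x];
    for (int j = 0; j <= QLEN; ++j) row[j] = 0;

    const char *seq = d_db + d_offsets[tid];
    int best = 0;
    for (int i = 0; i < d_lengths[tid]; ++i) {
        int diag = 0;                          // H(i-1, j-1)
        for (int j = 1; j <= QLEN; ++j) {
            int match = (seq[i] == c_query[j - 1]) ? 2 : -1;  // toy scores
            int h = max(0, diag + match);
            h = max(h, row[j] - 1);            // gap in the database sequence
            h = max(h, row[j - 1] - 1);        // gap in the query
            diag = row[j];                     // save H(i-1, j) for next column
            row[j] = h;
            best = max(best, h);
        }
    }
    d_best[tid] = best;                        // local alignment score
}
```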


international parallel and distributed processing symposium | 2014

A Criticality-Aware DVFS Runtime Utility for Optimizing Power Efficiency of Multithreaded Applications

Haibo Zhang; Wenting Han; Feng Li; Songtao He; Yichao Cheng; Hong An; Zhitao Chen

The performance bottleneck in multithreaded programs usually depends on critical threads. We propose a runtime utility that identifies critical threads using state-of-the-art methods and optimizes power and performance by scaling their frequency. As a result, this runtime utility helps processors achieve higher power efficiency, saving about 14% energy and reducing EDP by 11.7% on average compared with the ondemand governor offered by the Linux kernel. Meanwhile, it consumes 15% and 25% less power than the ondemand and performance governors, respectively. Since our utility relies only on DVFS and a few performance monitoring units (PMUs), which most modern processors provide, it can easily be ported to various systems.
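
A hypothetical host-side sketch of the DVFS knob such a utility would turn: slowing down cores that run non-critical threads through the Linux cpufreq sysfs interface (the standard scaling_setspeed file of the userspace governor, which requires root). The criticality check is stubbed out here; a real utility would derive it from PMU-based progress counters, and the core count and frequencies below are assumptions.

```cuda
#include <cstdio>

// Placeholder for PMU-driven criticality detection.
bool core_runs_critical_thread(int cpu) { return cpu == 0; }

// Write a target frequency (kHz) to the cpufreq userspace-governor knob.
void set_cpu_khz(int cpu, long khz) {
    char path[128];
    std::snprintf(path, sizeof(path),
                  "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    if (FILE *f = std::fopen(path, "w")) {
        std::fprintf(f, "%ld\n", khz);
        std::fclose(f);
    }
}

int main() {
    const int n_cpus = 4;                      // assumption for this sketch
    for (int cpu = 0; cpu < n_cpus; ++cpu)     // critical core stays fast
        set_cpu_khz(cpu, core_runs_critical_thread(cpu) ? 2400000 : 1600000);
    return 0;
}
```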


intelligent information hiding and multimedia signal processing | 2009

Performance and Power Efficiency Analysis of the Symmetric Cryptograph on Two Stream Processor Architectures

Guang Xu; Hong An; Gu Liu; Ping Yao; Mu Xu; Wenting Han; Xiaoqiang Li; Xiurui Hao

Multimedia and some scientific applications have achieved good performance on stream processor architectures by employing the stream programming model. To find ways to accelerate symmetric cryptography on stream processors, we implement and analyze cryptographic algorithms on different stream processors in this paper. Four cipher algorithms, RC5, AES, TWOFISH, and 3DES in ECB mode, are implemented on three platforms: the SPI Storm SP16-G160 stream processor, the NVIDIA GeForce 9800GTX, and the Intel Core2 dual-core processor E7300. We describe the architectural differences between the two stream processors and the characteristics of their programming models. Comparing the throughput of these applications, the 9800GTX shows a 4-30x performance improvement over the E7300, while the SP16 achieves the highest power efficiency, with a 15-20x increase over the E7300 in Gops/Watt.


international conference on supercomputing | 2012

Distributed replay protocol for distributed uniprocessors

Mengjie Mao; Hong An; Bobin Deng; Tao Sun; Xuechao Wei; Wei Zhou; Wenting Han

Data speculation has been heavily exploited in various scenarios of architecture design. It bridges the time or space gap between data producers and data consumers, giving processors opportunities for significant speedups. However, large instruction windows, deep pipelines, and increasing on-chip communication latency make data misspeculation very expensive in modern processors. This paper proposes a Distributed Replay Protocol (DRP) that addresses data misspeculation in a distributed uniprocessor named TFlex. The partitioned nature of distributed uniprocessors aggravates the penalty of data misspeculation. After detecting misspeculation, DRP avoids squashing the pipeline; instead, it retains all instructions in the window and selectively replays only those that depend on the misspeculated data. As one possible use of DRP, we apply it to recovery from data dependence speculation. We also summarize the challenges of implementing a selective replay mechanism on distributed uniprocessors, and propose two variations of DRP that effectively address them. The evaluation results show that, without data speculation, DRP achieves 99% of the performance of perfect memory disambiguation. It speeds up diverse applications over the baseline TFlex (with a state-of-the-art data dependence predictor) by a geometric mean of 24%.


parallel and distributed computing: applications and technologies | 2009

The Mapping Framework and Optimizing Strategy for Block Cryptography Algorithms on Cell Broadband Engine

Mu Xu; Hong An; Gu Liu; Yaobin Wang; Guang Xu; Ping Yao; Xiurui Hao; Wenting Han

The Cell Broadband Engine is a typical heterogeneous chip multiprocessor that offers potentially high performance for compute-intensive applications. Our research focuses on how to use the Cell to speed up block cryptography applications. In this paper, we propose a mapping framework for block cryptography working in ECB mode, together with a corresponding optimizing strategy. We take four algorithms (RC5, 3DES, AES, and Twofish) as benchmarks and implement them in the Cell programming language. To enhance performance, we present an optimizing strategy and evaluate the effects of the optimization methods, including compiler optimization, dual buffering, vectorization, and loop unrolling. The experiments indicate that all four algorithms obtain a 5-20x speedup compared with traditional processors, which shows that our mapping framework and optimizing strategy are effective for block cryptography algorithms.
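
The paper applies dual (double) buffering on the Cell SPE's DMA engine; as an analogous illustration only, and on a different platform than the paper's, the sketch below expresses the same overlap of data transfer and computation using two CUDA streams. encrypt_chunk() and the chunk sizes are placeholders, and h_in would need to be pinned (cudaHostAlloc) for the copies to actually overlap.

```cuda
#include <cuda_runtime.h>

__global__ void encrypt_chunk(unsigned int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] ^= 0xA5A5A5A5u;           // stand-in for a real cipher
}

// Double buffering: while one chunk is being copied in, the previous chunk
// is encrypted in the other stream.
void process(const unsigned int *h_in, int n_chunks, int chunk) {
    unsigned int *d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(unsigned int));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < n_chunks; ++c) {
        int b = c & 1;                          // ping-pong between two buffers
        cudaMemcpyAsync(d_buf[b], h_in + (size_t)c * chunk,
                        chunk * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, s[b]);
        encrypt_chunk<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d_buf[b]);
    }
}
```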


international association of computer science and information technology | 2009

Localizing Loads Execution in a Data Cache Distributed Processor Architecture

Ruiling Dou; Hong An; Rui Guo; Wenting Han; Ming Cong

To scale with on-chip wire delay, the inside of a processor will be further partitioned into more small banks. As the number of data cache banks grows, load routing latency will contribute a considerable portion of program execution time. In a tiled processor like T-Flex, we observed that the average load routing latency (for loads that hit in the data cache) can be largely reduced (by 72.1%) when data is perfectly placed where the load issues. To reduce the long routing latency of critical loads, we present a solution for localizing load execution at the issuing side. However, this method incurs the overhead of data copies and extra communication to maintain coherence and memory ordering. We explore the design space for localizing data access, with particular attention to maximizing the benefits at a relatively small overhead. We observed that the access frequency and load/store behavior of different copies vary widely, with large numbers of successive load accesses concentrated on a small set of data blocks. Our experiments show that, with appropriate replication and data-copy invalidation strategies, the copying overhead can be controlled while maintaining considerable performance gains.

Collaboration


Dive into Wenting Han's collaborations.

Top Co-Authors

Hong An
University of Science and Technology of China

Gu Liu
University of Science and Technology of China

Mu Xu
University of Science and Technology of China

Xiaoqiang Li
University of Science and Technology of China

Ping Yao
University of Science and Technology of China

Guang Xu
University of Science and Technology of China

Wei Zhou
University of Science and Technology of China

Yaobin Wang
University of Science and Technology of China

Qi Li
University of Science and Technology of China

Qian Zhang
University of Science and Technology of China