Network


Hong An's latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hong An is active.

Publication


Featured research published by Hong An.


Frontier of Computer Science and Technology | 2009

A Program Behavior Study of Block Cryptography Algorithms on GPGPU

Gu Liu; Hong An; Wenting Han; Guang Xu; Ping Yao; Mu Xu; Xiurui Hao; Yaobin Wang

Recently, many studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved great performance. This paper does not focus on the performance of a specific program tuned with every available algorithmic optimization, but on the intrinsic, architectural reasons for such performance improvements on GPUs. We therefore present a study of several block encryption algorithms (AES, TRI-DES, RC5, TWOFISH, and the chained block ciphers formed by their combinations) running on a GPU using CUDA. We describe our CUDA implementations and investigate the programs' behavioral characteristics and their impact on performance in four respects. We find that the number of threads used by a CUDA program fundamentally affects overall performance. Many block encryption algorithms benefit from shared memory when its capacity is large enough to hold their lookup tables. Data stored in device memory should be organized deliberately to avoid performance degradation. In addition, communication between host and device can become a program's bottleneck. Through these analyses we hope to identify effective ways to optimize CUDA programs, as well as to reveal architectural features that would better support block encryption applications.
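
To make the shared-memory observation concrete, here is a minimal sketch (not the paper's code) of a CUDA kernel that stages a 256-entry lookup table into shared memory before each thread encrypts its own 16-byte block. The table contents and the toy round function are placeholders, not a real cipher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy substitution kernel: each thread encrypts one 16-byte block using a
// 256-entry lookup table staged into shared memory, the optimization the
// abstract credits when the tables fit on chip.
__global__ void encryptBlocks(const unsigned char* in, unsigned char* out,
                              const unsigned char* table, int nBlocks) {
    __shared__ unsigned char sTable[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        sTable[i] = table[i];                 // cooperative load, once per block
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per cipher block
    if (tid >= nBlocks) return;
    for (int b = 0; b < 16; ++b) {
        unsigned char v = in[16 * tid + b];
        for (int r = 0; r < 10; ++r)          // placeholder rounds, not real AES
            v = sTable[v] ^ (unsigned char)(r * 17);
        out[16 * tid + b] = v;
    }
}

int main() {
    const int nBlocks = 1 << 16;              // 64K blocks = 1 MiB of data
    const size_t bytes = 16 * (size_t)nBlocks;
    unsigned char *dIn, *dOut, *dTable;
    unsigned char hTable[256];
    for (int i = 0; i < 256; ++i) hTable[i] = (unsigned char)(i * 31 + 7);

    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dTable, 256);
    cudaMemset(dIn, 0xAB, bytes);
    cudaMemcpy(dTable, hTable, 256, cudaMemcpyHostToDevice);

    encryptBlocks<<<(nBlocks + 255) / 256, 256>>>(dIn, dOut, dTable, nBlocks);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(dIn); cudaFree(dOut); cudaFree(dTable);
    return 0;
}
```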


International Conference on High Performance Computing and Simulation | 2010

CuHMMer: A load-balanced CPU-GPU cooperative bioinformatics application

Ping Yao; Hong An; Mu Xu; Gu Liu; Xiaoqiang Li; Yaobin Wang; Wenting Han

GPUs have recently been used to accelerate data-parallel applications because they offer easier programmability and increased generality while maintaining tremendous memory bandwidth and computational power. Most such applications use the CPU only as a controller that decides when the GPU runs the compute-intensive tasks. This CPU-control, GPU-compute pattern wastes much of the CPU's computational power. In this paper, we present a new CPU-GPU cooperative pattern for bioinformatics applications that uses both the CPU and the GPU for computation. The pattern has two parts: 1) a load-balanced data structure that organizes the data to keep GPU efficiency high even when the length distribution of sequences in a database is very uneven; and 2) a multi-threaded code partition that schedules computation on the CPU and GPU cooperatively. Using this pattern, we develop CuHMMer, based on HMMER, one of the most important algorithms in bioinformatics. Experimental results demonstrate that CuHMMer achieves 13x to 45x speedups over available CPU implementations and also outperforms traditional CUDA implementations that use the CPU-control, GPU-compute pattern.
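
A minimal host-side sketch of the load-balancing idea described above. The Seq type, the skewed length distribution, and the 4096-residue cutoff are illustrative assumptions, not details from the paper: sequences are sorted by length so GPU batches stay length-uniform, and the irregular long tail goes to CPU worker threads.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

struct Seq { int id; int len; };

int main() {
    // Fake database: lengths drawn from a skewed distribution.
    std::vector<Seq> db;
    for (int i = 0; i < 10000; ++i)
        db.push_back({i, 50 + (i * 7919) % 8000});

    // 1) Length-sorted order keeps the threads of a warp busy for
    //    similar amounts of time on similar-length sequences.
    std::sort(db.begin(), db.end(),
              [](const Seq& a, const Seq& b) { return a.len < b.len; });

    // 2) Split: the uniform prefix goes to GPU batches; the long,
    //    uneven tail goes to CPU threads.
    const int cutoff = 4096;                  // illustrative threshold
    auto tail = std::partition_point(db.begin(), db.end(),
                  [&](const Seq& s) { return s.len <= cutoff; });

    printf("GPU gets %ld sequences; CPU threads get %ld\n",
           (long)(tail - db.begin()), (long)(db.end() - tail));
    // A CuHMMer-style runner would now launch GPU kernels over fixed-size,
    // length-sorted batches while CPU threads consume the tail.
    return 0;
}
```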


International Parallel and Distributed Processing Symposium | 2012

A Speculative HMMER Search Implementation on GPU

Xiaoqiang Li; Wenting Han; Gu Liu; Hong An; Mu Xu; Wei Zhou; Qi Li

Given the exponential growth of bioinformatics databases and the rapidly increasing popularity of GPUs for general-purpose computing, it is promising to employ GPU techniques to accelerate the sequence search process. hmmsearch, from the HMMER bioinformatics software package, is a widely used tool for sensitive profile HMM (Hidden Markov Model) searches of biological sequence databases. In this paper, we develop a speculative hmmsearch implementation on an NVIDIA Fermi GPU and apply various optimizations to it. We evaluate each enhancement in our GPU implementation to demonstrate the effectiveness of the optimization strategies. Results show that our speculative hmmsearch implementation achieves up to a 6.5x speedup over the previous fastest single-threaded SSE implementation.
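
As a rough illustration of the speculate-then-verify shape such a search can take (the paper's actual speculation concerns hmmsearch's filter pipeline, which is not reproduced here), the following sketch scores every sequence with a cheap GPU filter and counts the sequences that would proceed to exact rescoring. The additive scoring function is a stand-in, not HMMER's filter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy filter: one thread scores one database sequence.
__global__ void filterScore(const unsigned char* seqs, int seqLen, int nSeqs,
                            float* score) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSeqs) return;
    float acc = 0.f;
    for (int i = 0; i < seqLen; ++i)          // stand-in additive filter score
        acc += (float)(seqs[(size_t)s * seqLen + i] & 0x1F) - 12.f;
    score[s] = acc;
}

int main() {
    const int nSeqs = 1 << 14, seqLen = 256;
    unsigned char* dSeqs; float* dScore;
    cudaMalloc(&dSeqs, (size_t)nSeqs * seqLen);
    cudaMalloc(&dScore, nSeqs * sizeof(float));
    cudaMemset(dSeqs, 0x4E, (size_t)nSeqs * seqLen);   // fake residue data

    filterScore<<<(nSeqs + 127) / 128, 128>>>(dSeqs, seqLen, nSeqs, dScore);

    float* hScore = new float[nSeqs];
    cudaMemcpy(hScore, dScore, nSeqs * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify stage: only sequences passing the speculative filter would be
    // rescored exactly on the host (here we just count them).
    int hits = 0;
    for (int s = 0; s < nSeqs; ++s)
        if (hScore[s] > 0.f) ++hits;
    printf("%d of %d sequences pass the speculative filter\n", hits, nSeqs);

    delete[] hScore;
    cudaFree(dSeqs); cudaFree(dScore);
    return 0;
}
```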


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

FlexBFS: a parallelism-aware implementation of breadth-first search on GPU

Gu Liu; Hong An; Wenting Han; Xiaoqiang Li; Tao Sun; Wei Zhou; Xuechao Wei; Xulong Tang

In this paper, we present FlexBFS, a parallelism-aware implementation of breadth-first search on GPU. Our implementation dynamically adjusts its computing resources according to feedback on the available parallelism. We also optimize the program in three ways: (1) simplified two-level queue management, (2) a combined-kernel strategy, and (3) a specialized approach for high-degree vertices. Our experimental results show that FlexBFS achieves a 3x to 20x speedup over the fastest serial version, and outperforms both the TBB-based multi-threaded CPU version and the previously most effective GPU version on all types of input graphs.
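
For orientation, here is a minimal level-synchronous BFS on a CSR graph, the baseline pattern that FlexBFS-style codes build on; the two-level queue management, combined kernels, and high-degree specialization from the paper are deliberately omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per vertex; a vertex works only if it is on the current frontier.
__global__ void bfsLevel(const int* rowPtr, const int* colIdx, int nVerts,
                         int level, int* dist, int* changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts || dist[v] != level) return;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (dist[u] == -1) {          // unvisited neighbour
            dist[u] = level + 1;      // benign race: all writers store the same value
            *changed = 1;
        }
    }
}

int main() {
    // Tiny 5-vertex chain 0-1-2-3-4 in CSR form.
    int hRow[] = {0, 1, 3, 5, 7, 8};
    int hCol[] = {1, 0, 2, 1, 3, 2, 4, 3};
    int hDist[] = {0, -1, -1, -1, -1};

    int *dRow, *dCol, *dDist, *dChanged;
    cudaMalloc(&dRow, sizeof(hRow));   cudaMalloc(&dCol, sizeof(hCol));
    cudaMalloc(&dDist, sizeof(hDist)); cudaMalloc(&dChanged, sizeof(int));
    cudaMemcpy(dRow, hRow, sizeof(hRow), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, hCol, sizeof(hCol), cudaMemcpyHostToDevice);
    cudaMemcpy(dDist, hDist, sizeof(hDist), cudaMemcpyHostToDevice);

    for (int level = 0, changed = 1; changed; ++level) {
        changed = 0;
        cudaMemcpy(dChanged, &changed, sizeof(int), cudaMemcpyHostToDevice);
        bfsLevel<<<1, 32>>>(dRow, dCol, 5, level, dDist, dChanged);
        cudaMemcpy(&changed, dChanged, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(hDist, dDist, sizeof(hDist), cudaMemcpyDeviceToHost);
    for (int v = 0; v < 5; ++v) printf("dist[%d] = %d\n", v, hDist[v]);
    cudaFree(dRow); cudaFree(dCol); cudaFree(dDist); cudaFree(dChanged);
    return 0;
}
```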


Annual Computer Security Applications Conference | 2008

LogSPoTM: a scalable thread level speculation model based on transactional memory

Rui Guo; Hong An; Ruiling Dou; Ming Cong; Yaobin Wang; Qi Li

Thread-level speculation (TLS) and transactional memory (TM) have both been proposed to address the programmability problem of the multi-core era, and they require similar underlying support. In this paper, we propose a low-design-complexity approach that provides effective unified support for both TLS and TM by extending a scalable TM model, LogTM, to support TLS. A distributed hardware arbitration mechanism is also proposed to improve scalability. Our method takes advantage of the hardware resources already introduced for TM, resulting in a simplified hardware design, and it offers programmers the rich semantics of both TLS and TM. Five representative benchmarks are used to evaluate the performance of our TLS model under different memory access patterns, and the influence of design choices such as the interconnect is also studied. The evaluation shows that, despite its simplicity, the new system performs well on most of the benchmarks, achieving average region speedups of around 3.5 with four threads.


International Conference on Supercomputing | 2012

CRQ-based fair scheduling on composable multicore architectures

Tao Sun; Hong An; Tao Wang; Haibo Zhang; Xiufeng Sui

As different workloads require different processor resources for efficient execution, recent work has proposed composable chip multiprocessors (CCMPs), which can configure different numbers and types of processing cores at runtime. Such composable architectures, however, pose a significant new challenge for the system scheduler: how to ensure priority-based performance for each task (i.e., fairness) while exploiting the benefits of composability by dynamically changing the hardware configuration to match the parallelism of the running tasks (i.e., resource allocation). Current multicore schedulers fail to address this problem because they traditionally assume a fixed number and fixed types of cores. In this work, we introduce a centralized run queue (CRQ) and propose an efficiency-based algorithm for fair scheduling on CCMPs. First, instead of distributed per-core run queues, the CRQ simplifies scheduling and resource-allocation decisions on a CCMP, and a pipeline-like scheduling mechanism hides the large decision overhead of the centralized queue. Second, an efficiency-based dynamic priority (EDP) algorithm keeps scheduling fair on a CCMP: it not only gives homogeneous tasks performance proportional to their priorities, but also ensures that equal-priority heterogeneous tasks suffer equivalent performance slowdowns when running simultaneously. To evaluate our design, we compare EDP on a CCMP with several state-of-the-art fair schedulers on symmetric and asymmetric CMPs. Our simulation results demonstrate that, while providing good fairness, EDP on a CCMP outperforms the best-performing fair scheduler on fixed symmetric and asymmetric CMPs by up to 11.8% in user-oriented performance and by 12.5% in system throughput.
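
A host-side sketch of the dynamic-priority idea only, under the assumption that an EDP-style dynamic priority can be modeled as static priority scaled by each task's measured slowdown; the paper's actual formula and the CCMP hardware model are not reproduced here.

```cuda
#include <cstdio>
#include <queue>
#include <vector>

struct Task {
    int id;
    double priority;   // static, user-assigned
    double slowdown;   // measured co-run slowdown (>= 1, larger is worse)
};

// Order the centralized run queue by dynamic priority: equal-priority tasks
// that have been slowed down more float to the front, equalizing slowdowns.
struct ByDynamicPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority * a.slowdown < b.priority * b.slowdown;
    }
};

int main() {
    std::priority_queue<Task, std::vector<Task>, ByDynamicPriority> crq;
    crq.push({0, 2.0, 1.1});
    crq.push({1, 1.0, 1.9});   // low priority, but heavily slowed down
    crq.push({2, 2.0, 1.0});

    while (!crq.empty()) {
        Task t = crq.top(); crq.pop();
        printf("dispatch task %d (dynamic priority %.2f)\n",
               t.id, t.priority * t.slowdown);
        // A real scheduler would run t for a quantum on a core configuration
        // matched to its parallelism, re-measure its slowdown, and re-enqueue.
    }
    return 0;
}
```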


International Symposium on Parallel and Distributed Processing and Applications | 2011

A Priority-Aware NoC to Reduce Squashes in Thread Level Speculation for Chip Multiprocessors

Wenbo Dai; Hong An; Qi Li; Gongming Li; Bobin Deng; Shilei Wu; Xiaomei Li; Yu Liu

Thread-level speculation (TLS) is a technique that boosts the performance of sequential programs on chip multiprocessors (CMPs) by parallelizing them automatically, exempting programmers from the heavy task of parallel programming. Its performance, however, can suffer from frequent squashes caused by inter-thread data dependence violations. In this paper, we propose a network-on-chip (NoC) for CMPs that employs a priority-aware packet arbitration policy; packet scheduling guided by this policy reduces the occurrence of TLS squashes. Simulation results with five applications show that our policy reduces squashes by 22% in the best case and by 15% on average. Moreover, our priority-aware approach generalizes to similar scenarios in which different threads running on a CMP have different priorities.


Bio-Inspired Computing: Theories and Applications | 2010

The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU

Qian Zhang; Hong An; Gu Liu; Wenting Han; Ping Yao; Mu Xu; Xiaoqiang Li

Memory optimization is an important strategy for achieving high performance in sequence alignment implemented with CUDA on GPGPUs. The Smith-Waterman (SW) algorithm is the most sensitive algorithm widely used for local sequence alignment, but it is very time consuming. Although several parallel methods have been studied and have shown good performance, the advantages of the GPGPU memory hierarchy are still not fully exploited. This paper presents a new parallel method that uses on-chip GPGPU memory more efficiently to optimize the parallel Smith-Waterman sequence alignment presented by Gregory M. Striemer. To minimize the cost of data transfers, on-chip shared memory is used to store intermediate results, and constant memory is also used effectively in our implementation of the parallel Smith-Waterman algorithm. Using these two kinds of on-chip memory reduces long-latency memory accesses and lowers the demand on global memory when aligning longer sequences. The experimental results show a 1.66x to 3.16x speedup over Striemer's parallel SW on GPGPU in execution time, and a 19.70x average and 22.43x peak speedup over serial SW in clock cycles on our platform.
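
A minimal sketch of the two memory placements discussed above: the query resides in constant memory, broadcast to all threads, while each thread keeps its two DP rows in on-chip storage as it scans one database sequence (a Striemer-style one-thread-per-sequence layout). The sizes, the scoring values, and the use of per-thread arrays instead of the paper's shared-memory buffers are simplifications.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define QLEN 64
__constant__ unsigned char cQuery[QLEN];   // query broadcast via constant memory

__global__ void swScore(const unsigned char* db, int seqLen, int nSeqs,
                        int* best) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per sequence
    if (s >= nSeqs) return;

    int prev[QLEN + 1], curr[QLEN + 1];    // two DP rows kept on chip per thread
    for (int j = 0; j <= QLEN; ++j) prev[j] = 0;

    int m = 0;
    for (int i = 0; i < seqLen; ++i) {
        curr[0] = 0;
        unsigned char d = db[(size_t)s * seqLen + i];
        for (int j = 1; j <= QLEN; ++j) {
            int match = prev[j - 1] + (d == cQuery[j - 1] ? 2 : -1);
            int del = prev[j] - 1, ins = curr[j - 1] - 1;
            int h = max(max(match, del), max(ins, 0));  // SW recurrence, linear gaps
            curr[j] = h;
            m = max(m, h);
        }
        for (int j = 0; j <= QLEN; ++j) prev[j] = curr[j];
    }
    best[s] = m;
}

int main() {
    const int nSeqs = 4096, seqLen = 128;
    unsigned char hQuery[QLEN];
    for (int j = 0; j < QLEN; ++j) hQuery[j] = 'A' + (j % 4);
    cudaMemcpyToSymbol(cQuery, hQuery, QLEN);

    unsigned char* dDb; int* dBest;
    cudaMalloc(&dDb, (size_t)nSeqs * seqLen);
    cudaMalloc(&dBest, nSeqs * sizeof(int));
    cudaMemset(dDb, 'A', (size_t)nSeqs * seqLen);     // fake database residues

    swScore<<<(nSeqs + 63) / 64, 64>>>(dDb, seqLen, nSeqs, dBest);

    int h0;
    cudaMemcpy(&h0, dBest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("best local score of sequence 0: %d\n", h0);
    cudaFree(dDb); cudaFree(dBest);
    return 0;
}
```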


International Conference on Neural Information Processing | 2015

Optimization and Analysis of Parallel Back Propagation Neural Network on GPU Using CUDA

Yaobin Wang; Pingping Tang; Hong An; Zhiqin Liu; Kun Wang; Yong Zhou

A graphics processing unit (GPU) can achieve remarkable performance on dataset-oriented applications such as back-propagation networks (BPNs), given reasonable task decomposition and memory optimization. However, the advantages of the GPU's memory architecture have still not been fully exploited for parallel BPNs. In this paper, we develop and analyze a parallel implementation of a back-propagation neural network using CUDA, focusing on kernel optimization through shared memory and suitable block dimensions. The implementation was tested on seven well-known benchmark data sets, and the results show that promising 33.8x to 64.3x speedups can be achieved over a sequential CPU implementation.
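
The following sketch shows the kind of kernel the abstract describes: one forward layer of a BPN with the input vector tiled through shared memory, so every thread in a block reads it from on-chip storage. The layer sizes are illustrative, and the training loop and backward pass are omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 128

// y = sigmoid(W x + b), one thread per output neuron; the input vector x is
// staged through shared memory in TILE-sized chunks.
__global__ void forwardLayer(const float* W, const float* x, const float* b,
                             float* y, int nIn, int nOut) {
    __shared__ float sx[TILE];
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.f;

    for (int base = 0; base < nIn; base += TILE) {
        int chunk = min(TILE, nIn - base);
        for (int i = threadIdx.x; i < chunk; i += blockDim.x)
            sx[i] = x[base + i];               // stage the input tile on chip
        __syncthreads();
        if (o < nOut)
            for (int i = 0; i < chunk; ++i)
                acc += W[(size_t)o * nIn + base + i] * sx[i];
        __syncthreads();                       // tile reused by all threads
    }
    if (o < nOut)
        y[o] = 1.f / (1.f + expf(-(acc + b[o])));   // sigmoid activation
}

int main() {
    const int nIn = 512, nOut = 256;
    float *dW, *dx, *db_, *dy;
    cudaMalloc(&dW, (size_t)nOut * nIn * sizeof(float));
    cudaMalloc(&dx, nIn * sizeof(float));
    cudaMalloc(&db_, nOut * sizeof(float));
    cudaMalloc(&dy, nOut * sizeof(float));
    // Zero weights, inputs, and biases: every output should be sigmoid(0) = 0.5.
    cudaMemset(dW, 0, (size_t)nOut * nIn * sizeof(float));
    cudaMemset(dx, 0, nIn * sizeof(float));
    cudaMemset(db_, 0, nOut * sizeof(float));

    forwardLayer<<<(nOut + TILE - 1) / TILE, TILE>>>(dW, dx, db_, dy, nIn, nOut);

    float h0;
    cudaMemcpy(&h0, dy, sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 0.5)\n", h0);
    cudaFree(dW); cudaFree(dx); cudaFree(db_); cudaFree(dy);
    return 0;
}
```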


Advanced Parallel Programming Technologies | 2007

Balancing thread partition for efficiently exploiting speculative thread-level parallelism

Yaobin Wang; Hong An; Bo Liang; Li Wang; Ming Cong

General-purpose computing is taking an irreversible step toward on-chip parallel architectures. One way to enhance the performance of chip multiprocessors is thread-level speculation (TLS), and identifying the points at which speculative threads should be spawned is one of the critical issues for this kind of architecture. In this paper, we present a criterion for selecting regions to execute speculatively, in order to identify potential sources of speculative parallelism in general-purpose programs, and we provide a dynamic profiling method that searches a large space of TLS parallelization schemes and locates the parallelism within an application. We analyze the key factors affecting the speculative thread-level parallelism of SPEC CPU2000, evaluate whether a given application, or parts of it, is suitable for TLS, and study how to balance thread partitioning to exploit speculative thread-level parallelism efficiently. The results show that inter-thread data dependences are ubiquitous, so a synchronization mechanism is necessary, and that return-value prediction and loop unrolling are important for performance. This information can be used to guide thread partitioning for TLS.

Collaboration


Dive into Hong An's collaborations.

Top Co-Authors

Yaobin Wang, University of Science and Technology of China

Wenting Han, University of Science and Technology of China

Qi Li, University of Science and Technology of China

Gu Liu, University of Science and Technology of China

Mu Xu, University of Science and Technology of China

Zhiqin Liu, Southwest University of Science and Technology

Xiaoqiang Li, University of Science and Technology of China

Bobin Deng, University of Science and Technology of China

Gongming Li, University of Science and Technology of China

Ming Cong, University of Science and Technology of China