Yaobin Wang
University of Science and Technology of China
Publications
Featured research published by Yaobin Wang.
frontier of computer science and technology | 2009
Gu Liu; Hong An; Wenting Han; Guang Xu; Ping Yao; Mu Xu; Xiurui Hao; Yaobin Wang
Many recent studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved strong performance. This paper does not focus on maximizing the performance of a specific program through algorithmic optimization; instead, it examines the intrinsic GPU architectural features that underlie such performance improvements. We present a study of several block encryption algorithms (AES, TRI-DES, RC5, TWOFISH, and chained block ciphers formed from their combinations) running on a GPU using CUDA. We describe our CUDA implementations and investigate the programs' behavioral characteristics and their impact on performance in four respects. We find that the number of threads a CUDA program uses fundamentally affects overall performance. Many block encryption algorithms benefit from shared memory when its capacity is large enough to hold the lookup tables. Data stored in device memory should be organized deliberately to avoid performance degradation. Finally, communication between host and device can become a program's bottleneck. Through these analyses we aim to identify an effective way to optimize a CUDA program, and to reveal architectural features that would better support block encryption applications.
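The shared-memory finding is the one most directly expressible in code. Below is a minimal CUDA sketch, not the paper's implementation: a hypothetical 256-entry substitution table stands in for real cipher T-tables, and each thread block stages the table into shared memory before performing lookups.

```cuda
// Minimal sketch (not the paper's code): a hypothetical 256-entry
// substitution table stands in for real cipher lookup tables.
#include <cstdint>
#include <cstdio>

__global__ void subst_kernel(const uint8_t* in, uint8_t* out,
                             const uint8_t* table_gmem, int n) {
    __shared__ uint8_t table[256];             // per-block cached copy
    // Threads cooperatively stripe the table into shared memory,
    // then synchronize before any lookup touches it.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        table[i] = table_gmem[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = table[in[idx]];             // lookup hits shared memory
}

int main() {
    const int n = 1 << 20;
    uint8_t *in, *out, *table;
    cudaMallocManaged(&in, n);                 // unified memory for brevity
    cudaMallocManaged(&out, n);
    cudaMallocManaged(&table, 256);
    for (int i = 0; i < 256; ++i) table[i] = (uint8_t)(255 - i);
    for (int i = 0; i < n; ++i)   in[i] = (uint8_t)i;

    // The threads-per-block count is the tuning knob the abstract
    // identifies as fundamental to overall performance.
    subst_kernel<<<(n + 255) / 256, 256>>>(in, out, table, n);
    cudaDeviceSynchronize();
    printf("out[0] = %u (expect 255)\n", out[0]);
    cudaFree(in); cudaFree(out); cudaFree(table);
    return 0;
}
```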
international conference on high performance computing and simulation | 2010
Ping Yao; Hong An; Mu Xu; Gu Liu; Xiaoqiang Li; Yaobin Wang; Wenting Han
GPUs have recently been used to accelerate data-parallel applications because they offer easier programmability and increased generality while maintaining tremendous memory bandwidth and computational power. Most such applications use the CPU as a controller that decides when the GPU runs the compute-intensive tasks. This CPU-control, GPU-compute pattern wastes much of the CPU's computational power. In this paper, we present a new CPU-GPU cooperative pattern for bioinformatics applications that uses both the CPU and the GPU for computation. The pattern has two parts: 1) a load-balanced data structure that organizes the data to keep GPU computational efficiency high even when the sequence lengths in a database are very unevenly distributed; and 2) a multi-threaded code partition that schedules computation on the CPU and GPU cooperatively. Using this pattern, we develop CuHMMer, based on HMMER, one of the most important algorithms in bioinformatics. Experimental results demonstrate that CuHMMer achieves 13x to 45x speedups over available CPU implementations and also outperforms traditional CUDA implementations that use the CPU-control, GPU-compute pattern.
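As a rough illustration of the cooperative pattern, the sketch below partitions a sequence database by length: near-uniform short sequences form a dense GPU batch while long outliers go to CPU threads, and both sides compute concurrently. The cutoff, the Seq layout, and the scoring stubs are assumptions for illustration, not CuHMMer's code.

```cuda
// Hedged sketch of the CPU-GPU cooperative pattern; all constants and
// work functions are illustrative stand-ins.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct Seq { int len; };                        // residue data omitted

__global__ void score_batch_gpu(const Seq* seqs, float* scores, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scores[i] = 0.5f * seqs[i].len;  // stand-in for HMM scoring
}

static void score_one_cpu(const Seq& s, float* out) {
    *out = 0.5f * s.len;                        // same stand-in, on the host
}

int main() {
    std::vector<Seq> db(10000);
    for (size_t i = 0; i < db.size(); ++i) db[i].len = (int)(i % 2000);

    // Load balancing: near-uniform sequences form a dense GPU batch;
    // long outliers, which would stall a GPU batch, go to CPU threads.
    const int kCutoff = 1024;
    std::vector<Seq> gpu_part, cpu_part;
    for (const Seq& s : db) (s.len <= kCutoff ? gpu_part : cpu_part).push_back(s);

    Seq* d_seqs; float* d_scores;
    cudaMallocManaged(&d_seqs, gpu_part.size() * sizeof(Seq));
    cudaMallocManaged(&d_scores, gpu_part.size() * sizeof(float));
    std::copy(gpu_part.begin(), gpu_part.end(), d_seqs);

    int n = (int)gpu_part.size();
    score_batch_gpu<<<(n + 255) / 256, 256>>>(d_seqs, d_scores, n);

    // CPU threads compute their share while the GPU kernel runs.
    std::vector<float> cpu_scores(cpu_part.size());
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back([&, t] {
            for (size_t i = t; i < cpu_part.size(); i += 4)
                score_one_cpu(cpu_part[i], &cpu_scores[i]);
        });
    for (auto& th : pool) th.join();
    cudaDeviceSynchronize();                    // join the GPU side

    printf("scored %zu on GPU, %zu on CPU\n", gpu_part.size(), cpu_part.size());
    cudaFree(d_seqs); cudaFree(d_scores);
    return 0;
}
```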
annual computer security applications conference | 2008
Rui Guo; Hong An; Ruiling Dou; Ming Cong; Yaobin Wang; Qi Li
Thread-level speculation (TLS) and transactional memory (TM) have both been proposed to address the productivity problem of the multi-core era, and they require similar underlying support. In this paper, we propose a low-design-complexity approach to effective unified support for both TLS and TM by extending a scalable TM model, LogTM, to support TLS. A distributed hardware arbitration mechanism is also proposed to improve scalability. This approach reuses hardware resources introduced for TM, simplifying the hardware design, and it offers programmers the rich semantics of both TLS and TM. Five representative benchmarks are used to evaluate the performance characteristics of our TLS model under different memory access patterns, and the influence of design choices such as the interconnect is also studied. The evaluation shows that, despite its simplicity, the new system performs well on most of the benchmarks, achieving average region speedups of around 3.5x with four threads.
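The paper describes hardware support; the sketch below only emulates the programmer-visible semantics of the two idioms in software, where a global lock stands in for TM conflict detection and a ticket counter for TLS in-order commit. All names are hypothetical and no hardware mechanism (version log, arbitration) is modeled.

```cuda
// Semantics-only emulation of the TM and TLS idioms (host code).
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex g_tm_lock;
static std::atomic<int> g_next_commit{0};

// TM idiom: the region executes atomically with respect to other regions.
template <class F> void tm_atomic(F body) {
    std::lock_guard<std::mutex> g(g_tm_lock);
    body();
}

// TLS idiom: iteration `id` computes speculatively in parallel, then
// commits in original program order.
template <class Fc, class Fm> void tls_iteration(int id, Fc compute, Fm commit) {
    compute();                                  // speculative work
    while (g_next_commit.load() != id)          // wait for the commit token
        std::this_thread::yield();
    commit();                                   // now non-speculative
    g_next_commit.fetch_add(1);
}

int main() {
    long sum = 0;
    std::vector<std::thread> ts;
    for (int id = 0; id < 4; ++id)
        ts.emplace_back([&, id] {
            long local = 0;
            tls_iteration(id,
                [&] { for (int i = 0; i < 1000; ++i) local += i; },
                [&] { tm_atomic([&] { sum += local; }); });
        });
    for (auto& t : ts) t.join();
    printf("sum = %ld (expect 1998000)\n", sum);
    return 0;
}
```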
international conference on neural information processing | 2015
Yaobin Wang; Pingping Tang; Hong An; Zhiqin Liu; Kun Wang; Yong Zhou
A graphics processing unit (GPU) can achieve remarkable performance on dataset-oriented applications such as the back-propagation network (BPN), given reasonable task decomposition and memory optimization. However, the advantages of the GPU's memory architecture have not yet been fully exploited for parallelizing BPN. In this paper, we develop and analyze a parallel implementation of a back-propagation neural network using CUDA, focusing on kernel optimization through the use of shared memory and suitable block dimensions. The implementation was tested on seven well-known benchmark data sets, and the results show that promising 33.8x to 64.3x speedups can be achieved over a sequential CPU implementation.
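A hedged sketch of the kind of kernel optimization described: the input activation vector is staged in shared memory so that every output neuron's dot product reads it cheaply. The layer sizes, weights, and sigmoid choice are illustrative assumptions, not the paper's benchmark configuration.

```cuda
// Forward pass of one fully connected layer with shared-memory staging.
#include <cmath>
#include <cstdio>

#define N_IN 256   // input-layer width (small enough for shared memory)

__global__ void forward_layer(const float* w,   // [n_out x N_IN] weights
                              const float* x,   // [N_IN] input activations
                              float* y, int n_out) {
    __shared__ float xs[N_IN];
    for (int i = threadIdx.x; i < N_IN; i += blockDim.x)
        xs[i] = x[i];                     // cooperative staging
    __syncthreads();

    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o < n_out) {
        float acc = 0.f;
        for (int i = 0; i < N_IN; ++i)    // dot product reads shared memory
            acc += w[o * N_IN + i] * xs[i];
        y[o] = 1.f / (1.f + expf(-acc));  // sigmoid activation
    }
}

int main() {
    const int n_out = 512;
    float *w, *x, *y;
    cudaMallocManaged(&w, n_out * N_IN * sizeof(float));
    cudaMallocManaged(&x, N_IN * sizeof(float));
    cudaMallocManaged(&y, n_out * sizeof(float));
    for (int i = 0; i < n_out * N_IN; ++i) w[i] = 0.01f;
    for (int i = 0; i < N_IN; ++i) x[i] = 1.0f;

    forward_layer<<<(n_out + 127) / 128, 128>>>(w, x, y, n_out);
    cudaDeviceSynchronize();
    printf("y[0] = %f (sigmoid(2.56) is about 0.928)\n", y[0]);
    cudaFree(w); cudaFree(x); cudaFree(y);
    return 0;
}
```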
advanced parallel programming technologies | 2007
Yaobin Wang; Hong An; Bo Liang; Li Wang; Ming Cong
General-purpose computing is taking an irreversible step toward on-chip parallel architectures. One way to enhance the performance of chip multiprocessors is thread-level speculation (TLS), and identifying the points where speculative threads should be spawned is one of the critical issues for such architectures. In this paper, we present a criterion for selecting the regions to be speculatively executed, in order to identify potential sources of speculative parallelism in general-purpose programs. We provide a dynamic profiling method that searches a large space of TLS parallelization schemes and locates where parallelism lies within an application. We analyze the key factors affecting speculative thread-level parallelism in SPEC CPU2000, evaluate whether a given application, or parts of it, is suitable for TLS, and study how to balance thread partitioning to exploit speculative thread-level parallelism efficiently. The results show that inter-thread data dependences are ubiquitous, so a synchronization mechanism is necessary, and that return value prediction and loop unrolling are important for performance. This information can be used to guide thread partitioning for TLS.
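The toy loop below shows, in miniature, two of the findings above: a cross-iteration dependence (on sum) that a TLS runtime would have to synchronize on or value-predict, and unrolling to coarsen speculative thread granularity. The loop body and unroll factor are hypothetical, not taken from SPEC CPU2000.

```cuda
// Host-only toy loop of the kind the profiling method classifies.
#include <cstdio>

static int work(int i) { return (i * i) % 97; }  // independent per-iteration work

int main() {
    int sum = 0;
    // Candidate speculative region: each unrolled group could run as one
    // speculative thread; unrolling by 4 coarsens thread granularity and
    // amortizes spawn overhead (the loop-unrolling point above).
    for (int i = 0; i < 1000; i += 4) {
        int a = work(i), b = work(i + 1), c = work(i + 2), d = work(i + 3);
        // Inter-thread data dependence: `sum` flows between groups, so a
        // TLS runtime must synchronize on it (or value-predict it).
        sum += a + b + c + d;
    }
    printf("sum = %d\n", sum);
    return 0;
}
```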
international conference on algorithms and architectures for parallel processing | 2015
Yaobin Wang; Hong An; Zhiqin Liu; Lei Zhang; Qingfeng Wang
Although block cryptography algorithms have been parallelized on various platforms, they have not yet been explored thoroughly on speculative multicore architectures, especially under the CBC, CFB, and OFB modes. This paper presents a study of parallelizing several block cryptography algorithms (AES, 3DES, RC5, and TWOFISH) on a novel speculative multicore architecture, covering its speculative execution mechanism, architectural design, and programming model. It reports both application-level and kernel-level speculative speedups for these applications under the ECB, CBC, CFB, and OFB modes. The experimental results show that: (1) in ECB mode, all the block cryptography algorithms perform well on the speculative multicore platform, achieving speedups similar to graphics processors (GPUs) while offering friendlier programmability; (2) in CBC and CFB modes, the decryption kernels still achieve a promising 15.6x to 25.6x speedup; and (3) the computing resources of 32 cores can be used efficiently in this model.
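The CBC decryption speedups rest on a well-known structural property: encryption chains each block on the previous ciphertext block, but once all ciphertext exists, every block can be decrypted independently. The sketch below illustrates this with a toy XOR "cipher" and a CUDA kernel standing in for the paper's speculative cores; it is not the paper's platform or cipher code.

```cuda
// Why CBC decryption parallelizes while CBC encryption does not.
#include <cstdint>
#include <cstdio>

__host__ __device__ inline uint32_t toy_encrypt(uint32_t b, uint32_t k) { return b ^ k; }
__host__ __device__ inline uint32_t toy_decrypt(uint32_t b, uint32_t k) { return b ^ k; }

// Each plaintext block depends only on already-known ciphertext:
// P[i] = D(C[i]) ^ C[i-1], so one thread per block suffices.
__global__ void cbc_decrypt(const uint32_t* ct, uint32_t* pt, uint32_t key,
                            uint32_t iv, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pt[i] = toy_decrypt(ct[i], key) ^ (i == 0 ? iv : ct[i - 1]);
}

int main() {
    const int n = 1 << 16;
    const uint32_t key = 0xDEADBEEF, iv = 42;
    uint32_t *ct, *pt;
    cudaMallocManaged(&ct, n * sizeof(uint32_t));
    cudaMallocManaged(&pt, n * sizeof(uint32_t));

    // Encryption is an inherently serial chain: C[i] = E(P[i] ^ C[i-1]).
    uint32_t prev = iv;
    for (int i = 0; i < n; ++i) {
        prev = toy_encrypt((uint32_t)i ^ prev, key);
        ct[i] = prev;
    }

    cbc_decrypt<<<(n + 255) / 256, 256>>>(ct, pt, key, iv, n);
    cudaDeviceSynchronize();
    printf("pt[0]=%u pt[100]=%u (expect 0 and 100)\n", pt[0], pt[100]);
    cudaFree(ct); cudaFree(pt);
    return 0;
}
```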
international conference on algorithms and architectures for parallel processing | 2010
Hong An; Tao Sun; Ming Cong; Yaobin Wang
Technology evolution has brought about the chip multiprocessor (CMP) era, enabling architects to place an increasing number of cores on a single chip. Given this abundance of computing resources, a fundamental problem is how to map applications onto them, that is, how many cores should be assigned to each application. Because available concurrency varies widely across applications and across the execution phases of a single program, the number of allocated resources should be adjusted dynamically to maintain a high utilization rate without compromising performance. In this paper, targeting resource management in a flexible architecture, we introduce an implementation of a confidence predictor, referred to as the speculative depth estimator (SDE), which enables real-time resource tuning. Applying the speculative depth estimator to dynamic resource tuning, our experimental results show that it achieves a good trade-off between concurrency exploitation and resource utilization.
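A hedged sketch of a confidence predictor in the spirit of the SDE: a per-region saturating counter tracks commit and squash outcomes and gates how many cores the region may claim. The counter width, thresholds, and depth mapping are assumptions, not the paper's design.

```cuda
// Saturating-counter confidence predictor (host-only sketch).
#include <cstdio>

struct DepthEstimator {
    int counter = 0;                       // saturating counter in [0, 7]
    void record(bool committed) {          // commit raises confidence;
        if (committed) counter = counter < 7 ? counter + 1 : 7;
        else           counter = counter > 2 ? counter - 2 : 0;  // squash drops it faster
    }
    int depth() const {                    // speculative depth granted
        if (counter >= 6) return 8;
        if (counter >= 3) return 4;
        return 1;                          // low confidence: run serially
    }
};

int main() {
    DepthEstimator sde;
    const bool outcomes[] = {true, true, true, true, false, true, true, true};
    for (bool ok : outcomes) {
        sde.record(ok);
        printf("commit=%d -> depth %d\n", (int)ok, sde.depth());
    }
    return 0;
}
```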
parallel and distributed computing: applications and technologies | 2009
Mu Xu; Hong An; Gu Liu; Yaobin Wang; Guang Xu; Ping Yao; Xiurui Hao; Wenting Han
The Cell Broadband Engine is a typical heterogeneous chip multiprocessor that offers potentially high performance for compute-intensive applications. Our research focuses on using the Cell to speed up block cryptography applications. In this paper, we propose a mapping framework for block cryptography in ECB mode, together with a corresponding optimization strategy. We take four algorithms (RC5, 3DES, AES, and Twofish) as benchmarks and implement them in the Cell programming language. To improve performance, we evaluate the effects of optimization methods including compiler optimization, dual buffering, vectorization, and loop unrolling. The experiments indicate that all four algorithms obtain 5x to 20x speedups over traditional processors, which shows that our mapping framework and optimization strategy are effective for block cryptography algorithms.
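The Cell-specific DMA code is not reproduced here; the sketch below shows the same dual-buffering idea expressed in CUDA terms, ping-ponging two device buffers and streams so that one chunk's transfer overlaps another chunk's computation. The toy XOR kernel is a stand-in for real cipher rounds.

```cuda
// Dual buffering: overlap data movement with computation.
#include <cstdint>
#include <cstdio>

__global__ void process(uint32_t* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] ^= 0xDEADBEEF;        // stand-in for cipher rounds
}

int main() {
    const int chunk = 1 << 16, chunks = 8;
    uint32_t* h;
    cudaMallocHost(&h, chunk * chunks * sizeof(uint32_t));  // pinned host memory
    for (int i = 0; i < chunk * chunks; ++i) h[i] = i;

    uint32_t* d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunk * sizeof(uint32_t));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                      // ping-pong between two buffers
        uint32_t* src = h + c * chunk;
        cudaMemcpyAsync(d[b], src, chunk * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
        cudaMemcpyAsync(src, d[b], chunk * sizeof(uint32_t),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) cudaStreamSynchronize(s[b]);
    printf("h[1] = %u\n", h[1]);            // 1 ^ 0xDEADBEEF
    for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h);
    return 0;
}
```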
annual computer security applications conference | 2008
Li Wang; Hong An; Yaobin Wang
Dataflow predication provides lightweight, full support for predicated execution in dataflow-like architectures. One of its major overheads is the large fanout trees needed to distribute predicates to all dependent instructions. Conventional optimizations predicate only the heads or only the tails of dataflow chains: predicating tails offers more speculation but leads to resource contention and increased power consumption, while predicating heads is power-efficient but reduces speculation and instruction-level parallelism. This paper introduces a profile-guided technique that combines these optimizations, using profiling feedback to guide the compiler in deciding whether to predicate at the head or the tail of each chain. By predicating tails on hot paths and heads on infrequent paths, the technique achieves performance, power, and resource efficiency. Performance evaluation shows that the profile-guided optimization removes fanout trees more effectively, yielding a 10.6% speedup over always predicating heads and a 2.5% speedup over always predicating tails.
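A small sketch of the decision rule the profile-guided pass implies: predicate tails on hot paths (more speculation) and heads on cold paths (less wasted work). The hotness threshold and profile representation are assumptions for illustration, not the paper's compiler.

```cuda
// Profile-guided head/tail predication choice (host-only sketch).
#include <cstdio>

enum Placement { PREDICATE_HEAD, PREDICATE_TAIL };

// Choose a placement per dataflow chain from profiled path frequency.
static Placement choose(long path_count, long total_count) {
    const double kHot = 0.10;              // assumed hotness cutoff
    double freq = (double)path_count / (double)total_count;
    return freq >= kHot ? PREDICATE_TAIL : PREDICATE_HEAD;
}

int main() {
    const long total = 100000;
    const long paths[] = {60000, 25000, 9000, 6000};
    for (long c : paths)
        printf("path count %-6ld -> %s\n", c,
               choose(c, total) == PREDICATE_TAIL ? "predicate tail"
                                                  : "predicate head");
    return 0;
}
```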
frontier of computer science and technology | 2009
Yaobin Wang; Hong An; Jie Yan; Qi Li; Wenting Han; Li Wang; Gu Liu