Huayou Su
National University of Defense Technology
Publication
Featured research published by Huayou Su.
ACM Multimedia | 2009
Nan Wu; Mei Wen; Wei Wu; Ju Ren; Huayou Su; Changqing Xun; Chunyuan Zhang
Programmable processors have a great advantage over dedicated ASIC designs under intense time-to-market pressure. However, real-time encoding of high-definition (HD) H.264 video (up to 1080p) is a challenge for most existing programmable processors. On the other hand, model-based design is widely accepted for developing complex media programs. The stream model, an emerging model-based programming method, shows surprising efficiency in many compute-intensive domains, especially media processing. On this basis, this paper proposes a set of streaming techniques for H.264 encoding and then develops all of the code based on the X264 reference code. Our streaming H.264 encoder is a pure software implementation written entirely in a high-level language, without special hardware or algorithm support. Measured execution results show that our encoder achieves significant speedups over the original X264 encoder on various programmable architectures: 1.8x on an x86 Core 2 E8200, 3.7x on a MIPS 4KEc, 5.5x on a TMS320C6416 DSP, and 6.1x on the STORM-SP16 G220 stream processor. In particular, on the STORM processor the streaming encoder achieves 30.6 frames per second on a 1080p HD sequence, satisfying the real-time requirement. These results indicate that streaming is extremely efficient for this kind of media workload. Our work is also applicable to other media processing applications, and provides architectural insights for dedicated ASIC or FPGA HD H.264 encoders.
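The paper's encoder is written in a stream language for STORM rather than CUDA, but the core of the stream model — a kernel that reads input streams and writes an output stream, with no side effects — can be sketched in CUDA-style code. Below is a minimal, hypothetical illustration (all names invented, not the paper's code): a zero-motion SAD kernel in which each thread consumes one 16x16 macroblock from the current- and reference-frame streams and emits one cost value.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stream kernel: each thread computes the SAD between one
// 16x16 macroblock of the current frame and the co-located block of the
// reference frame, writing one value to the output stream.
__global__ void sad16x16(const unsigned char* cur, const unsigned char* ref,
                         int width, int mbWidth, int mbHeight, int* sadOut) {
    int mbx = blockIdx.x * blockDim.x + threadIdx.x;
    int mby = blockIdx.y * blockDim.y + threadIdx.y;
    if (mbx >= mbWidth || mby >= mbHeight) return;
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int idx = (mby * 16 + y) * width + (mbx * 16 + x);
            sad += abs((int)cur[idx] - (int)ref[idx]);
        }
    sadOut[mby * mbWidth + mbx] = sad;
}

int main() {
    const int W = 1920, H = 1088;                 // 1080p padded to the MB grid
    const int MBW = W / 16, MBH = H / 16;
    unsigned char *cur, *ref; int *sad;
    cudaMallocManaged(&cur, W * H);
    cudaMallocManaged(&ref, W * H);
    cudaMallocManaged(&sad, MBW * MBH * sizeof(int));
    for (int i = 0; i < W * H; ++i) { cur[i] = rand() & 0xff; ref[i] = rand() & 0xff; }
    dim3 block(16, 16), grid((MBW + 15) / 16, (MBH + 15) / 16);
    sad16x16<<<grid, block>>>(cur, ref, W, MBW, MBH, sad);
    cudaDeviceSynchronize();
    printf("SAD of first macroblock: %d\n", sad[0]);
    return 0;
}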
International Conference on Parallel and Distributed Systems | 2012
Nan Wu; Mei Wen; Huayou Su; Ju Ren; Chunyuan Zhang
Efficiently mapping a real-time HD video application to graphics hardware is challenging. Developers face the challenges of choosing the right parallelism model, balancing thread granularity against the massive computing resources of the GPU, and partitioning tasks between the CPU and GPU. This paper illustrates the mapping approaches through a case study of an HD H.264 encoder based on the X264 reference code, evaluated in depth on state-of-the-art CPUs and GPUs. We first split most of the computing tasks into Single-Instruction Multiple-Thread (SIMT) kernels, which are then chained through input/output data streams. We then implemented a complete H.264 encoder on the Compute Unified Device Architecture (CUDA) platform. Finally, we present methods for exploiting multi-level parallelism and memory efficiency when mapping the H.264 code, which we use to increase the efficiency of execution on GPUs. Our experimental results show that high computational efficiency on the GPU, and with it real-time encoding performance, is achieved with CUDA.
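As a rough illustration of the kernel-chaining idea — a hypothetical sketch, not the paper's actual code — the fragment below splits two encoder stages into SIMT kernels chained through a device-resident buffer, so the intermediate data stream never makes a round trip to the host.

#include <cstdio>
#include <cuda_runtime.h>

// Stage 1 (hypothetical): per-pixel residual = current - predicted.
__global__ void residualKernel(const unsigned char* cur, const unsigned char* pred,
                               short* resid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) resid[i] = (short)cur[i] - (short)pred[i];
}

// Stage 2 (hypothetical): a trivial stand-in for the transform/quantization
// stage that consumes the residual stream produced by stage 1.
__global__ void quantKernel(const short* resid, short* coeff, int n, int qstep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) coeff[i] = resid[i] / qstep;
}

int main() {
    const int n = 1920 * 1080;
    unsigned char *cur, *pred; short *resid, *coeff;
    cudaMallocManaged(&cur, n); cudaMallocManaged(&pred, n);
    cudaMallocManaged(&resid, n * sizeof(short));
    cudaMallocManaged(&coeff, n * sizeof(short));
    for (int i = 0; i < n; ++i) { cur[i] = i & 0xff; pred[i] = (i >> 1) & 0xff; }
    int threads = 256, blocks = (n + threads - 1) / threads;
    // The two SIMT kernels are chained through the device-resident residual
    // buffer: stage 2 reads what stage 1 wrote, with no host round trip.
    residualKernel<<<blocks, threads>>>(cur, pred, resid, n);
    quantKernel<<<blocks, threads>>>(resid, coeff, n, 8);
    cudaDeviceSynchronize();
    printf("coeff[0] = %d\n", coeff[0]);
    return 0;
}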
The Journal of Supercomputing | 2013
Jun Chai; Huayou Su; Mei Wen; Xing Cai; Nan Wu; Chunyuan Zhang
Bayesian inference is one of the most important methods for estimating phylogenetic trees in bioinformatics. Due to the potentially huge computational requirements, several parallel algorithms of Bayesian inference have been implemented to run on CPU-based clusters, multicore CPUs, or small clusters of CPUs and GPUs. To the best of our knowledge, however, none of the existing methods is able to simultaneously and fully utilize both CPUs and GPUs for the computations, leaving idle either the CPU part or the GPU part of modern heterogeneous supercomputers. Aiming at an optimized utilization of heterogeneous computing resources, a promising hardware architecture for future bioinformatics applications, we present a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, which combines MPI, OpenMP, and CUDA programming. The novelty of our algorithm, denoted oMC3, is its ability to use CPU cores simultaneously with GPUs for the computations, while ensuring a fair division of work between the two types of hardware components. We have implemented oMC3 based on MrBayes, one of the most popular software packages for Bayesian phylogenetic inference. Numerical experiments show that oMC3 obtains a 2.5× speedup over nMC3, a cutting-edge GPU implementation of MrBayes, on a single server consisting of two GPUs and sixteen CPU cores. Moreover, oMC3 scales nicely when 128 GPUs and 1536 CPU cores are in use.
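The essence of the hybrid scheme — CPU cores and a GPU simultaneously working on disjoint shares of per-site work — can be sketched as follows. This is a hypothetical, simplified illustration (a static 80/20 split over a toy per-site function, built with nvcc -Xcompiler -fopenmp), not the MrBayes-based oMC3 implementation.

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-site work, standing in for the phylogenetic
// likelihood evaluation at one alignment site.
__host__ __device__ inline double siteWork(double x) {
    return log(1.0 + x * x);
}

__global__ void gpuSites(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = siteWork(in[i]);
}

int main() {
    const int n = 1 << 20;
    const double gpuFraction = 0.8;            // assumed static work division
    const int nGpu = (int)(n * gpuFraction);   // sites assigned to the GPU
    double* hIn  = (double*)malloc(n * sizeof(double));
    double* hOut = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) hIn[i] = i * 1e-6;
    double *dIn, *dOut;
    cudaMalloc(&dIn,  nGpu * sizeof(double));
    cudaMalloc(&dOut, nGpu * sizeof(double));
    cudaMemcpy(dIn, hIn, nGpu * sizeof(double), cudaMemcpyHostToDevice);

    // Kernel launches are asynchronous: the GPU starts on its share ...
    gpuSites<<<(nGpu + 255) / 256, 256>>>(dIn, dOut, nGpu);

    // ... while the CPU cores process the remaining sites via OpenMP.
    #pragma omp parallel for
    for (int i = nGpu; i < n; ++i) hOut[i] = siteWork(hIn[i]);

    // Collect the GPU results (this memcpy waits for the kernel to finish).
    cudaMemcpy(hOut, dOut, nGpu * sizeof(double), cudaMemcpyDeviceToHost);
    printf("hOut[0]=%f hOut[n-1]=%f\n", hOut[0], hOut[n - 1]);
    free(hIn); free(hOut); cudaFree(dIn); cudaFree(dOut);
    return 0;
}

In the real algorithm the division would be tuned so both sides finish at about the same time; the static fraction here is purely illustrative.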
The Scientific World Journal | 2014
Huayou Su; Mei Wen; Nan Wu; Ju Ren; Chunyuan Zhang
By reorganizing the execution order and optimizing the data structures, we propose an efficient parallel framework for the H.264/AVC encoder on massively parallel architectures. We implemented the proposed framework with CUDA on an NVIDIA GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multi-resolution multi-window motion estimation, a multi-level parallel strategy that enhances the parallelism of intra coding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the H.264 encoder's workload is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program by a speedup of about 20x and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedups of the compute-intensive algorithms are proportional to the computational power of the GPU, whereas the performance of the control-intensive parts (CAVLC) is strongly tied to memory bandwidth, which offers an insight for new architecture designs.
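Of the methods listed, the direction-priority deblocking filter lends itself to a compact sketch: all vertical edges are filtered in one fully parallel pass, then all horizontal edges in a second pass. The CUDA fragment below is a hypothetical stand-in — the smoothing formula is a trivial placeholder, not the adaptive H.264 filter.

#include <cstdio>
#include <cuda_runtime.h>

// Pass 1: filter every vertical 4-pixel block boundary in parallel.
__global__ void filterVerticalEdges(unsigned char* img, int w, int h) {
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int e = blockIdx.x * blockDim.x + threadIdx.x;  // edge index
    int x = (e + 1) * 4;                            // boundary column
    if (y >= h || x >= w) return;
    unsigned char l = img[y * w + x - 1], r = img[y * w + x];
    img[y * w + x - 1] = (unsigned char)((3 * l + r + 2) >> 2);  // placeholder
    img[y * w + x]     = (unsigned char)((l + 3 * r + 2) >> 2);  // smoothing
}

// Pass 2: the same idea for horizontal boundaries, launched only after pass 1.
__global__ void filterHorizontalEdges(unsigned char* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int e = blockIdx.y * blockDim.y + threadIdx.y;
    int y = (e + 1) * 4;
    if (x >= w || y >= h) return;
    unsigned char t = img[(y - 1) * w + x], b = img[y * w + x];
    img[(y - 1) * w + x] = (unsigned char)((3 * t + b + 2) >> 2);
    img[y * w + x]       = (unsigned char)((t + 3 * b + 2) >> 2);
}

int main() {
    const int w = 1920, h = 1088;
    unsigned char* img;
    cudaMallocManaged(&img, w * h);
    for (int i = 0; i < w * h; ++i) img[i] = i & 0xff;
    dim3 t(16, 16);
    filterVerticalEdges<<<dim3((w / 4 + 15) / 16, (h + 15) / 16), t>>>(img, w, h);
    filterHorizontalEdges<<<dim3((w + 15) / 16, (h / 4 + 15) / 16), t>>>(img, w, h);
    cudaDeviceSynchronize();
    printf("img[3][4] = %d\n", img[3 * w + 4]);
    return 0;
}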
International Conference on Image and Graphics | 2011
Huayou Su; Nan Wu; Chunyuan Zhang; Mei Wen; Ju Ren
In this paper, we propose multi-level parallel intra coding for H.264/AVC based on the Compute Unified Device Architecture (CUDA). The proposed parallel algorithm improves the parallelism between 4x4 blocks within a macroblock (MB) by discarding a few prediction modes of negligible benefit. By partitioning a frame into multiple slices, the parallelism between MBs can also be exploited. In addition, a scalable parallel method for the kernels is introduced to improve the performance of the proposed intra coding. Experimental results show that a speedup of more than 20x can be achieved with the assistance of the GPU. Moreover, the entire encoder meets the real-time processing requirement for HDTV.
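A hypothetical CUDA sketch of the mode-pruning idea follows (invented names, not the paper's code): each thread evaluates only a retained subset of the nine 4x4 intra modes — here vertical, horizontal, and DC — for its block. For simplicity, the sketch predicts from original rather than reconstructed neighbor pixels, sidestepping the inter-block dependency that the paper's multi-level schedule actually resolves.

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// One thread per 4x4 block: evaluate only a pruned set of intra modes
// (0 = vertical, 1 = horizontal, 2 = DC) and keep the cheapest.
__global__ void intra4x4ModeSelect(const unsigned char* img, int w, int h,
                                   int* bestMode, int* bestCost) {
    int bx = blockIdx.x * blockDim.x + threadIdx.x;   // 4x4 block column
    int by = blockIdx.y * blockDim.y + threadIdx.y;   // 4x4 block row
    int nbx = w / 4, nby = h / 4;
    if (bx <= 0 || by <= 0 || bx >= nbx || by >= nby) return; // need neighbors
    int x0 = bx * 4, y0 = by * 4;
    const int retained[3] = {0, 1, 2};                // the pruned mode set
    int best = INT_MAX, bestM = -1;
    int dc = 0;                                       // DC from top + left pixels
    for (int k = 0; k < 4; ++k)
        dc += img[(y0 - 1) * w + x0 + k] + img[(y0 + k) * w + x0 - 1];
    dc = (dc + 4) >> 3;
    for (int m = 0; m < 3; ++m) {
        int mode = retained[m], cost = 0;
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x) {
                int pred = (mode == 0) ? img[(y0 - 1) * w + x0 + x]   // vertical
                         : (mode == 1) ? img[(y0 + y) * w + x0 - 1]   // horizontal
                         : dc;                                        // DC
                cost += abs((int)img[(y0 + y) * w + x0 + x] - pred);
            }
        if (cost < best) { best = cost; bestM = mode; }
    }
    bestMode[by * nbx + bx] = bestM;
    bestCost[by * nbx + bx] = best;
}

int main() {
    const int w = 64, h = 64, nbx = w / 4, nby = h / 4;
    unsigned char* img; int *mode, *cost;
    cudaMallocManaged(&img, w * h);
    cudaMallocManaged(&mode, nbx * nby * sizeof(int));
    cudaMallocManaged(&cost, nbx * nby * sizeof(int));
    for (int i = 0; i < w * h; ++i) img[i] = (i * 37) & 0xff;
    dim3 t(8, 8), g((nbx + 7) / 8, (nby + 7) / 8);
    intra4x4ModeSelect<<<g, t>>>(img, w, h, mode, cost);
    cudaDeviceSynchronize();
    printf("block (1,1): mode %d, cost %d\n", mode[nbx + 1], cost[nbx + 1]);
    return 0;
}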
International Conference on Cluster Computing | 2012
Mei Wen; Huayou Su; Wenjie Wei; Nan Wu; Xing Cai; Chunyuan Zhang
In cutting-edge CPU/GPU hybrid clusters, such as Tianhe-1A, the aggregate CPU computing capability may amount to up to 1/3 of the aggregate GPU computing capability. It thus goes without saying that the CPUs and GPUs should jointly carry out the computational work. However, using both hardware components effectively and simultaneously requires great care when developing the parallel implementations. The challenges include (1) finding a balanced division of the workload between the CPU and GPU sides, and (2) hiding various overheads by overlapping computations with CPU-GPU data transfers and/or MPI communications. We study these issues in the context of real-world sedimentary basin simulations. Numerical experiments show that an appropriately devised CPU-GPU hybrid implementation is able to handle a global mesh resolution of 131,072 × 131,072, and that a double-precision rate of 62 TFlops is achieved using 1024 GPUs and 12,288 CPU cores on Tianhe-1A. Such an extreme computing capability will be of great importance for carrying out high-resolution, continental-scale stratigraphic simulations in the future.
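The overlap of computation with CPU-GPU data transfers can be illustrated by the generic CUDA double-buffering pattern below — a hypothetical sketch, not the basin-simulation code: chunks of pinned host memory are cycled through two streams so the copy engines and the compute units are busy at the same time.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // stand-in for the simulation work
}

int main() {
    const int chunks = 4, chunkN = 1 << 20;
    float* h;
    cudaMallocHost(&h, (size_t)chunks * chunkN * sizeof(float)); // pinned memory
    float* d[2]; cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunkN * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int i = 0; i < chunks * chunkN; ++i) h[i] = (float)i;
    int threads = 256, blocks = (chunkN + threads - 1) / threads;
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                       // ping-pong between the two buffers
        float* hc = h + (size_t)c * chunkN;
        // The H2D copy, kernel, and D2H copy of chunk c are queued in stream b;
        // they overlap with the other stream's work on the previous chunk.
        cudaMemcpyAsync(d[b], hc, chunkN * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        compute<<<blocks, threads, 0, s[b]>>>(d[b], chunkN);
        cudaMemcpyAsync(hc, d[b], chunkN * sizeof(float), cudaMemcpyDeviceToHost, s[b]);
        // In a hybrid code, the MPI exchange for an earlier chunk could be
        // posted here, hiding network latency behind the GPU work.
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);
    return 0;
}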
5th International Conference on Embedded and Multimedia Computing | 2010
Ju Ren; Yi He; Huayou Su; Mei Wen; Nan Wu; Chunyuan Zhang
Intra prediction is the most compute-intensive component of the H.264 intra-frame coder. Its high computational cost puts huge pressure on most current embedded programmable processors, especially for real-time HD H.264 video encoding. The stream processing model, an emerging parallel processing model supported by GPUs and most programmable processors, bridges the gap between flexible programmable processors and high-performance special-purpose processors. This paper presents parallel streaming intra prediction on programmable processors. Two methods, 7-step block-parallel and macroblock-parallel, are designed to cope with the restrictions on parallelism. Experimental results for full HD H.264 encoding show that our parallel streaming intra prediction methods achieve clear speedups (more than 5x) on different programmable processors. Moreover, the entire streaming H.264 encoder with our parallel intra prediction achieves real-time performance when encoding full HD H.264 video sequences.
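The 7-step block-parallel method is plausibly an anti-diagonal wavefront over the 4x4 grid of 4x4 blocks inside a macroblock: if the modes that need the top-right neighbor are pruned (as in the companion intra-coding work above), a block at row r, column c depends only on neighbors with smaller r + c, so the seven diagonals r + c = 0..6 can be processed in order, with up to four independent blocks per step. A hypothetical CUDA sketch of that schedule, with simplified raster block numbering:

#include <cstdio>
#include <cuda_runtime.h>

// One thread per 4x4 block of one macroblock; the seven anti-diagonals are
// processed in order, with a barrier between steps.
__global__ void intraMacroblockWavefront(int* step) {
    int r = threadIdx.y, c = threadIdx.x;
    for (int s = 0; s <= 6; ++s) {
        if (r + c == s) {
            // Here the real code would predict and reconstruct block (r, c);
            // its left/top/top-left neighbors lie on diagonals < s, so they
            // are already done at this point.
            step[r * 4 + c] = s;            // record the step, for illustration
        }
        __syncthreads();                    // wavefront barrier between steps
    }
}

int main() {
    int* step;
    cudaMallocManaged(&step, 16 * sizeof(int));
    intraMacroblockWavefront<<<1, dim3(4, 4)>>>(step);
    cudaDeviceSynchronize();
    for (int r = 0; r < 4; ++r)             // print the 7-step schedule
        printf("%d %d %d %d\n", step[r * 4], step[r * 4 + 1],
               step[r * 4 + 2], step[r * 4 + 3]);
    return 0;
}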
International Supercomputing Conference | 2013
Huayou Su; Nan Wu; Mei Wen; Chunyuan Zhang; Xing Cai
Aiming at a close examination of the OpenCL performance myth, we study in this paper OpenCL implementations of several representative 3D stencil computations. We find that typical optimization techniques such as array padding, plane sweeping, and chunking give the OpenCL implementations performance boosts similar to those obtained in the corresponding CUDA programs. The key to good performance lies in maximizing the use of a GPU's on-chip resources, the same for both OpenCL and CUDA programming. In most cases, the achieved FLOPS rates on NVIDIA's Fermi and Kepler GPUs are fully comparable between the two programming alternatives. For four typical 3D stencil computations, the OpenCL implementations are on average 9% and 2% faster than their CUDA counterparts on the GTX 590 and the Tesla K20, respectively. At the moment, the only clear advantage of CUDA programming for stencil computations arises from CUDA's ability to use the read-only data cache on NVIDIA's Kepler GPUs. The skepticism about OpenCL's GPU performance thus seems unjustified for 3D stencil computations.
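Of the techniques named, plane sweeping is easy to sketch: each thread owns one (x, y) column of the 3D grid and marches along z, carrying the lower/current/upper values in registers so each input element is loaded from global memory only once per sweep. A hypothetical CUDA version for a 7-point stencil follows (array padding and chunking omitted; coefficients invented).

#include <cstdio>
#include <cuda_runtime.h>

__global__ void stencil7(const float* in, float* out, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    size_t plane = (size_t)nx * ny;
    size_t idx = (size_t)y * nx + x;
    float below = in[idx];            // value at z = 0
    float cur   = in[idx + plane];    // value at z = 1
    for (int z = 1; z < nz - 1; ++z) {
        float above = in[idx + (size_t)(z + 1) * plane];
        size_t c = idx + (size_t)z * plane;
        out[c] = 0.4f * cur
               + 0.1f * (below + above + in[c - 1] + in[c + 1]
                         + in[c - nx] + in[c + nx]);
        below = cur; cur = above;     // shift the register plane upward
    }
}

int main() {
    const int nx = 64, ny = 64, nz = 64;
    size_t n = (size_t)nx * ny * nz;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) in[i] = (float)(i % 7);
    dim3 t(16, 16), g((nx + 15) / 16, (ny + 15) / 16);
    stencil7<<<g, t>>>(in, out, nx, ny, nz);
    cudaDeviceSynchronize();
    printf("out at center: %f\n", out[(size_t)32 * nx * ny + 32 * nx + 32]);
    return 0;
}

The same kernel structure translates almost line for line into an OpenCL work-item function, which is consistent with the paper's finding that the two alternatives perform comparably.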
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Jun Chai; Johan Hake; Nan Wu; Mei Wen; Xing Cai; Glenn T. Lines; Jing Yang; Huayou Su; Chunyuan Zhang; Xiangke Liao
Numerical simulation of subcellular Ca2+ dynamics with a resolution down to one nanometre can be an important tool for discovering the physiological causes of many heart diseases. The enormous computational requirements, however, have so far made such simulations prohibitive. By using up to 12,288 Intel Xeon Phi 31S1P coprocessors on the hybrid cluster Tianhe-2, the world's number one supercomputer at the time, we achieved 1.27 Pflop/s in double precision, which brings us much closer to nanometre resolution. This is the result of using the hardware efficiently on different levels: (1) a single Xeon Phi, (2) a single compute node consisting of a host and three coprocessors, and (3) a huge number of interconnected nodes. To overcome the challenge of programming Intel's new many-integrated-core (MIC) architecture, we adopted techniques such as vectorization, hierarchical data blocking, register data reuse, offloading computations to the coprocessors, and pipelining computations with intra-/inter-node communications.
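Two of the listed techniques, data blocking and vectorization, can be sketched in plain C++ with OpenMP (a toy 1D update, not the paper's MIC code; the offload and pipelining layers are omitted): the loop is tiled into cache-sized blocks, and each block's inner loop is handed to the compiler's vectorizer.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20, blockSize = 4096;   // block chosen to fit in cache
    std::vector<float> u(n, 1.0f), v(n, 0.0f);
    #pragma omp parallel for                    // blocks spread across cores
    for (int b = 1; b < n - 1; b += blockSize) {
        int end = (b + blockSize < n - 1) ? b + blockSize : n - 1;
        #pragma omp simd                        // vectorize the inner loop
        for (int i = b; i < end; ++i)
            v[i] = 0.5f * u[i] + 0.25f * (u[i - 1] + u[i + 1]);
    }
    printf("v[1] = %f\n", v[1]);
    return 0;
}

On the Xeon Phi, where the 512-bit vector units dominate the peak performance, making such inner loops vectorizable is essential.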
Parallel and Distributed Computing: Applications and Technologies | 2012
Qiang Lan; Changqing Xun; Mei Wen; Huayou Su; Lifang Liu; Chunyuan Zhang
OpenCL provides a unified programming interface for various parallel computing platforms. The OpenCL framework offers good functional portability: programs can run without modification on any platform that supports OpenCL. However, most OpenCL programs are optimized for massively parallel processors such as GPUs, and it is hard to achieve good performance on general multi-core processors without sophisticated modification of these GPU-specific OpenCL programs. The major reason is the immense gap between CPU and GPU architectures. In this paper, we evaluate the performance portability of OpenCL programs between CPUs and GPUs, and analyse why GPU-specific OpenCL programs are a poor fit for CPUs. Based on profiling, we propose three strategies for improving the performance of GPU-specific OpenCL programs on CPUs: increasing the granularity of task partitioning, optimizing the use of the memory hierarchy, and block-based data access. We applied the proposed techniques to several benchmarks. The experimental results show that the optimized OpenCL programs achieve speedups of 2x to 4x on CPUs compared with their GPU-specific counterparts.
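The first strategy, increasing the granularity of task partitioning, amounts to having each work-item process many elements instead of one. The paper works in OpenCL; the hypothetical sketch below expresses the same contrast in CUDA syntax, where a grid-stride loop lets a small number of long-lived threads — the shape that suits a multi-core CPU — cover the whole index range.

#include <cstdio>
#include <cuda_runtime.h>

// Fine-grained GPU-style partition: one element per thread.
__global__ void scaleFine(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Coarse-grained partition: each thread strides over many elements, so the
// same kernel runs correctly with far fewer, heavier threads.
__global__ void scaleCoarse(const float* in, float* out, int n, float a) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    scaleFine<<<(n + 255) / 256, 256>>>(in, out, n, 2.0f);   // ~1M fine tasks
    scaleCoarse<<<8, 128>>>(in, out, n, 2.0f);               // ~1K coarse tasks
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);
    return 0;
}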