Huayou Su
National University of Defense Technology
Publication
Featured research published by Huayou Su.
ACM Multimedia | 2009
Nan Wu; Mei Wen; Wei Wu; Ju Ren; Huayou Su; Changqing Xun; Chunyuan Zhang
Programmable processors have a great advantage over dedicated ASIC designs under intense time-to-market pressure. However, real-time encoding of high-definition (HD) H.264 video (up to 1080p) is a challenge for most existing programmable processors. On the other hand, model-based design is widely accepted for developing complex media programs. The stream model, an emerging model-based programming method, shows surprising efficiency in many compute-intensive domains, especially media processing. On this basis, this paper proposes a set of streaming techniques for H.264 encoding and then develops all of the code based on the X264 reference code. Our streaming H.264 encoder is a pure software implementation written entirely in a high-level language, without special hardware or algorithm support. Measured execution results show that our encoder achieves significant speedups over the original X264 encoder on various programmable architectures: 1.8x on an x86 Core 2 E8200, 3.7x on a MIPS 4KEc, 5.5x on a TMS320C6416 DSP, and 6.1x on the STORM-SP16 G220 stream processor. In particular, on the STORM processor the streaming encoder achieves 30.6 frames per second on a 1080p HD sequence, satisfying the real-time requirement. These results indicate that streaming is extremely efficient for this kind of media workload. Our work is also applicable to other media processing applications, and provides architectural insights for dedicated ASIC or FPGA HD H.264 encoders.
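The paper's encoder is written in a stream language for STORM rather than CUDA, but the core of the stream model — a kernel that reads input streams and writes an output stream, with no side effects — can be sketched in CUDA-style code. Below is a minimal, hypothetical illustration (all names invented, not the paper's code): a zero-motion SAD kernel in which each thread consumes one 16x16 macroblock from the current- and reference-frame streams and emits one cost value.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stream kernel: each thread computes the SAD between one
// 16x16 macroblock of the current frame and the co-located block of the
// reference frame, writing one value to the output stream.
__global__ void sad16x16(const unsigned char* cur, const unsigned char* ref,
                         int width, int mbWidth, int mbHeight, int* sadOut) {
    int mbx = blockIdx.x * blockDim.x + threadIdx.x;
    int mby = blockIdx.y * blockDim.y + threadIdx.y;
    if (mbx >= mbWidth || mby >= mbHeight) return;
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int idx = (mby * 16 + y) * width + (mbx * 16 + x);
            sad += abs((int)cur[idx] - (int)ref[idx]);
        }
    sadOut[mby * mbWidth + mbx] = sad;
}

int main() {
    const int W = 1920, H = 1088;                 // 1080p padded to the MB grid
    const int MBW = W / 16, MBH = H / 16;
    unsigned char *cur, *ref; int *sad;
    cudaMallocManaged(&cur, W * H);
    cudaMallocManaged(&ref, W * H);
    cudaMallocManaged(&sad, MBW * MBH * sizeof(int));
    for (int i = 0; i < W * H; ++i) { cur[i] = rand() & 0xff; ref[i] = rand() & 0xff; }
    dim3 block(16, 16), grid((MBW + 15) / 16, (MBH + 15) / 16);
    sad16x16<<<grid, block>>>(cur, ref, W, MBW, MBH, sad);
    cudaDeviceSynchronize();
    printf("SAD of first macroblock: %d\n", sad[0]);
    return 0;
}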
International Conference on Parallel and Distributed Systems | 2012
Nan Wu; Mei Wen; Huayou Su; Ju Ren; Chunyuan Zhang
Efficiently mapping a real-time HD video application to graphics hardware is challenging. Developers face the challenges of choosing the right parallelism model, balancing thread granularity against the massive computing resources of the GPU, and partitioning tasks between the CPU and GPU. This paper illustrates the mapping approaches through a case study of an HD H.264 encoder based on the X264 reference code, evaluated in depth on state-of-the-art CPUs and GPUs. We first split most of the computing tasks into Single-Instruction Multiple-Thread (SIMT) kernels, which are then chained through input/output data streams. We then implemented a complete H.264 encoder on the Compute Unified Device Architecture (CUDA) platform. Finally, we present methods for exploiting multi-level parallelism and memory efficiency when mapping the H.264 code, which we use to increase the efficiency of execution on GPUs. Our experimental results show that high computational efficiency on the GPU, and with it real-time encoding performance, is achieved with CUDA.
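As a rough illustration of the kernel-chaining idea — a hypothetical sketch, not the paper's actual code — the fragment below splits two encoder stages into SIMT kernels chained through a device-resident buffer, so the intermediate data stream never makes a round trip to the host.

#include <cstdio>
#include <cuda_runtime.h>

// Stage 1 (hypothetical): per-pixel residual = current - predicted.
__global__ void residualKernel(const unsigned char* cur, const unsigned char* pred,
                               short* resid, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) resid[i] = (short)cur[i] - (short)pred[i];
}

// Stage 2 (hypothetical): a trivial stand-in for the transform/quantization
// stage that consumes the residual stream produced by stage 1.
__global__ void quantKernel(const short* resid, short* coeff, int n, int qstep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) coeff[i] = resid[i] / qstep;
}

int main() {
    const int n = 1920 * 1080;
    unsigned char *cur, *pred; short *resid, *coeff;
    cudaMallocManaged(&cur, n); cudaMallocManaged(&pred, n);
    cudaMallocManaged(&resid, n * sizeof(short));
    cudaMallocManaged(&coeff, n * sizeof(short));
    for (int i = 0; i < n; ++i) { cur[i] = i & 0xff; pred[i] = (i >> 1) & 0xff; }
    int threads = 256, blocks = (n + threads - 1) / threads;
    // The two SIMT kernels are chained through the device-resident residual
    // buffer: stage 2 reads what stage 1 wrote, with no host round trip.
    residualKernel<<<blocks, threads>>>(cur, pred, resid, n);
    quantKernel<<<blocks, threads>>>(resid, coeff, n, 8);
    cudaDeviceSynchronize();
    printf("coeff[0] = %d\n", coeff[0]);
    return 0;
}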
The Journal of Supercomputing | 2013
Jun Chai; Huayou Su; Mei Wen; Xing Cai; Nan Wu; Chunyuan Zhang
Bayesian inference is one of the most important methods for estimating phylogenetic trees in bioinformatics. Due to the potentially huge computational requirements, several parallel algorithms of Bayesian inference have been implemented to run on CPU-based clusters, multicore CPUs, or small clusters of CPUs and GPUs. To the best of our knowledge, however, none of the existing methods is able to simultaneously and fully utilize both CPUs and GPUs for the computations, leaving idle either the CPU part or the GPU part of modern heterogeneous supercomputers. Aiming at an optimized utilization of heterogeneous computing resources, a promising hardware architecture for future bioinformatics applications, we present a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, which combines MPI, OpenMP, and CUDA programming. The novelty of our algorithm, denoted oMC3, is its ability to use CPU cores simultaneously with GPUs for the computations, while ensuring a fair division of work between the two types of hardware components. We have implemented oMC3 based on MrBayes, one of the most popular software packages for Bayesian phylogenetic inference. Numerical experiments show that oMC3 obtains a 2.5× speedup over nMC3, a cutting-edge GPU implementation of MrBayes, on a single server consisting of two GPUs and sixteen CPU cores. Moreover, oMC3 scales nicely when 128 GPUs and 1536 CPU cores are in use.
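The essence of the hybrid scheme — CPU cores and a GPU simultaneously working on disjoint shares of per-site work — can be sketched as follows. This is a hypothetical, simplified illustration (a static 80/20 split over a toy per-site function, built with nvcc -Xcompiler -fopenmp), not the MrBayes-based oMC3 implementation.

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-site work, standing in for the phylogenetic
// likelihood evaluation at one alignment site.
__host__ __device__ inline double siteWork(double x) {
    return log(1.0 + x * x);
}

__global__ void gpuSites(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = siteWork(in[i]);
}

int main() {
    const int n = 1 << 20;
    const double gpuFraction = 0.8;            // assumed static work division
    const int nGpu = (int)(n * gpuFraction);   // sites assigned to the GPU
    double* hIn  = (double*)malloc(n * sizeof(double));
    double* hOut = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) hIn[i] = i * 1e-6;
    double *dIn, *dOut;
    cudaMalloc(&dIn,  nGpu * sizeof(double));
    cudaMalloc(&dOut, nGpu * sizeof(double));
    cudaMemcpy(dIn, hIn, nGpu * sizeof(double), cudaMemcpyHostToDevice);

    // Kernel launches are asynchronous: the GPU starts on its share ...
    gpuSites<<<(nGpu + 255) / 256, 256>>>(dIn, dOut, nGpu);

    // ... while the CPU cores process the remaining sites via OpenMP.
    #pragma omp parallel for
    for (int i = nGpu; i < n; ++i) hOut[i] = siteWork(hIn[i]);

    // Collect the GPU results (this memcpy waits for the kernel to finish).
    cudaMemcpy(hOut, dOut, nGpu * sizeof(double), cudaMemcpyDeviceToHost);
    printf("hOut[0]=%f hOut[n-1]=%f\n", hOut[0], hOut[n - 1]);
    free(hIn); free(hOut); cudaFree(dIn); cudaFree(dOut);
    return 0;
}

In the real algorithm the division would be tuned so both sides finish at about the same time; the static fraction here is purely illustrative.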
The Scientific World Journal | 2014
Huayou Su; Mei Wen; Nan Wu; Ju Ren; Chunyuan Zhang
By reorganizing the execution order and optimizing the data structures, we propose an efficient parallel framework for the H.264/AVC encoder on massively parallel architectures. We implemented the proposed framework with CUDA on an NVIDIA GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multi-resolution multi-window motion estimation, a multi-level parallel strategy that enhances the parallelism of intra coding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the H.264 encoder's workload is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program by a speedup of about 20x and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedups of the compute-intensive algorithms are proportional to the computational power of the GPU, whereas the performance of the control-intensive parts (CAVLC) is strongly tied to memory bandwidth, which offers an insight for new architecture designs.
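Of the methods listed, the direction-priority deblocking filter lends itself to a compact sketch: all vertical edges are filtered in one fully parallel pass, then all horizontal edges in a second pass. The CUDA fragment below is a hypothetical stand-in — the smoothing formula is a trivial placeholder, not the adaptive H.264 filter.

#include <cstdio>
#include <cuda_runtime.h>

// Pass 1: filter every vertical 4-pixel block boundary in parallel.
__global__ void filterVerticalEdges(unsigned char* img, int w, int h) {
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int e = blockIdx.x * blockDim.x + threadIdx.x;  // edge index
    int x = (e + 1) * 4;                            // boundary column
    if (y >= h || x >= w) return;
    unsigned char l = img[y * w + x - 1], r = img[y * w + x];
    img[y * w + x - 1] = (unsigned char)((3 * l + r + 2) >> 2);  // placeholder
    img[y * w + x]     = (unsigned char)((l + 3 * r + 2) >> 2);  // smoothing
}

// Pass 2: the same idea for horizontal boundaries, launched only after pass 1.
__global__ void filterHorizontalEdges(unsigned char* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int e = blockIdx.y * blockDim.y + threadIdx.y;
    int y = (e + 1) * 4;
    if (x >= w || y >= h) return;
    unsigned char t = img[(y - 1) * w + x], b = img[y * w + x];
    img[(y - 1) * w + x] = (unsigned char)((3 * t + b + 2) >> 2);
    img[y * w + x]       = (unsigned char)((t + 3 * b + 2) >> 2);
}

int main() {
    const int w = 1920, h = 1088;
    unsigned char* img;
    cudaMallocManaged(&img, w * h);
    for (int i = 0; i < w * h; ++i) img[i] = i & 0xff;
    dim3 t(16, 16);
    filterVerticalEdges<<<dim3((w / 4 + 15) / 16, (h + 15) / 16), t>>>(img, w, h);
    filterHorizontalEdges<<<dim3((w + 15) / 16, (h / 4 + 15) / 16), t>>>(img, w, h);
    cudaDeviceSynchronize();
    printf("img[3][4] = %d\n", img[3 * w + 4]);
    return 0;
}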
International Conference on Image and Graphics | 2011
Huayou Su; Nan Wu; Chunyuan Zhang; Mei Wen; Ju Ren
In this paper, we propose multi-level parallel intra coding for H.264/AVC based on the Compute Unified Device Architecture (CUDA). The proposed parallel algorithm improves the parallelism between 4x4 blocks within a macroblock (MB) by discarding a few prediction modes of negligible benefit. By partitioning a frame into multiple slices, the parallelism between MBs can also be exploited. In addition, a scalable parallel method for the kernels is introduced to improve the performance of the proposed intra coding. Experimental results show that a speedup of more than 20x can be achieved with the assistance of the GPU. Moreover, the entire encoder meets the real-time processing requirement for HDTV.
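A hypothetical CUDA sketch of the mode-pruning idea follows (invented names, not the paper's code): each thread evaluates only a retained subset of the nine 4x4 intra modes — here vertical, horizontal, and DC — for its block. For simplicity, the sketch predicts from original rather than reconstructed neighbor pixels, sidestepping the inter-block dependency that the paper's multi-level schedule actually resolves.

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// One thread per 4x4 block: evaluate only a pruned set of intra modes
// (0 = vertical, 1 = horizontal, 2 = DC) and keep the cheapest.
__global__ void intra4x4ModeSelect(const unsigned char* img, int w, int h,
                                   int* bestMode, int* bestCost) {
    int bx = blockIdx.x * blockDim.x + threadIdx.x;   // 4x4 block column
    int by = blockIdx.y * blockDim.y + threadIdx.y;   // 4x4 block row
    int nbx = w / 4, nby = h / 4;
    if (bx <= 0 || by <= 0 || bx >= nbx || by >= nby) return; // need neighbors
    int x0 = bx * 4, y0 = by * 4;
    const int retained[3] = {0, 1, 2};                // the pruned mode set
    int best = INT_MAX, bestM = -1;
    int dc = 0;                                       // DC from top + left pixels
    for (int k = 0; k < 4; ++k)
        dc += img[(y0 - 1) * w + x0 + k] + img[(y0 + k) * w + x0 - 1];
    dc = (dc + 4) >> 3;
    for (int m = 0; m < 3; ++m) {
        int mode = retained[m], cost = 0;
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x) {
                int pred = (mode == 0) ? img[(y0 - 1) * w + x0 + x]   // vertical
                         : (mode == 1) ? img[(y0 + y) * w + x0 - 1]   // horizontal
                         : dc;                                        // DC
                cost += abs((int)img[(y0 + y) * w + x0 + x] - pred);
            }
        if (cost < best) { best = cost; bestM = mode; }
    }
    bestMode[by * nbx + bx] = bestM;
    bestCost[by * nbx + bx] = best;
}

int main() {
    const int w = 64, h = 64, nbx = w / 4, nby = h / 4;
    unsigned char* img; int *mode, *cost;
    cudaMallocManaged(&img, w * h);
    cudaMallocManaged(&mode, nbx * nby * sizeof(int));
    cudaMallocManaged(&cost, nbx * nby * sizeof(int));
    for (int i = 0; i < w * h; ++i) img[i] = (i * 37) & 0xff;
    dim3 t(8, 8), g((nbx + 7) / 8, (nby + 7) / 8);
    intra4x4ModeSelect<<<g, t>>>(img, w, h, mode, cost);
    cudaDeviceSynchronize();
    printf("block (1,1): mode %d, cost %d\n", mode[nbx + 1], cost[nbx + 1]);
    return 0;
}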
International Conference on Cluster Computing | 2012
Mei Wen; Huayou Su; Wenjie Wei; Nan Wu; Xing Cai; Chunyuan Zhang
In cutting-edge CPU/GPU hybrid clusters, such as Tianhe-1A, the aggregate CPU computing capability may amount to up to 1/3 of the aggregate GPU computing capability. It thus goes without saying that the CPUs and GPUs should jointly carry out the computational work. However, using both hardware components effectively and simultaneously requires great care when developing the parallel implementations. The challenges include (1) finding a balanced division of the workload between the CPU and GPU sides, and (2) hiding various overheads by overlapping computations with CPU-GPU data transfers and/or MPI communications. We study these issues in the context of real-world sedimentary basin simulations. Numerical experiments show that an appropriately devised CPU-GPU hybrid implementation is able to handle a global mesh resolution of 131,072 × 131,072, and that a double-precision rate of 62 TFlops is achieved using 1024 GPUs and 12,288 CPU cores on Tianhe-1A. Such an extreme computing capability will be of great importance for carrying out high-resolution, continental-scale stratigraphic simulations in the future.
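The overlap of computation with CPU-GPU data transfers can be illustrated by the generic CUDA double-buffering pattern below — a hypothetical sketch, not the basin-simulation code: chunks of pinned host memory are cycled through two streams so the copy engines and the compute units are busy at the same time.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // stand-in for the simulation work
}

int main() {
    const int chunks = 4, chunkN = 1 << 20;
    float* h;
    cudaMallocHost(&h, (size_t)chunks * chunkN * sizeof(float)); // pinned memory
    float* d[2]; cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunkN * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int i = 0; i < chunks * chunkN; ++i) h[i] = (float)i;
    int threads = 256, blocks = (chunkN + threads - 1) / threads;
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                       // ping-pong between the two buffers
        float* hc = h + (size_t)c * chunkN;
        // The H2D copy, kernel, and D2H copy of chunk c are queued in stream b;
        // they overlap with the other stream's work on the previous chunk.
        cudaMemcpyAsync(d[b], hc, chunkN * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        compute<<<blocks, threads, 0, s[b]>>>(d[b], chunkN);
        cudaMemcpyAsync(hc, d[b], chunkN * sizeof(float), cudaMemcpyDeviceToHost, s[b]);
        // In a hybrid code, the MPI exchange for an earlier chunk could be
        // posted here, hiding network latency behind the GPU work.
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);
    return 0;
}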
5th International Conference on Embedded and Multimedia Computing | 2010
Ju Ren; Yi He; Huayou Su; Mei Wen; Nan Wu; Chunyuan Zhang
Intra prediction is the most compute-intensive component of the H.264 intra-frame coder. Its high computational cost puts huge pressure on most current embedded programmable processors, especially for real-time HD H.264 video encoding. The stream processing model, an emerging parallel processing model supported by GPUs and most programmable processors, bridges the gap between flexible programmable processors and high-performance special-purpose processors. This paper presents parallel streaming intra prediction on programmable processors. Two methods, 7-step block-parallel and macroblock-parallel, are designed to cope with the restrictions on parallelism. Experimental results for full HD H.264 encoding show that our parallel streaming intra prediction methods achieve clear speedups (more than 5x) on different programmable processors. Moreover, the entire streaming H.264 encoder with our parallel intra prediction achieves real-time performance when encoding full HD H.264 video sequences.
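The 7-step block-parallel method is plausibly an anti-diagonal wavefront over the 4x4 grid of 4x4 blocks inside a macroblock: if the modes that need the top-right neighbor are pruned (as in the companion intra-coding work above), a block at row r, column c depends only on neighbors with smaller r + c, so the seven diagonals r + c = 0..6 can be processed in order, with up to four independent blocks per step. A hypothetical CUDA sketch of that schedule, with simplified raster block numbering:

#include <cstdio>
#include <cuda_runtime.h>

// One thread per 4x4 block of one macroblock; the seven anti-diagonals are
// processed in order, with a barrier between steps.
__global__ void intraMacroblockWavefront(int* step) {
    int r = threadIdx.y, c = threadIdx.x;
    for (int s = 0; s <= 6; ++s) {
        if (r + c == s) {
            // Here the real code would predict and reconstruct block (r, c);
            // its left/top/top-left neighbors lie on diagonals < s, so they
            // are already done at this point.
            step[r * 4 + c] = s;            // record the step, for illustration
        }
        __syncthreads();                    // wavefront barrier between steps
    }
}

int main() {
    int* step;
    cudaMallocManaged(&step, 16 * sizeof(int));
    intraMacroblockWavefront<<<1, dim3(4, 4)>>>(step);
    cudaDeviceSynchronize();
    for (int r = 0; r < 4; ++r)             // print the 7-step schedule
        printf("%d %d %d %d\n", step[r * 4], step[r * 4 + 1],
               step[r * 4 + 2], step[r * 4 + 3]);
    return 0;
}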
International Supercomputing Conference | 2013
Huayou Su; Nan Wu; Mei Wen; Chunyuan Zhang; Xing Cai
Aiming at a close examination of the OpenCL performance myth, we study in this paper OpenCL implementations of several representative 3D stencil computations. We find that typical optimization techniques such as array padding, plane sweeping, and chunking give the OpenCL implementations performance boosts similar to those obtained in the corresponding CUDA programs. The key to good performance lies in maximizing the use of a GPU's on-chip resources, the same for both OpenCL and CUDA programming. In most cases, the achieved FLOPS rates on NVIDIA's Fermi and Kepler GPUs are fully comparable between the two programming alternatives. For four typical 3D stencil computations, the OpenCL implementations are on average 9% and 2% faster than their CUDA counterparts on the GTX 590 and the Tesla K20, respectively. At the moment, the only clear advantage of CUDA programming for stencil computations arises from CUDA's ability to use the read-only data cache on NVIDIA's Kepler GPUs. The skepticism about OpenCL's GPU performance thus seems unjustified for 3D stencil computations.
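Of the techniques named, plane sweeping is easy to sketch: each thread owns one (x, y) column of the 3D grid and marches along z, carrying the lower/current/upper values in registers so each input element is loaded from global memory only once per sweep. A hypothetical CUDA version for a 7-point stencil follows (array padding and chunking omitted; coefficients invented).

#include <cstdio>
#include <cuda_runtime.h>

__global__ void stencil7(const float* in, float* out, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    size_t plane = (size_t)nx * ny;
    size_t idx = (size_t)y * nx + x;
    float below = in[idx];            // value at z = 0
    float cur   = in[idx + plane];    // value at z = 1
    for (int z = 1; z < nz - 1; ++z) {
        float above = in[idx + (size_t)(z + 1) * plane];
        size_t c = idx + (size_t)z * plane;
        out[c] = 0.4f * cur
               + 0.1f * (below + above + in[c - 1] + in[c + 1]
                         + in[c - nx] + in[c + nx]);
        below = cur; cur = above;     // shift the register plane upward
    }
}

int main() {
    const int nx = 64, ny = 64, nz = 64;
    size_t n = (size_t)nx * ny * nz;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) in[i] = (float)(i % 7);
    dim3 t(16, 16), g((nx + 15) / 16, (ny + 15) / 16);
    stencil7<<<g, t>>>(in, out, nx, ny, nz);
    cudaDeviceSynchronize();
    printf("out at center: %f\n", out[(size_t)32 * nx * ny + 32 * nx + 32]);
    return 0;
}

The same kernel structure translates almost line for line into an OpenCL work-item function, which is consistent with the paper's finding that the two alternatives perform comparably.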
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Jun Chai; Johan Hake; Nan Wu; Mei Wen; Xing Cai; Glenn T. Lines; Jing Yang; Huayou Su; Chunyuan Zhang; Xiangke Liao
Numerical simulation of subcellular Ca2+ dynamics with a resolution down to one nanometre can be an important tool for discovering the physiological causes of many heart diseases. The enormous computational requirements, however, have so far made such simulations prohibitive. By using up to 12,288 Intel Xeon Phi 31S1P coprocessors on the hybrid cluster Tianhe-2, the world's number one supercomputer at the time, we achieved 1.27 Pflop/s in double precision, which brings us much closer to nanometre resolution. This is the result of using the hardware efficiently on different levels: (1) a single Xeon Phi, (2) a single compute node consisting of a host and three coprocessors, and (3) a huge number of interconnected nodes. To overcome the challenge of programming Intel's new many-integrated-core (MIC) architecture, we adopted techniques such as vectorization, hierarchical data blocking, register data reuse, offloading computations to the coprocessors, and pipelining computations with intra-/inter-node communications.
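Two of the listed techniques, data blocking and vectorization, can be sketched in plain C++ with OpenMP (a toy 1D update, not the paper's MIC code; the offload and pipelining layers are omitted): the loop is tiled into cache-sized blocks, and each block's inner loop is handed to the compiler's vectorizer.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20, blockSize = 4096;   // block chosen to fit in cache
    std::vector<float> u(n, 1.0f), v(n, 0.0f);
    #pragma omp parallel for                    // blocks spread across cores
    for (int b = 1; b < n - 1; b += blockSize) {
        int end = (b + blockSize < n - 1) ? b + blockSize : n - 1;
        #pragma omp simd                        // vectorize the inner loop
        for (int i = b; i < end; ++i)
            v[i] = 0.5f * u[i] + 0.25f * (u[i - 1] + u[i + 1]);
    }
    printf("v[1] = %f\n", v[1]);
    return 0;
}

On the Xeon Phi, where the 512-bit vector units dominate the peak performance, making such inner loops vectorizable is essential.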
Parallel and Distributed Computing: Applications and Technologies | 2012
Qiang Lan; Changqing Xun; Mei Wen; Huayou Su; Lifang Liu; Chunyuan Zhang
OpenCL provides a unified programming interface for various parallel computing platforms. The OpenCL framework offers good functional portability: programs can run without modification on any platform that supports OpenCL. However, most OpenCL programs are optimized for massively parallel processors such as GPUs, and it is hard to achieve good performance on general multi-core processors without sophisticated modification of these GPU-specific OpenCL programs. The major reason is the immense gap between CPU and GPU architectures. In this paper, we evaluate the performance portability of OpenCL programs between CPUs and GPUs, and analyse why GPU-specific OpenCL programs are a poor fit for CPUs. Based on profiling, we propose three strategies for improving the performance of GPU-specific OpenCL programs on CPUs: increasing the granularity of task partitioning, optimizing the use of the memory hierarchy, and block-based data access. We applied the proposed techniques to several benchmarks. The experimental results show that the optimized OpenCL programs achieve speedups of 2x to 4x on CPUs compared with their GPU-specific counterparts.
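The first strategy, increasing the granularity of task partitioning, amounts to having each work-item process many elements instead of one. The paper works in OpenCL; the hypothetical sketch below expresses the same contrast in CUDA syntax, where a grid-stride loop lets a small number of long-lived threads — the shape that suits a multi-core CPU — cover the whole index range.

#include <cstdio>
#include <cuda_runtime.h>

// Fine-grained GPU-style partition: one element per thread.
__global__ void scaleFine(const float* in, float* out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Coarse-grained partition: each thread strides over many elements, so the
// same kernel runs correctly with far fewer, heavier threads.
__global__ void scaleCoarse(const float* in, float* out, int n, float a) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    scaleFine<<<(n + 255) / 256, 256>>>(in, out, n, 2.0f);   // ~1M fine tasks
    scaleCoarse<<<8, 128>>>(in, out, n, 2.0f);               // ~1K coarse tasks
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);
    return 0;
}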