Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yaohua Wang is active.

Publication


Featured research published by Yaohua Wang.


High Performance Computing and Communications | 2010

Mapping of H.264/AVC Encoder on a Hierarchical Chip Multicore DSP Platform

Shenggang Chen; Shuming Chen; Huitao Gu; Hu Chen; Yaming Yin; Xiaowen Chen; Shuwei Sun; Sheng Liu; Yaohua Wang

The emergence of large-scale chip multicore processors makes an on-chip parallel H.264/AVC encoder with high parallelism feasible. To reduce the data reload frequency, a hierarchical chip multicore DSP platform with 64 DSP cores in total is designed to accommodate the computation- and data-intensive H.264/AVC encoder. To increase parallelism, macroblock-level parallelism is exploited in this paper and a wavefront algorithm is utilized. Centralized shared memory in the super nodes of this hierarchical DSP platform affords larger local space to hold frequently used data and reduces the bandwidth requirement. Subtask-level parallelism within motion estimation, intra prediction, and mode decision is further exploited to keep the DSP cores in a super node busy even when only one macroblock is assigned to the super node. Because of the lack of available macroblocks in the filling and emptying stages when encoding a frame, super nodes cannot be kept busy all the time, and speedups of 13, 24, 26, and 49 are achieved for QCIF, SIF, CIF, and HD sequences, respectively. To further improve the speedups and make the best use of the processor resources, frame-level parallelism should be exploited with a carefully tuned memory allocation policy.
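The wavefront macroblock schedule described above can be sketched briefly (a minimal Python model, not the authors' DSP implementation; the x + 2y readiness rule is the standard H.264 wavefront dependency pattern):

```python
def wavefront_schedule(mb_cols, mb_rows):
    # Macroblock (x, y) depends on its left (x-1, y) and
    # top-right (x+1, y-1) neighbours, so it becomes ready
    # at wavefront step x + 2*y.
    waves = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[t] for t in sorted(waves)]

# QCIF: 176x144 pixels -> 11x9 macroblocks of 16x16 pixels
waves = wavefront_schedule(11, 9)
peak_parallelism = max(len(w) for w in waves)
```

For a QCIF frame this yields 27 wavefront steps over 99 macroblocks, with at most 6 macroblocks ready at once; the short waves at the start and end are the filling and emptying stages that cap the achievable speedup.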


High-Performance Computer Architecture | 2013

A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments

Yaohua Wang; Shuming Chen; Jianghua Wan; Jiayuan Meng; Kai Zhang; Wei Liu; Xi Ning

The efficacy of widely used single instruction, multiple data architectures is often limited when handling divergent control flows and short vectors; both circumstances result in SIMD fragments that use only a subset of the available datapaths. This paper proposes a multiple SIMD, multiple data (MSMD) architecture with flexible SIMD datapaths that can be dynamically or statically repartitioned among multiple control flow paths, all executing simultaneously. The benefits are twofold: SIMD fragments resulting from divergent branches can execute in parallel, as can multiple kernels with short vectors. The resulting SIMD architecture achieves flexibility similar to that of a multiple instruction, multiple data architecture. We have both simulated the architecture and implemented a prototype. Our experiments with data-parallel benchmarks show that the architecture yields 60% performance gains with an area overhead of only 3.06%.


IEEE Micro | 2014

FT-Matrix: A Coordination-aware Architecture for Signal Processing

Shuming Chen; Yaohua Wang; Sheng Liu; Jianghua Wan; Haiyan Chen; Hengzhu Liu; Kai Zhang; Xiangyuan Liu; Xi Ning

Vector-SIMD architectures have gained increasing attention because of their high performance in signal-processing applications. However, the performance of existing vector-SIMD architectures remains limited because of their inefficiency in the coordinated exploitation of different hardware units. To solve this problem, this article proposes the FT-Matrix architecture, which improves the coordination of traditional vector-SIMD architectures in three respects: the cooperation between the scalar and SIMD units is refined with a dynamic coupling execution scheme, the communication among SIMD lanes is enhanced with matrix-style communication, and data sharing among vector memory banks is accomplished by an unaligned vector memory access scheme. Evaluation results show an average performance gain of 58.5 percent over vector-SIMD architectures without the proposed improvements. A four-core chip with each core built on the FT-Matrix architecture is also under fabrication.


IEICE Electronics Express | 2012

CMRF: a Configurable Matrix Register File for accelerating matrix operations on SIMD processors

Kai Zhang; Shuming Chen; Hu Chen; Yaohua Wang; Xiaowen Chen; Sheng Liu; Wei Liu

Mapping matrix operations onto SIMD processors brings a large amount of data rearrangement that decreases system performance. This paper proposes a Configurable Matrix Register File (CMRF) that supports both row-wise and column-wise accesses. The CMRF can be dynamically configured into different operating modes in which one or several sub-matrices can be accessed in parallel. Experimental results show that, compared with the traditional Vector Register File (VRF) and a conventional Matrix Register File (MRF), the CMRF achieves about 2.21x and 1.6x average performance improvement, respectively. Compared with the TMS320C64x+, our SIMD processor achieves about 5.65x to 7.71x performance improvement by employing the CMRF.
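The row-wise and column-wise access the CMRF provides can be illustrated with a toy software model (illustrative only; the actual CMRF is a configurable hardware register file):

```python
class MatrixRegisterFile:
    # Toy model of an n x n matrix register tile that can be read
    # either row-wise or column-wise, avoiding the explicit data
    # rearrangement a plain vector register file would need.
    def __init__(self, n):
        self.n = n
        self.regs = [[0] * n for _ in range(n)]

    def write_row(self, i, values):
        self.regs[i] = list(values)

    def read_row(self, i):
        return list(self.regs[i])

    def read_col(self, j):
        # On a conventional VRF, a column read costs n accesses
        # plus shuffle/transpose instructions.
        return [self.regs[i][j] for i in range(self.n)]

mrf = MatrixRegisterFile(4)
for i in range(4):
    mrf.write_row(i, range(4 * i, 4 * i + 4))
```

A column read returns one element from each row in a single call, which is the access pattern that dominates transposed-operand matrix kernels.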


International Conference on ASIC | 2011

Accelerating the data shuffle operations for FFT algorithms on SIMD DSPs

Kai Zhang; Shuming Chen; Sheng Liu; Yaohua Wang; Junhui Huang

FFT is a key kernel of OFDM in the 3GPP-LTE system. Many researchers employ SIMD DSPs to accelerate FFT algorithms, since about 75% of the workload is amenable to SIMD execution. This paper presents a detailed analysis of how to accelerate FFT algorithms on SIMD DSPs. We propose an EXC instruction for SIMD DSPs that can exchange the specified elements between two vector registers in one cycle. It achieves performance benefits ranging from 1.18x to 1.37x and reduces the dynamic code size by up to 15% compared with the vhalfup and vhalfdn instructions implemented in the VIRAM processor. Moreover, two useful suggestions are presented for designing architectures oriented to 3G/4G wireless communication systems.
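The element exchange the EXC instruction performs can be modelled behaviourally (a sketch only; the per-lane mask encoding is an assumption, not the paper's ISA definition):

```python
def exc(va, vb, mask):
    # For every lane where mask[i] is set, swap element i of the two
    # vector registers; the proposed hardware does this in one cycle.
    ra, rb = list(va), list(vb)
    for i, m in enumerate(mask):
        if m:
            ra[i], rb[i] = vb[i], va[i]
    return ra, rb

# Exchange of the upper halves, as needed between FFT butterfly stages
ra, rb = exc([0, 1, 2, 3, 4, 5, 6, 7],
             [8, 9, 10, 11, 12, 13, 14, 15],
             [0, 0, 0, 0, 1, 1, 1, 1])
```

With vhalfup/vhalfdn-style instructions the same data movement takes two instructions plus a merge, which is where the cycle and code-size savings come from.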


IEEE Computer Society Annual Symposium on VLSI | 2016

Mod (2P-1) Shuffle Memory-Access Instructions for FFTs on Vector SIMD DSPs

Sheng Liu; Haiyan Chen; Jianghua Wan; Yaohua Wang

The Binary Exchange Algorithm (BEA) always introduces excessive shuffle operations when mapping FFTs onto vector SIMD DSPs, which can greatly restrict overall performance. We propose a novel mod (2P-1) shuffle function and a Mod-BEA algorithm (MBEA), which halve the shuffle operation count and unify the shuffle mode. This unified shuffle mode inspires a set of novel mod (2P-1) shuffle memory-access instructions, which eliminate the shuffle operations entirely. Experimental results show that the combination of MBEA and the proposed instructions brings 17.2%-31.4% performance improvements at reasonable hardware cost and compresses the code size by about 30%.
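A mod (2^P - 1) style shuffle can be written as an index mapping. The sketch below uses the classic perfect-shuffle identity, where element i moves to position (2i) mod (N-1); whether this matches the paper's exact shuffle function is an assumption:

```python
def mod_shuffle(vec):
    # Perfect shuffle expressed in mod (N-1) form: element i moves to
    # position (2*i) mod (N-1); the last element stays in place.
    n = len(vec)            # n is assumed to be a power of two
    out = [0] * n
    for i in range(n - 1):
        out[(2 * i) % (n - 1)] = vec[i]
    out[n - 1] = vec[n - 1]
    return out

shuffled = mod_shuffle([0, 1, 2, 3, 4, 5, 6, 7])
```

The result interleaves the two halves of the input, which is the reordering FFT butterfly stages require; folding such a mapping into the memory-access instructions is what lets the explicit shuffle operations be eliminated.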


ACM Transactions on Architecture and Code Optimization | 2016

Iteration Interleaving-Based SIMD Lane Partition

Yaohua Wang; Dong Wang; Shuming Chen; Zonglin Liu; Shenggang Chen; Xiaowen Chen; Xu Zhou

The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows, which results in SIMD fragments using only a subset of the available lanes. We propose an iteration interleaving-based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions SIMD lanes into branch paths with comparable execution time. The benefits are twofold: SIMD fragments under divergent branches can execute in parallel, and the pathology of fragment starvation is largely eliminated. Our experiments show that IISLP doubles the performance of a baseline mechanism and provides a speedup of 28% over instruction shuffle.


IEICE Electronics Express | 2013

Breaking the performance bottleneck of sparse matrix-vector multiplication on SIMD processors

Kai Zhang; Shuming Chen; Yaohua Wang; Jianghua Wan

The low utilization of SIMD units and memory bandwidth is the main performance bottleneck on SIMD processors for sparse matrix-vector multiplication (SpMV), one of the most important kernels in many scientific and engineering applications. This paper proposes a hybrid optimization method to break the performance bottleneck of SpMV on SIMD processors. The method includes a new compressed sparse matrix format, a block SpMV algorithm, and a vector write buffer. Experimental results show that our hybrid optimization method achieves an average speedup of 2.09 over the CSR vector kernel across all matrices, with a maximum speedup of 3.24.
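The CSR baseline the abstract compares against stores each row's nonzeros contiguously; a minimal scalar version of the kernel (the standard method, not the paper's optimized implementation) looks like this:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    # y[r] = sum of values[k] * x[col_idx[k]] over row r's nonzeros,
    # stored at positions row_ptr[r] .. row_ptr[r+1]-1.
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
cols = [0, 2, 1, 0, 2]
ptrs = [0, 2, 3, 5]
y = spmv_csr(vals, cols, ptrs, [1.0, 1.0, 1.0])
```

The irregular row lengths visible in row_ptr are exactly what leaves SIMD lanes and memory bandwidth underused, motivating the blocked format and vector write buffer.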


IEEE Computer Society Annual Symposium on VLSI | 2016

Dynamic Per-Warp Reconvergence Stack for Efficient Control Flow Handling in GPUs

Yaohua Wang; Xiaowen Chen; Dong Wang; Sheng Liu

GPGPUs usually experience performance degradation when the control flow of threads in a warp diverges. A reconvergence-stack-based control flow handling scheme is widely adopted in GPU architectures. The depth of such a stack is always set to a large number so that there are enough entries for warps experiencing nested branches. However, for warps experiencing simple branches or no branches at all, these deep reconvergence stacks stay idle, causing a serious waste of hardware resources. Moreover, as GPU architectures evolve and more warps are deployed on each GPU stream processor core, this problem could become even more serious. To solve it, this paper proposes a dynamic reconvergence stack structure in which a stack pool is shared by all warps, and stacks for different warps are constructed dynamically according to run-time requirements. This satisfies the stack requirement while eliminating unnecessary waste of hardware resources. Our experiments show that the dynamic reconvergence stack can reduce the stack cost by 50% while maintaining conventional performance.
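The shared stack pool idea can be sketched in a few lines (an illustrative software model under assumed semantics, not the paper's hardware design):

```python
class ReconvergenceStackPool:
    # Instead of a fixed deep stack per warp, all warps allocate
    # reconvergence entries from one shared pool, so warps with
    # little or no divergence consume no entries.
    def __init__(self, capacity):
        self.free = capacity
        self.stacks = {}                 # warp id -> list of entries

    def push(self, warp, entry):
        if self.free == 0:
            raise MemoryError("stack pool exhausted")
        self.free -= 1
        self.stacks.setdefault(warp, []).append(entry)

    def pop(self, warp):
        entry = self.stacks[warp].pop()
        self.free += 1
        return entry

pool = ReconvergenceStackPool(capacity=8)
pool.push(0, ("reconv_pc", "active_mask"))   # warp 0 diverges once
```

A divergence-free warp never calls push, so its share of the pool is available to warps with deeply nested branches, which is how the design halves the total stack cost.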


IEICE Electronics Express | 2015

B-SCT: Improve SpMV processing on SIMD architectures

Yaohua Wang; Dong Wang; Xu Zhou

Sparse matrix-vector multiplication (SpMV) represents the dominant cost in sparse linear algebra. However, sparse matrices exhibit inherent irregularity in both the number and the distribution of nonzero values. This hampers the tremendous potential of Single Instruction Multiple Data (SIMD) architectures, which are widely adopted in today's data-parallel processors. To improve the performance of SpMV, we propose the Balanced SCT (B-SCT) method, whose cornerstones are a balance-aware compression scheme and an on-the-fly data reorder structure. Our simulation results show that the B-SCT method provides an average speed-up of 130% over the commonly used CSR method and 83% over the SIMD-oriented SCT method.

Collaboration


Dive into Yaohua Wang's collaborations.

Top Co-Authors

Shuming Chen (National University of Defense Technology)
Sheng Liu (National University of Defense Technology)
Jianghua Wan (National University of Defense Technology)
Kai Zhang (National University of Defense Technology)
Hu Chen (National University of Defense Technology)
Shenggang Chen (National University of Defense Technology)
Xiaowen Chen (National University of Defense Technology)
Haiyan Chen (National University of Defense Technology)
Xi Ning (National University of Defense Technology)
Shuwei Sun (National University of Defense Technology)