Publication


Featured research published by Jianghua Wan.


High-Performance Computer Architecture | 2013

A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments

Yaohua Wang; Shuming Chen; Jianghua Wan; Jiayuan Meng; Kai Zhang; Wei Liu; Xi Ning

The efficacy of widely used single instruction, multiple data (SIMD) architectures is often limited when handling divergent control flows and short vectors; both circumstances result in SIMD fragments that use only a subset of the available datapaths. This paper proposes a multiple SIMD, multiple data (MSMD) architecture with flexible SIMD datapaths that can be dynamically or statically repartitioned among multiple control flow paths, all executing simultaneously. The benefits are twofold: SIMD fragments resulting from divergent branches can execute in parallel, as can multiple kernels with short vectors. The resulting SIMD architecture achieves flexibility similar to that of a multiple instruction, multiple data architecture. We have both simulated the architecture and implemented a prototype. Our experiments with data-parallel benchmarks show that the architecture yields 60% performance gains with an area overhead of only 3.06%.
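
The core problem the abstract describes, SIMD fragments created by divergent branches, can be illustrated with a minimal sketch in plain C. This is not the MSMD hardware itself; it only shows how a data-dependent branch inside a data-parallel loop splits the iterations into two fragments that a conventional SIMD machine must serialize under lane masks, whereas an MSMD machine could map each fragment onto its own dynamically partitioned group of lanes. The 8-lane width, the data, and the threshold are illustrative assumptions.

```c
/* Illustrative sketch only: shows the divergent-branch pattern that creates
 * SIMD fragments. Lane width (8) and the branch condition are assumptions,
 * not details taken from the MSMD paper. */
#include <stdio.h>

#define LANES 8

int main(void) {
    int data[LANES] = {3, 9, 1, 12, 7, 15, 2, 8};
    int taken[LANES], not_taken[LANES];
    int n_taken = 0, n_not = 0;

    /* A data-dependent branch splits the lanes into two fragments. */
    for (int i = 0; i < LANES; i++) {
        if (data[i] > 6)           /* "taken" fragment     */
            taken[n_taken++] = i;
        else                       /* "not taken" fragment */
            not_taken[n_not++] = i;
    }

    /* A conventional SIMD machine executes both fragments one after the
     * other under lane masks, so each pass uses only part of the datapath.
     * The MSMD idea is to repartition the datapath so that both fragments
     * (or several short-vector kernels) can run simultaneously. */
    printf("fragment A uses %d of %d lanes, fragment B uses %d of %d lanes\n",
           n_taken, LANES, n_not, LANES);
    return 0;
}
```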


IEEE Micro | 2014

FT-Matrix: A Coordination-aware Architecture for Signal Processing

Shuming Chen; Yaohua Wang; Sheng Liu; Jianghua Wan; Haiyan Chen; Hengzhu Liu; Kai Zhang; Xiangyuan Liu; Xi Ning

Vector-SIMD architectures have gained increasing attention because of their high performance in signal-processing applications. However, the performance of existing vector-SIMD architectures remains limited by their inefficiency in the coordinated exploitation of different hardware units. To solve this problem, this article proposes the FT-Matrix architecture, which improves the coordination of traditional vector-SIMD architectures in three respects: cooperation between the scalar and SIMD units is refined with a dynamic coupling execution scheme, communication among SIMD lanes is enhanced with matrix-style communication, and data sharing among vector memory banks is enabled by an unaligned vector memory access scheme. Evaluation results show an average performance gain of 58.5 percent over vector-SIMD architectures without the proposed improvements. A four-core chip, with each core built on the FT-Matrix architecture, is also under fabrication.
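
Of the three coordination mechanisms, unaligned vector memory access is the easiest to illustrate in software. The sketch below is only an analogy under stated assumptions: it emulates an unaligned 4-element vector load by combining two aligned loads with an element shift, the classic way aligned-only SIMD memories serve unaligned requests. The abstract does not describe the FT-Matrix hardware scheme at this level of detail, so the vector length, layout, and helper functions are all hypothetical.

```c
/* Sketch (not the FT-Matrix implementation): emulate an unaligned 4-element
 * vector load from an aligned-only memory by combining two aligned loads.
 * Vector length and memory layout are illustrative assumptions. */
#include <stdio.h>

#define VLEN 4

/* Aligned load: base index must be a multiple of VLEN. */
static void aligned_load(const int *mem, int base, int out[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        out[i] = mem[base + i];
}

/* Unaligned load built from two aligned loads plus an element shift. */
static void unaligned_load(const int *mem, int start, int out[VLEN]) {
    int lo[VLEN], hi[VLEN];
    int base   = (start / VLEN) * VLEN;   /* aligned block containing start  */
    int offset = start - base;            /* element offset within the block */

    aligned_load(mem, base, lo);
    aligned_load(mem, base + VLEN, hi);
    for (int i = 0; i < VLEN; i++)
        out[i] = (offset + i < VLEN) ? lo[offset + i] : hi[offset + i - VLEN];
}

int main(void) {
    int mem[16];
    for (int i = 0; i < 16; i++) mem[i] = i;

    int v[VLEN];
    unaligned_load(mem, 6, v);            /* starts mid-block */
    printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);  /* prints 6 7 8 9 */
    return 0;
}
```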


IEEE Computer Society Annual Symposium on VLSI | 2016

Mod (2^P-1) Shuffle Memory-Access Instructions for FFTs on Vector SIMD DSPs

Sheng Liu; Haiyan Chen; Jianghua Wan; Yaohua Wang

The Binary Exchange Algorithm (BEA) introduces excessive shuffle operations when mapping FFTs onto vector SIMD DSPs, which can greatly restrict overall performance. We propose a novel mod (2^P-1) shuffle function and a Mod-BEA algorithm (MBEA), which halve the shuffle operation count and unify the shuffle mode. This unified shuffle mode in turn motivates a set of novel mod (2^P-1) shuffle memory-access instructions, which eliminate the shuffle operations entirely. Experimental results show that the combination of MBEA and the proposed instructions brings 17.2%-31.4% performance improvements at reasonable hardware cost and compresses the code size by about 30%.
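
The abstract does not define the shuffle function itself, but a mod (2^P-1) index mapping naturally matches the classic perfect-shuffle permutation used in binary-exchange FFTs, where element i of an array of N = 2^P elements moves to position (2*i) mod (N-1) and the first and last elements stay in place. The sketch below only illustrates that mapping; treating it as the flavor of function the instructions implement is an assumption, not code from the paper.

```c
/* Sketch of a mod (2^P - 1) shuffle index mapping (assumed here to follow the
 * classic perfect-shuffle permutation; not the paper's exact definition).
 * For N = 2^P elements, element i moves to position (2*i) mod (N-1); the
 * first and last elements stay fixed. */
#include <stdio.h>

#define P 3
#define N (1 << P)          /* 8 elements */

static void mod_shuffle(const int *src, int *dst) {
    dst[0] = src[0];
    dst[N - 1] = src[N - 1];
    for (int i = 1; i < N - 1; i++)
        dst[(2 * i) % (N - 1)] = src[i];
}

int main(void) {
    int src[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    int dst[N];

    mod_shuffle(src, dst);
    for (int i = 0; i < N; i++)
        printf("%d ", dst[i]);   /* 0 4 1 5 2 6 3 7 : a perfect shuffle */
    printf("\n");
    return 0;
}
```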


IEICE Electronics Express | 2013

Breaking the performance bottleneck of sparse matrix-vector multiplication on SIMD processors

Kai Zhang; Shuming Chen; Yaohua Wang; Jianghua Wan

The low utilization of SIMD units and memory bandwidth is the main performance bottleneck of sparse matrix-vector multiplication (SpMV) on SIMD processors; SpMV is one of the most important kernels in many scientific and engineering applications. This paper proposes a hybrid optimization method to break this bottleneck. The method comprises a new compressed sparse matrix format, a block SpMV algorithm, and a vector write buffer. Experimental results show that our hybrid optimization method achieves an average speedup of 2.09 over the CSR vector kernel across all tested matrices, with a maximum speedup of 3.24.
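
The paper's own compressed format and block algorithm are not spelled out in the abstract, so the sketch below only shows the CSR baseline it is compared against (the "CSR vector kernel", reduced here to scalar C) to make the starting point of the reported 2.09x average speedup concrete. The small 4x4 example matrix is an illustrative assumption.

```c
/* Baseline CSR sparse matrix-vector multiply (y = A*x), the kind of kernel
 * the paper's hybrid format/block algorithm is compared against. The 4x4
 * example matrix is an illustrative assumption. */
#include <stdio.h>

static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y) {
    for (int r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[r] = sum;
    }
}

int main(void) {
    /* 4x4 matrix:  [10  0  0  2]
     *              [ 0  3  0  0]
     *              [ 0  0  5  1]
     *              [ 4  0  0  7]   stored in CSR form. */
    int row_ptr[] = {0, 2, 3, 5, 7};
    int col_idx[] = {0, 3, 1, 2, 3, 0, 3};
    double val[]  = {10, 2, 3, 5, 1, 4, 7};
    double x[]    = {1, 1, 1, 1};
    double y[4];

    spmv_csr(4, row_ptr, col_idx, val, x, y);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 12 3 6 11 */
    return 0;
}
```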


Asia and South Pacific Design Automation Conference | 2011

Design and chip implementation of a heterogeneous multi-core DSP

Shuming Chen; Xiaowen Chen; Yi Xu; Jianghua Wan; Jianzhuang Lu; Xiangyuan Liu; Shenggang Chen

This paper presents a novel heterogeneous multi-core digital signal processor, named YHFT-QDSP, hosting one RISC CPU core and four VLIW DSP cores. The CPU core is responsible for task scheduling and management, while the DSP cores take charge of accelerating data processing. The YHFT-QDSP provides three kinds of interconnect: one for intra-chip communication between the CPU core and the four DSP cores, and two others for both intra-chip and inter-chip communication among the DSP cores. The YHFT-QDSP is implemented in SMIC® 130nm LVT CMOS technology and runs [email protected] with a 114.49 mm² die area.


IEEE Computer Society Annual Symposium on VLSI | 2015

Achieving Memory Access Equalization Via Round-Trip Routing Latency Prediction in 3D Many-Core NoCs

Xiaowen Chen; Zhonghai Lu; Yang Li; Axel Jantsch; Xueqian Zhao; Shuming Chen; Yang Guo; Zonglin Liu; Jianzhuang Lu; Jianghua Wan; Shuwei Sun; Shenggang Chen; Hu Chen

3D many-core NoCs are emerging architectures for future high-performance single chips because they integrate many processor cores and memories by stacking multiple layers. In such architectures, processor cores and memories reside at different locations (center, corner, edge, etc.), so memory accesses behave differently due to their different communication distances, and the latency gap between different memory accesses grows as the network size scales up. This phenomenon can leave some memory accesses with very high latencies, degrading system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this must be done with care, since shortening the latency of one memory access can worsen the latency of another because of shared network resources. The goal should therefore be to narrow the latency differences among memory accesses. In this paper, we address this goal by prioritizing memory access packets based on predicted round-trip routing latencies. The communication distance and the number of occupied buffer slots along the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip latency is then used to arbitrate memory access packets, so that accesses with potentially high latency are transferred as early and as fast as possible, equalizing memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates show that our approach achieves memory access equalization and outperforms classic round-robin arbitration in terms of maximum latency, average latency, and LSD. In the experiments, the maximum improvements in maximum latency, average latency, and LSD are 80%, 14%, and 45%, respectively.
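
The abstract names the two predictor inputs (remaining communication distance and buffer occupancy along the remaining path) but not the exact formula, so the sketch below uses an assumed linear combination to rank memory-access packets: the packet with the largest predicted round-trip latency wins arbitration. The weights, cycle constants, and packet fields are illustrative assumptions, not the paper's.

```c
/* Sketch of round-trip-latency-based arbitration (assumed linear predictor;
 * the paper's exact formula and weights are not given in the abstract). */
#include <stdio.h>

struct packet {
    int id;
    int remaining_hops;      /* communication distance still to travel         */
    int buffered_items;      /* occupied buffer slots along the remaining path */
};

/* Predicted round-trip latency: hop delay plus queuing delay.
 * HOP_CYCLES and QUEUE_CYCLES are illustrative constants. */
#define HOP_CYCLES   3
#define QUEUE_CYCLES 1

static int predict_latency(const struct packet *p) {
    return 2 * HOP_CYCLES * p->remaining_hops      /* request + reply traversal */
         + QUEUE_CYCLES * p->buffered_items;       /* congestion on the path    */
}

/* Arbitration: grant the packet with the largest predicted latency so that
 * potentially slow accesses are forwarded as early as possible. */
static int arbitrate(const struct packet *pkts, int n) {
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (predict_latency(&pkts[i]) > predict_latency(&pkts[winner]))
            winner = i;
    return winner;
}

int main(void) {
    struct packet pkts[] = {
        {0, 2, 1},   /* close target, little congestion */
        {1, 6, 4},   /* far target, congested path      */
        {2, 4, 0},
    };
    int w = arbitrate(pkts, 3);
    printf("grant packet %d (predicted latency %d cycles)\n",
           w, predict_latency(&pkts[w]));
    return 0;
}
```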


International Symposium on Circuits and Systems | 2013

Redefining the relationship between scalar and parallel units in SIMD architectures

Yaohua Wang; Shuming Chen; Jianghua Wan; Kai Zhang

SIMD architectures, comprising both scalar and parallel units, have been widely used in media processors. To further improve performance, much effort has gone into enhancing the design of both units, while little attention has been paid to the relationship between them. This paper demonstrates that a dynamic coupling mechanism, which can dynamically switch the scalar and parallel units between loosely and tightly coupled relationships, achieves a better match between existing SIMD architectures and media applications. The evaluation results show that the dynamic coupling mechanism achieves an average performance gain of 33.2% at an additional area cost of 4.57%.


IEICE Electronics Express | 2012

A cost conscious performance model for media processors

Yaohua Wang; Shuming Chen; Kai Zhang; Jianghua Wan; Hu Chen; Sheng Liu; Xi Ning

The combination of multi-core, SIMD, and VLIW schemes is becoming prevalent in today's media processor architectures. To gain deeper insight into this trend, we propose a power-conscious performance model based on the rationale of Hill and Marty's model. Several representative media application kernels are evaluated with the proposed model. The evaluation shows that for applications without communication, a large number of small cores achieves optimal performance, whereas for applications with communication, architectures with fewer but larger cores are preferred. Meanwhile, increasing the SIMD width yields better power efficiency for both types of applications at a small loss of performance.
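
Hill and Marty's model, which the abstract builds on, estimates multicore speedup from a fixed budget of n base-core equivalents (BCEs): a core built from r BCEs delivers single-core performance perf(r), commonly taken as sqrt(r). The sketch below evaluates that symmetric-multicore formula for a few core sizes; it is only the published Hill-Marty baseline, not the paper's extended power/SIMD model, and the parallel fraction f and budget n are illustrative assumptions.

```c
/* Hill & Marty's symmetric-multicore speedup model (the baseline the paper's
 * power-conscious model extends). The perf(r) = sqrt(r) rule follows the
 * original model; the concrete f and n below are illustrative. */
#include <math.h>
#include <stdio.h>

/* Performance of one core built from r base-core equivalents (BCEs). */
static double perf(double r) { return sqrt(r); }

/* Speedup of a chip with n BCEs split into cores of r BCEs each,
 * running a workload whose parallel fraction is f. */
static double speedup_symmetric(double f, double n, double r) {
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
}

int main(void) {
    double f = 0.95, n = 256;               /* illustrative assumptions */
    for (double r = 1; r <= 256; r *= 4)
        printf("r = %6.0f BCEs/core -> speedup %.1f\n",
               r, speedup_symmetric(f, n, r));
    return 0;
}
```

Sweeping r shows the trade-off the abstract describes: many small cores help highly parallel (communication-free) workloads, while larger cores pay off when the serial or communication-bound portion dominates.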


IEICE Electronics Express | 2011

LP2D: a novel low-power 2D memory for sliding-window applications in vector DSPs

Sheng Liu; Shuming Chen; Xi Ning; Jianghua Wan; Hu Chen; Kai Zhang; Yaohua Wang

The power consumption of 2D memory restricts its use in vector DSPs for sliding-window applications with irregular memory accesses. This paper introduces a novel low-power 2D memory (LP2D), which effectively reduces power consumption relative to the traditional 2D memory without sacrificing performance. Based on a theoretical analysis, we design an adjacent address checker that generates a bank control mask to turn off some power-sensitive circuits of the 2D memory. Experimental results show that the LP2D reduces power consumption by 31.7%∼62.9% with less than 1.3% additional hardware cost, compared with traditional 2D memory schemes.
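
The abstract's key mechanism is an adjacent address checker that builds a bank control mask so that circuits not needed by a vector access can be gated off. The sketch below is a purely behavioral guess at that idea: it computes which banks a vector access actually touches, so that untouched banks can stay powered down and adjacent lanes hitting the same bank contribute to a single enable bit. The bank count, address interleaving, and mask semantics are all assumptions, not the paper's circuit.

```c
/* Behavioral sketch of an adjacent-address checker (an assumption about the
 * LP2D idea, not the paper's circuit): build a per-bank enable mask for one
 * vector access so that banks no lane touches can stay powered down. */
#include <stdio.h>

#define LANES 8
#define BANKS 8

/* Assumed low-order interleaving: bank = address mod BANKS. */
static unsigned bank_mask(const unsigned addr[LANES]) {
    unsigned mask = 0;
    for (int i = 0; i < LANES; i++) {
        unsigned bank = addr[i] % BANKS;
        mask |= 1u << bank;      /* enable a bank only if some lane needs it */
    }
    return mask;
}

int main(void) {
    /* Sliding-window access: adjacent lanes often fall in identical or
     * neighboring banks, so only a few banks need to be enabled. */
    unsigned addr[LANES] = {16, 17, 17, 18, 18, 19, 19, 20};
    unsigned mask = bank_mask(addr);

    printf("bank enable mask = 0x%02x\n", mask);   /* 0x1f: banks 0..4 only */
    return 0;
}
```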


Archive | 2012

Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)

Zhong Liu; Shuming Chen; Hengzhu Liu; Junhui Huang; Yueyue Chen; Guohui Gong; Haiyan Chen; Yongjie Sun; Jianghua Wan

Collaboration


Dive into Jianghua Wan's collaborations.

Top Co-Authors

Shuming Chen (National University of Defense Technology)
Yaohua Wang (National University of Defense Technology)
Haiyan Chen (National University of Defense Technology)
Hengzhu Liu (National University of Defense Technology)
Kai Zhang (National University of Defense Technology)
Sheng Liu (National University of Defense Technology)
Jianzhuang Lu (National University of Defense Technology)
Shuwei Sun (National University of Defense Technology)
Zonglin Liu (National University of Defense Technology)
Shenggang Chen (National University of Defense Technology)