Shenggang Chen
National University of Defense Technology
Publications
Featured research published by Shenggang Chen.
High Performance Computing and Communications | 2010
Shenggang Chen; Shuming Chen; Huitao Gu; Hu Chen; Yaming Yin; Xiaowen Chen; Shuwei Sun; Sheng Liu; Yaohua Wang
The emergence of large-scale chip multicore processors makes an on-chip parallel H.264/AVC encoder with high parallelism feasible. To reduce the data reload frequency, a hierarchical chip multicore DSP platform with 64 DSP cores in total is designed to accommodate the computation- and data-intensive H.264/AVC encoder. To increase parallelism, macroblock-level parallelism is exploited in this paper using a wavefront algorithm. Centralized shared memory in the super nodes of this hierarchical DSP platform affords larger local space to hold frequently used data and reduce the bandwidth requirement. Subtask-level parallelism within motion estimation, intra prediction, and mode decision is further exploited to keep the DSP cores in a super node busy even when only one macroblock is assigned to the super node. Because of the lack of available macroblocks during the filling and emptying stages of encoding a frame, the super nodes cannot be kept busy all the time, and speedups of 13, 24, 26, and 49 are achieved for QCIF, SIF, CIF, and HD sequences, respectively. To further improve the speedups and make the best use of the processor resources, frame-level parallelism should be exploited with a carefully tuned memory allocation policy.
IEEE Transactions on Circuits and Systems for Video Technology | 2010
Shenggang Chen; Shuming Chen; Shuwei Sun
Due to its high computational complexity and poor parallelism, the context-based adaptive binary arithmetic coder (CABAC) increasingly poses a bottleneck in large-scale parallel video encoders such as H.264 on manycore platforms. Motivated by the organization of the context models in CABAC, this letter presents a tri-thread parallel evolution of CABAC. The evolved coder, named P3-CABAC, statically divides syntax elements into three predefined groups, each of which forms a parallel thread. Since P3-CABAC is thread-level parallelizable and implementation-friendly for manycore processors, it is a rational choice for entropy coding in block-based video encoders in the manycore era.
Computers & Electrical Engineering | 2013
Xiaowen Chen; Zhonghai Lu; Axel Jantsch; Shuming Chen; Shenggang Chen; Huitao Gu
In Network-on-Chip (NoC) based multicore platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs a performance penalty due to the Virtual-to-Physical (V2P) address translation overhead on every memory access. Based on the property of data, which can be either private or shared, this paper proposes a hybrid DSM that partitions a local memory into a private part and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for data with changeable property once it becomes private at runtime. In experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show a performance improvement of up to 37.89%.
Parallel and Distributed Computing: Applications and Technologies | 2010
Huitao Gu; Shuming Chen; Shuwei Sun; Shenggang Chen; Xiaowen Chen
Motion estimation is the most computationally intensive part of a typical video encoder. This paper introduces a novel prediction method based on multiple search centers, which effectively improves prediction accuracy. According to the number and the magnitude of the predictive search centers, the search range is dynamically adjusted in order to reduce computational complexity. Based on multiple-search-center prediction and dynamic search range adjustment, a new fast algorithm is proposed. Experimental results show that, on average, the proposed algorithm attains rate-distortion performance similar to the original FFS, UMHexagonS, and EPZS algorithms in the H.264 reference software, while reducing computational complexity by about 97.4%, 63.0%, and 42.9%, respectively.
Journal of Electrical and Computer Engineering | 2015
Xiaowen Chen; Zhonghai Lu; Axel Jantsch; Shuming Chen; Yang Guo; Shenggang Chen; Hu Chen
On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, referred to in this paper as On-chip Large-scale Parallel Computing Architectures (OLPCs). Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have parallel data subsets that are handled individually by the same program running on different cores, and are therefore able to obtain good speedup on homogeneous OLPCs. This paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. In establishing the speedup performance model, the network communication latency and the ways data-parallel applications store their data are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. Uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the analysis of the performance models. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
International Symposium on Signals, Systems and Electronics | 2010
Shenggang Chen; Shuming Chen; Yao Liu; Huitao Gu
This paper proposes the algorithm and hardware architecture of a new CABAC-based parallel arithmetic entropy coder for on-chip large-scale parallel H.264 video coders. The proposed entropy coder partitions syntax elements into three independent groups according to their semantic dependencies, and processes them in three parallel threads, each of which is implemented as a fully pipelined hardware accelerator driven by Syntax Element Instructions. To preserve decodability, each thread of the proposed entropy coder handles its syntax elements in the same way as CABAC. This orthogonality allows previous symbol-level parallel techniques to be applied within each thread to further improve throughput; however, it also constrains the partitioning of the syntax elements and thus the obtainable speedups. Simulation and physical implementation results show that the proposed coder can double its throughput at the expense of about 60% more transistors than CABAC, while keeping the advantage of arithmetic coding over other methods.
ACM Transactions on Architecture and Code Optimization | 2016
Yaohua Wang; Dong Wang; Shuming Chen; Zonglin Liu; Shenggang Chen; Xiaowen Chen; Xu Zhou
The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows, which results in SIMD fragments using only a subset of the available lanes. We propose an iteration-interleaving-based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions SIMD lanes into branch paths with comparable execution time. The benefits are twofold: SIMD fragments under divergent branches can execute in parallel, and the pathology of fragment starvation is largely eliminated. Our experiments show that IISLP doubles the performance of a baseline mechanism and provides a speedup of 28% over instruction shuffle.
Asia and South Pacific Design Automation Conference | 2011
Shuming Chen; Xiaowen Chen; Yi Xu; Jianghua Wan; Jianzhuang Lu; Xiangyuan Liu; Shenggang Chen
This paper presents a novel heterogeneous multicore Digital Signal Processor, named YHFT-QDSP, hosting one RISC CPU core and four VLIW DSP cores. The CPU core is responsible for task scheduling and management, while the DSP cores take charge of speeding up data processing. The YHFT-QDSP provides three kinds of interconnection communication: one for inner-chip communication between the CPU core and the four DSP cores, and the other two for both inner-chip and inter-chip communication among the DSP cores. The YHFT-QDSP is implemented in SMIC 130 nm LVT CMOS technology and can run [email protected] with a 114.49 mm² die area.
IEEE Computer Society Annual Symposium on VLSI | 2015
Xiaowen Chen; Zhonghai Lu; Yang Li; Axel Jantsch; Xueqian Zhao; Shuming Chen; Yang Guo; Zonglin Liu; Jianzhuang Lu; Jianghua Wan; Shuwei Sun; Shenggang Chen; Hu Chen
3D many-core NoCs are emerging architectures for future high-performance single chips due to their integration of many processor cores and memories by stacking multiple layers. In such an architecture, because processor cores and memories reside in different locations (center, corner, edge, etc.), memory accesses behave differently due to their different communication distances, and the performance (latency) gap between different memory accesses widens as the network size scales up. This phenomenon can lead to very high latencies for some memory accesses, degrading system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this should be done with care, since shortening the latency of one memory access can worsen the latency of another as a result of shared network resources. Therefore, the goal should be to narrow the latency difference between memory accesses. In this paper, we address the goal by prioritizing memory access packets based on predicted round-trip routing latencies. The communication distance and the number of occupied items in the buffers along the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip routing latency is then used as the basis for arbitrating memory access packets, so that accesses with potentially high latency are transferred as early and as fast as possible, equalizing memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates show that our approach achieves memory access equalization and outperforms classic round-robin arbitration in terms of maximum latency, average latency, and LSD. In the experiments, the maximum improvements in maximum latency, average latency, and LSD are 80%, 14%, and 45%, respectively.
Networking, Architecture and Storage | 2011
Sheng Liu; Shuming Chen; Haiyan Chen; Shenggang Chen; Hu Chen
This paper presents a Vector DMA Cache (VDC) scheme between the DMA bus and the Vector Memory (VM) in vector DSPs. The VDC can effectively reduce the VM access counts from DMA requests and decrease VM access conflicts. The VDC is designed specifically for the DMA rather than the CPU, so it uses some unique techniques that differ from traditional CPU cache schemes: separate read and write caches, a full-line auto-updating policy, and software cache coherence. Experimental results show that a single-port VM plus the VDC lets programs reach more than 95% execution efficiency with only 44.1% and 51.4% of the chip area and power cost, respectively, compared with the dual-port VM scheme. Moreover, the single-port VM plus the VDC can reduce program execution cycles by 3.7%–21.5% with only an additional 7.3% and 6.3% chip area and power cost, compared with the pure single-port VM scheme.