Shuwei Sun
National University of Defense Technology
Publications
Featured research published by Shuwei Sun.
high performance computing and communications | 2007
Shuwei Sun; Dong Wang; Shuming Chen
This paper proposes a highly efficient MBRP parallel algorithm for the H.264 encoder, based on an analysis of the encoder's data dependencies. The algorithm partitions each video frame into several MB regions, each consisting of several adjoining columns of macroblocks (MBs) and encoded by one processor of a multiprocessor system. At the start of encoding, the wave-front technique is adopted, so the processors begin encoding in order. In the MBRP parallel algorithm, the amount of data exchanged between processors is small and the load across processors is balanced. The algorithm encodes the video sequence efficiently without affecting the compression ratio. Simulation results show that the proposed MBRP parallel algorithm achieves higher speedups than previous approaches, with the same encoding quality as JM 10.2.
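The wave-front start-up described in this abstract can be illustrated with a coarse scheduling model. The sketch below is not the paper's algorithm: it assumes the standard H.264 neighbor dependencies (left, top, top-left, top-right), treats one MB row inside one region as a unit task, and conservatively makes each task wait for the whole neighboring task to finish. The function name `wavefront_schedule` and the region and row counts are hypothetical.

```python
def wavefront_schedule(num_regions, mb_rows):
    """Return finish[p][r]: the step at which processor p finishes MB row r
    of its column region, assuming every (region, row) task takes one step."""
    finish = [[0] * mb_rows for _ in range(num_regions)]
    for r in range(mb_rows):
        for p in range(num_regions):
            deps = []
            if r > 0:
                # Top neighbors inside the same region (same processor).
                deps.append(finish[p][r - 1])
            if p > 0:
                # Left neighbor: the region to the left, same MB row.
                deps.append(finish[p - 1][r])
            if r > 0 and p + 1 < num_regions:
                # Top-right neighbor: the region to the right, previous MB row.
                deps.append(finish[p + 1][r - 1])
            finish[p][r] = max(deps, default=0) + 1
    return finish


# Hypothetical example: 4 column regions, 3 MB rows per frame.
schedule = wavefront_schedule(num_regions=4, mb_rows=3)
total_steps = max(max(row) for row in schedule)
speedup = (4 * 3) / total_steps  # sequential steps / parallel steps
```

In this model, processor p finishes row r at step p + 2r + 1, so the staggered start costs a few steps per frame. Real implementations can overlap more work, because only the last MB of a region row needs the top-right neighbor from the next region.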
high performance computing and communications | 2010
Shenggang Chen; Shuming Chen; Huitao Gu; Hu Chen; Yaming Yin; Xiaowen Chen; Shuwei Sun; Sheng Liu; Yaohua Wang
The emergence of large-scale chip multicore processors makes a highly parallel on-chip H.264/AVC encoder feasible. To reduce the data reload frequency, a hierarchical chip multicore DSP platform with 64 DSP cores in total is designed to accommodate the computation- and data-intensive H.264/AVC encoder. To increase parallelism, this paper exploits macroblock-level parallelism using the wave-front algorithm. Centralized shared memory in the super nodes of the hierarchical DSP platform provides more local space to hold frequently used data and reduces the bandwidth requirement. Subtask-level parallelism within motion estimation, intra prediction, and mode decision is further exploited to keep the DSP cores in a super node busy even when only one macroblock is assigned to it. Because too few macroblocks are available during the filling and emptying stages of encoding a frame, the super nodes cannot be kept busy all the time; speedups of 13, 24, 26, and 49 are achieved for QCIF, SIF, CIF, and HD sequences, respectively. To further improve the speedups and make the best use of processor resources, frame-level parallelism should be exploited with a carefully tuned memory allocation policy.
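The filling and emptying effect mentioned in this abstract can be made concrete by counting how many macroblocks are ready at each wave-front step. The sketch below assumes the textbook wave-front ordering, in which macroblock (x, y) becomes ready at step x + 2y. It ignores the platform's super nodes and subtask-level parallelism, so it is an illustration rather than a model of the reported speedups. The function name `wavefront_profile` is hypothetical. A QCIF frame (176x144 luma samples) has 11x9 macroblocks of 16x16 samples.

```python
from collections import Counter


def wavefront_profile(mb_cols, mb_rows):
    """Number of macroblocks that can be processed at each wave-front step,
    assuming macroblock (x, y) is ready at step x + 2*y."""
    counts = Counter(x + 2 * y for y in range(mb_rows) for x in range(mb_cols))
    last_step = max(counts)  # largest step index that occurs
    return [counts[t] for t in range(last_step + 1)]


qcif = wavefront_profile(mb_cols=11, mb_rows=9)
peak_parallelism = max(qcif)            # most macroblocks ready in one step
average_parallelism = sum(qcif) / len(qcif)
```

Under these assumptions the QCIF profile starts at 1, 1, 2, 2, ... ready macroblocks, peaks at 6, and falls back to 1 at the end of the frame, which is why small frames leave most of a large core array idle.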
IEEE Transactions on Circuits and Systems for Video Technology | 2010
Shenggang Chen; Shuming Chen; Shuwei Sun
Due to its high computational complexity and poor parallelism, the context-based adaptive binary arithmetic coder (CABAC) is increasingly a bottleneck in large-scale parallel video encoders such as H.264 on a manycore platform. Motivated by the organization of the context models in CABAC, this letter presents a tri-thread parallel evolution of CABAC. The evolved coder, named P3-CABAC, statically divides syntax elements into three predefined groups, each of which forms a parallel thread. Since P3-CABAC is thread-level parallelizable and implementation-friendly for manycore processors, it is a rational choice for entropy coding in block-based video encoders in the manycore era.
congress on image and signal processing | 2008
Shuwei Sun; Shuming Chen
CABAC is the entropy coding method adopted in the H.264/AVC main profile. RDO is the mode decision method recommended by the JVT, which markedly improves encoder performance. However, it drastically increases the computational complexity of video encoding. In this paper, we propose a novel fast intra-mode decision algorithm for H.264 based on a transform-domain bit-rate estimation technique. The Lagrange cost is calculated from estimated rates instead of the actual rates after entropy coding, which avoids entropy coding during mode decision and reduces encoding time. In addition, the intra-mode decisions for the luma and chroma components are performed independently, which avoids the two-layer nested iteration in mode decision. Simulation results show that the proposed algorithm saves about 80% of total encoding time, while the losses in PSNR and bit rate are negligible for IIII sequences.
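As a rough illustration of the mode decision described in this abstract, the sketch below evaluates the usual rate-distortion Lagrange cost J = D + lambda * R with an estimated rate in place of the entropy-coded rate, and picks luma and chroma modes independently. The rate estimator itself is not shown. The mode names, distortion values, estimated bit counts, and lambda are hypothetical numbers chosen for the example, not data from the paper.

```python
def lagrange_cost(distortion, est_bits, lam):
    """Rate-distortion cost J = D + lambda * R, with R an estimated rate."""
    return distortion + lam * est_bits


def best_mode(candidates, lam):
    """candidates maps mode name -> (distortion, estimated bits).
    Returns the mode name with the smallest Lagrange cost."""
    return min(candidates, key=lambda m: lagrange_cost(*candidates[m], lam))


# Hypothetical per-mode measurements for one macroblock.
luma = {"vertical": (90.0, 40.0), "dc": (100.0, 20.0)}
chroma = {"dc": (30.0, 8.0), "plane": (25.0, 16.0)}

# Independent decisions: len(luma) + len(chroma) cost evaluations,
# instead of len(luma) * len(chroma) for a nested luma/chroma search.
choice_high_lambda = (best_mode(luma, 1.0), best_mode(chroma, 1.0))
choice_low_lambda = (best_mode(luma, 0.1), best_mode(chroma, 0.1))
```

With the larger lambda the cheaper-to-code modes win, while the smaller lambda favors the lower-distortion modes, which is the trade-off the Lagrange multiplier controls.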
parallel and distributed computing: applications and technologies | 2010
Huitao Gu; Shuming Chen; Shuwei Sun; Shenggang Chen; Xiaowen Chen
Motion estimation is the most computationally intensive part of a typical video encoder. This paper introduces a novel prediction method based on multiple search centers, which efficiently improves prediction accuracy. The search range is dynamically adjusted according to the number and magnitude of the predicted search centers in order to reduce computational complexity. Based on multiple-search-center prediction and dynamic search range adjustment, a new fast algorithm is proposed. Experimental results show that, on average, the proposed algorithm achieves rate-distortion performance similar to the original FFS, UMHexagonS, and EPZS algorithms in the H.264 reference software, while reducing computational complexity by about 97.4%, 63.0%, and 42.9%, respectively.
annual computer security applications conference | 2007
Dong Wang; Xiaowen Chen; Shuming Chen; Xing Fang; Shuwei Sun
Multi-core Digital Signal Processors (DSPs) place significant demands on data storage and memory performance in high-performance embedded applications. Scratch-pad memories (SPMs) are low-capacity, high-speed on-chip memories mapped with global addresses; embedded applications prefer them to traditional caches because of their better real-time characteristics. We construct a new Fast Close-Coupled Shared Data Pool (FCC-SDP) based on SPMs for our multi-core DSP project. FCC-SDP is organized as a multibank parallel structure with double-bank interleaved access modes and provides a fast transmission path for fine-grained shared data among DSP cores. We build a behavioral simulator of FCC-SDP and complete the design implementation. Simulation experiments with several typical benchmarks show that FCC-SDP captures the fine-grained shared data in multi-core applications well and achieves average speedup ratios of 1.1 and 1.14 compared with traditional shared L2 caches and DMA transmission modes, respectively.
ieee computer society annual symposium on vlsi | 2015
Xiaowen Chen; Zhonghai Lu; Yang Li; Axel Jantsch; Xueqian Zhao; Shuming Chen; Yang Guo; Zonglin Liu; Jianzhuang Lu; Jianghua Wan; Shuwei Sun; Shenggang Chen; Hu Chen
3D many-core NoCs are emerging architectures for future high-performance single chips because they integrate many processor cores and memories by stacking multiple layers. In such architectures, because processor cores and memories reside in different locations (center, corner, edge, etc.), memory accesses behave differently depending on their communication distances, and the performance (latency) gap between different memory accesses grows as the network size is scaled up. As a result, some memory accesses may suffer very high latencies, degrading system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this should be done with care, since shortening the latency of one memory access can worsen the latency of another because network resources are shared. Therefore, the goal should be to narrow the latency difference among memory accesses. In this paper, we address this goal by prioritizing memory access packets based on the predicted round-trip routing latencies of memory accesses. The communication distance and the number of occupied items in the buffers along the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip routing latency is then used as the basis for arbitrating among memory access packets, so that a memory access with a potentially high latency is transferred as early and as fast as possible, equalizing memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates show that our approach achieves the goal of memory access equalization and outperforms classic round-robin arbitration in terms of maximum latency, average latency, and LSD. In the experiments, the maximum improvements in the maximum latency, the average latency, and the LSD are 80%, 14%, and 45%, respectively.
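The arbitration idea in this abstract can be sketched as follows. The abstract names the two inputs to the prediction (communication distance and buffer occupancy along the remaining path) but not how they are combined, so the linear formula, the weights `hop_cost` and `queue_cost`, and the `Packet` fields below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    pid: str
    hops_remaining: int   # remaining round-trip distance in router hops
    buffered_ahead: int   # occupied buffer items along the remaining path


def predicted_latency(pkt, hop_cost=1, queue_cost=1):
    """Assumed linear estimate of the remaining round-trip latency."""
    return pkt.hops_remaining * hop_cost + pkt.buffered_ahead * queue_cost


def arbitrate(contenders):
    """Grant the output port to the packet with the highest predicted latency.
    Ties go to the first contender in the list."""
    return max(contenders, key=predicted_latency)


# Hypothetical packets competing for the same router output port.
contenders = [
    Packet("A", hops_remaining=2, buffered_ahead=1),
    Packet("B", hops_remaining=6, buffered_ahead=4),
    Packet("C", hops_remaining=5, buffered_ahead=0),
]
winner = arbitrate(contenders)
```

Here packet B wins because it has the longest and most congested remaining path, whereas a round-robin arbiter would grant the port without regard to predicted latency.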
congress on image and signal processing | 2008
Dong Wang; Shuming Chen; Shuwei Sun
Multi-Core Digital Signal Processors (MC-DSPs) often suffer from limited memory bandwidth and long access latency caused by shared-memory structures. Data forwarding is an efficient data speculation technique for hiding memory access latencies. This paper proposes a new Data Stream Clustered Forwarding (DSCF) technique for MC-DSPs with shared-memory structures. DSCF combines data stream forwarding operations triggered by specific forwarding primitives with background transmission controlled by customized control units. DSCF is based on normal cache coherency protocols and offers lower hardware overhead, no pollution of destination DSP caches, and good system scalability. Simulation results show that DSCF reduces the cache coherency miss ratio of our MC-DSP prototype by 47% on average, achieves a speedup ratio of 1.41 over the original structure without data speculation, and improves system performance by about 12% compared with the related data forwarding technique.
Archive | 2012
Shuming Chen; Jianghua Wan; Hengzhu Liu; Yueyue Chen; Shuwei Sun; Zhentao Li; Jianzhuang Lu
Archive | 2006
Dong Wang; Yanan Lu; Shuming Chen; Yang Guo; Shuwei Sun; Xiao Hu; Xing Fang