
Changqing Xun

National University of Defense Technology

Publication

Featured researches published by Changqing Xun.

acm multimedia | 2009

Streaming HD H.264 encoder on programmable processors

Nan Wu; Mei Wen; Wei Wu; Ju Ren; Huayou Su; Changqing Xun; Chunyuan Zhang

Programmable processors have a great advantage over dedicated ASIC designs under intense time-to-market pressure. However, real-time encoding of high-definition (HD) H.264 video (up to 1080p) is a challenge for most existing programmable processors. Meanwhile, model-based design is widely accepted for developing complex media programs. The stream model, an emerging model-based programming method, shows surprising efficiency in many compute-intensive domains, especially media processing. On this basis, this paper proposes a set of streaming techniques for H.264 encoding and develops all of the code based on the x264 reference code. Our streaming H.264 encoder is a pure software implementation written entirely in a high-level language, without special hardware or algorithm support. Measured execution results show that our encoder achieves significant speedups over the original x264 encoder on various programmable architectures: 1.8x on an x86 Core™2 E8200, 3.7x on a MIPS 4KEc, 5.5x on a TMS320 C6416 DSP, and 6.1x on the STORM-SP16 G220 stream processor. In particular, on the STORM processor the streaming encoder reaches 30.6 frames per second for a 1080p HD sequence, satisfying the real-time requirement. These results indicate that streaming is extremely efficient for this kind of media workload. Our work is also applicable to other media processing applications and provides architectural insights for dedicated ASIC or FPGA HD H.264 encoders.

international symposium on microarchitecture | 2008

On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator

Mei Wen; Nan Wu; Chunyuan Zhang; Qianming Yang; Jun Ren; Yi He; Wei Wu; Jun Chai; Maolin Guan; Changqing Xun

With the extension of application domains, hardware-managed memory structures such as caches are drawing attention for dealing with irregular stream applications. However, since a real application usually has both regular and irregular stream characteristics, conventional stream register files, caches, or combinations thereof have shortcomings. This article focuses on combining software- and hardware-managed memory structures and presents a new syncretic memory system based on the FT64 stream accelerator.

annual acis international conference on computer and information science | 2006

Analysis and Performance Results of a Fluid Dynamics Application on MASA Stream Processor

Mei Wen; Nan Wu; Changqing Xun; Wei Wu; Chunyuan Zhang

This paper begins with a study of the MASA stream architecture, followed by a discussion of the stream algorithm on MASA and experiments that use a cycle-accurate hardware simulator to analyze the performance of the implementation and to measure application run time. A comparison with code for traditional general-purpose processors confirms MASA's potential to deliver high performance.

parallel and distributed computing: applications and technologies | 2012

Improving Performance of GPU Specific OpenCL Program on CPUs

Qiang Lan; Changqing Xun; Mei Wen; Huayou Su; Lifang Liu; Chunyuan Zhang

OpenCL provides a unified programming interface for various parallel computing platforms. The OpenCL framework offers good functional portability: programs can run without modification on any platform that supports OpenCL. However, most OpenCL programs are optimized for massively parallel processors such as GPUs, and it is hard to achieve good performance on general multi-core processors without sophisticated modification of the GPU-specific OpenCL programs. The major reason is the immense gap between CPU and GPU architectures. In this paper, we evaluate the performance portability of OpenCL programs between CPU and GPU and analyse why GPU-specific OpenCL programs are a poor fit for CPUs. Based on the profiling, we propose three optimization strategies for improving the performance of GPU-specific OpenCL programs on CPUs: increasing the granularity of task partitioning, optimizing the usage of the memory hierarchy, and block-based data accessing. In addition, we apply the proposed techniques to several benchmarks. The experimental results show that the optimized OpenCL programs achieve speedups of 2x to 4x on CPUs compared with their corresponding GPU-specific versions.
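
As a rough illustration of the first strategy (increasing the granularity of task partitioning), the sketch below models OpenCL work-items as loop iterations in plain Python. It is not the authors' code: the function names `saxpy_fine` and `saxpy_coarse`, the SAXPY workload, and the `chunk` parameter are all assumptions chosen for illustration. It only shows that the same result can be computed with far fewer, larger work-items, each covering a contiguous block of data.

```python
def saxpy_fine(a, x, y):
    """GPU-style partition: one logical work-item per output element."""
    out = [0.0] * len(x)
    work_items = 0
    for gid in range(len(x)):          # each iteration models one work-item
        out[gid] = a * x[gid] + y[gid]
        work_items += 1
    return out, work_items


def saxpy_coarse(a, x, y, chunk):
    """CPU-style partition: one work-item handles a contiguous block."""
    n = len(x)
    out = [0.0] * n
    work_items = 0
    for start in range(0, n, chunk):   # each iteration models one coarse work-item
        for i in range(start, min(start + chunk, n)):
            out[i] = a * x[i] + y[i]   # block-based, contiguous accesses
        work_items += 1
    return out, work_items


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y = [1.0] * 10
    fine, n_fine = saxpy_fine(2.0, x, y)
    coarse, n_coarse = saxpy_coarse(2.0, x, y, chunk=4)
    print(fine == coarse, n_fine, n_coarse)
```

On a real CPU OpenCL runtime, fewer and larger work-items would typically reduce scheduling overhead per element; the Python model only captures the partitioning, not the performance.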

ieee international conference on high performance computing data and analytics | 2007

FT64: scientific computing with streams

Mei Wen; Nan Wu; Chunyuan Zhang; Wei Wu; Qianming Yang; Changqing Xun

This paper describes FT64 and Multi-FT64, single- and multicoprocessor systems designed for high performance scientific computing with streams. We give a detailed case study of porting the Mersenne Prime Search problem to FT64 and Multi-FT64 systems. We discuss several special problems associated with streamizing, such as kernel processing granularity, stream organization and workload partitioning for a multi-processor, which are generally applicable to other scientific codes on FT64. Finally, we perform experiments with eight typical scientific applications on FT64. The results show that a 500MHz FT64 achieves over 50% of its peak performance and a 4.2x peak speedup over 1.6GHz Itanium2. An eight processor Multi-FT64 system achieves 6.8x peak speedup over a single FT64.

international conference on image analysis and recognition | 2005

Accelerated motion estimation of H.264 on Imagine stream processor

Haiyan Li; Mei Wen; Chunyuan Zhang; Nan Wu; Li Li; Changqing Xun

Imagine is a stream-based prototype processor designed for media processing. It uses a three-level bandwidth hierarchy to exploit parallelism and data locality, and it performs well in media processing. H.264 is the newest digital video coding standard. It achieves high coding efficiency at the cost of complex computation. In addition, video pictures have natural stream features, such as good spatial locality and limited temporal dependency. This paper presents an accelerated implementation on the Imagine stream processor of motion estimation, the most time-consuming part of the H.264 coding framework. Experimental results show that the coding speed for the QCIF format can reach 372 fps, surpassing the real-time requirement. The acceleration from stream processing is significant, showing that H.264 coding is well suited to implementation on Imagine.

european conference on parallel processing | 2014

Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs

Dafei Huang; Mei Wen; Changqing Xun; Dong Chen; Xing Cai; Yuran Qiao; Nan Wu; Chunyuan Zhang

When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. When executing GPU-specific kernels on CPUs, local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns by using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by removing all the unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation can satisfactorily improve OpenCL kernel performance on a Sandy Bridge CPU and Intel’s Many-Integrated-Core coprocessor.
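
The idea of dropping local-memory staging and its barrier can be pictured with a small model in plain Python. This is only a sketch under stated assumptions, not the paper's transformation: the 3-point averaging stencil, the function names `blur_with_local` and `blur_direct`, and the `group_size` parameter are hypothetical, and work-groups are modelled as loop iterations. The first version stages each work-group's data into a local tile before computing (where an OpenCL kernel would need a barrier); the second reads the global array directly.

```python
def blur_with_local(src, group_size):
    """GPU-style: stage a tile (with halo) per work-group, then compute."""
    n = len(src)
    out = [0.0] * n
    for g0 in range(0, n, group_size):       # one iteration per work-group
        g1 = min(g0 + group_size, n)
        lo = max(g0 - 1, 0)                  # left halo, clamped at the edge
        hi = min(g1 + 1, n)                  # right halo, clamped at the edge
        tile = src[lo:hi]                    # models the copy into local memory
        # An OpenCL kernel would place a local-memory barrier here.
        for gid in range(g0, g1):            # one iteration per work-item
            l = gid - lo
            left = tile[l - 1] if gid > 0 else tile[l]
            right = tile[l + 1] if gid < n - 1 else tile[l]
            out[gid] = (left + tile[l] + right) / 3.0
    return out


def blur_direct(src):
    """CPU-style: no tile and no barrier, reads go straight to the source."""
    n = len(src)
    out = [0.0] * n
    for i in range(n):
        left = src[i - 1] if i > 0 else src[i]
        right = src[i + 1] if i < n - 1 else src[i]
        out[i] = (left + src[i] + right) / 3.0
    return out


if __name__ == "__main__":
    data = [float(v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]
    print(blur_with_local(data, group_size=3) == blur_direct(data))
```

Both versions perform the same arithmetic in the same order, so their outputs match exactly; on a CPU the direct version relies on the hardware cache instead of an explicit tile.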

multimedia signal processing | 2011

Data Parallelism Exploiting for H.264 Encoder

Mei Wen; Ju Ren; Nan Wu; Huayou Su; Changqing Xun; Chunyuan Zhang

Real-time H.264 encoding of high-definition (HD) video (up to 1080p) is a challenging workload for most existing programmable processors. Novel programmable parallel processors such as stream processors, graphics processing units (GPUs), and DSPs offer a different and very promising technology for these demands. Thus, parallel computing for H.264 encoding on these processors is becoming a hot research topic. It is challenging because most emerging parallel processors focus on supporting data-level parallelism (DLP), while the dependencies inherent in the traditional H.264 encoding algorithm significantly restrict the exploitation of DLP. To address this challenge, this paper presents data-parallel processing methods for key modules of the H.264 encoder that eliminate the dependency restriction. The results show that key modules, including intra-prediction, inter-prediction, and CAVLC, achieve significant speedups on a stream processor by using these data-parallel processing methods.

ieee international conference on high performance computing, data, and analytics | 2009

Cache streamization for high performance stream processor

Nan Wu; Mei Wen; Ju Ren; Yi He; Changqing Xun; Wei Wu; Chunyuan Zhang

Because stream applications place high bandwidth demands on the memory system, most stream processors use software-managed streaming memory. However, this memory hurts ease of programming, compatibility, and support for irregular stream access, which hinders the use of stream processors in broader application domains. Hardware-managed coherent caches overcome these shortcomings of software-managed streaming memory, but with side effects caused by their lack of stream support. To address this problem, this paper develops a streamization cache whose performance is comparable to that of streaming memory but which is easier to use. The paper presents the motivation and details of the proposed design, including three stream-specific cache techniques covering data fetch policy, replacement policy, and multi-client access. Moreover, a streamization cache instance is implemented in FT64, a 64-bit high-performance stream processor. Based on a set of streaming application benchmarks, the paper estimates the performance, power consumption, and area cost of the proposed architecture. Results show that these streamization techniques for caches are worthwhile.

2009 Fourth International Conference on Embedded and Multimedia Computing | 2009

A Framework for Stream Programming on DSP

Changqing Xun; Mei Wen; Wei Wu; Chunyuan Zhang

There has recently been much interest in stream processing, both in industry (Cell, Storm series, NVIDIA G80, AMD FIRESTREAM) and academia (IMAGINE). Researchers have accelerated many applications in media processing, scientific computing, and signal processing with a special programming style called stream programming. This paper presents a framework for programming DSPs in this style. Stream programs can run on a DSP without any architectural support. H.264 encoding is selected to evaluate our technique. The results show that significant speedup is achieved, ranging from 3.2x for CAVLC up to 7.1x for analysis.
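
For readers unfamiliar with the stream programming style mentioned above, the following plain-Python sketch shows the general shape: data is cut into fixed-size blocks (streams), and small kernels are applied block by block in a pipeline. It is a generic illustration, not the framework from the paper; the names `stream_blocks`, `run_kernel`, `gather`, `scale`, `offset`, and `pipeline`, as well as the block size, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def stream_blocks(data: List[int], block: int) -> Iterator[List[int]]:
    """Cut an input array into fixed-size stream blocks (strip-mining)."""
    for start in range(0, len(data), block):
        yield data[start:start + block]


def run_kernel(kernel: Callable[[int], int],
               blocks: Iterable[List[int]]) -> Iterator[List[int]]:
    """Apply a kernel to every element of every incoming block."""
    for b in blocks:
        yield [kernel(v) for v in b]


def gather(blocks: Iterable[List[int]]) -> List[int]:
    """Write the output stream back into one flat array."""
    out: List[int] = []
    for b in blocks:
        out.extend(b)
    return out


def scale(v: int) -> int:
    return 2 * v


def offset(v: int) -> int:
    return v + 1


def pipeline(data: List[int], block: int) -> List[int]:
    """Chain two kernels so each block flows through both before the next."""
    s = stream_blocks(data, block)
    s = run_kernel(scale, s)
    s = run_kernel(offset, s)
    return gather(s)


if __name__ == "__main__":
    print(pipeline([1, 2, 3, 4, 5], block=2))
```

Because the stages are generators, each block passes through both kernels before the next block is loaded, which loosely mirrors keeping a working set small enough for on-chip memory on a DSP.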
