
Mei Wen

National University of Defense Technology




Publication

Featured research published by Mei Wen.

british machine vision conference | 2015

Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals.

Dafei Huang; Lei Luo; Mei Wen; Zhaoyun Chen; Chunyuan Zhang

Among the increasingly complicated trackers in the visual tracking area, recently proposed correlation filter based trackers have achieved appealing performance despite their great simplicity and superior speed. However, the filter input is a bounding box of fixed size, so they cannot inherently adapt to changes in the target’s scale and aspect ratio. Although scale-adaptive variants have been proposed, they are not flexible enough because of their pre-defined scale sampling. Moreover, to the best of our knowledge, no correlation filter variant has been proposed to handle aspect ratio variation. To tackle this problem, this paper integrates the class-agnostic detection proposal method, which is widely adopted in the object detection area, into a correlation filter tracker, and presents the KCFDP tracker. The correlation filter part of KCFDP is based on KCF [2] with some modifications. We extend the HOG feature in KCF to a combination of HOG, intensity, and color naming by simply concatenating the three features, resulting in 42 feature channels. The model updating scheme in KCF, which is simple linear interpolation, is replaced with a more robust scheme presented in [1]. EdgeBoxes [4] is adopted to generate flexible detection proposals and enable the scale and aspect ratio adaptability of our tracker. It traverses the whole image in a sliding window manner, and scores every sampled bounding box according to the number of contours that are wholly enclosed. To accelerate EdgeBoxes and produce fewer unnecessary proposals, we set the minimum proposal area and aspect ratio range dynamically in sliding window sampling according to the current target size. In the tracking pipeline, KCF is first performed to estimate the preliminary target location $l_d$. Within a patch $z_d$ extracted from the current frame, KCF locates the target center according to the location of the maximum element in $f$: $f(z_d) = k^{x z_d} \cdot \alpha$ (1)
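To make the detection step around Eq. (1) more concrete, the following is a minimal NumPy sketch of a correlation filter's train/detect cycle. It is not the KCFDP implementation: it assumes a linear kernel (swapped in for simplicity, whatever kernel the paper uses), a Gaussian regression target, and random arrays standing in for the 42-channel features. The function names and the regularization value `lam` are illustrative assumptions.

```python
import numpy as np


def gaussian_labels(height, width, sigma=2.0):
    """Regression target peaked at index (0, 0), wrapping around circularly."""
    rows = np.arange(height)
    cols = np.arange(width)
    rows = np.minimum(rows, height - rows)  # circular distance from row 0
    cols = np.minimum(cols, width - cols)   # circular distance from column 0
    return np.exp(-(rows[:, None] ** 2 + cols[None, :] ** 2) / (2.0 * sigma ** 2))


def linear_kernel_hat(x, z):
    """Fourier transform of the linear kernel correlation, summed over channels."""
    x_hat = np.fft.fft2(x, axes=(0, 1))
    z_hat = np.fft.fft2(z, axes=(0, 1))
    return np.sum(np.conj(x_hat) * z_hat, axis=2)


def train(x, y, lam=1e-4):
    """Solve the kernel ridge regression in the Fourier domain (assumed linear kernel)."""
    return np.fft.fft2(y) / (linear_kernel_hat(x, x) + lam)


def detect(alpha_hat, x, z):
    """Return the response map f(z) and the index of its maximum element."""
    response = np.real(np.fft.ifft2(linear_kernel_hat(x, z) * alpha_hat))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, peak


rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32, 42))             # stand-in for 42 feature channels
alpha_hat = train(template, gaussian_labels(32, 32))
shifted = np.roll(template, shift=(3, 5), axis=(0, 1))   # target moved by (3, 5) pixels
response, peak = detect(alpha_hat, template, shifted)
print(peak)
```

Because the search patch is a circular shift of the template, the maximum of the response map lands at the shift offset, which is how the preliminary target location is read off the response.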

acm multimedia | 2009

Streaming HD H.264 encoder on programmable processors

Nan Wu; Mei Wen; Wei Wu; Ju Ren; Huayou Su; Changqing Xun; Chunyuan Zhang

Programmable processors have a great advantage over dedicated ASIC designs under intense time-to-market pressure. However, real-time encoding of high-definition (HD) H.264 video (up to 1080p) is a challenge for most existing programmable processors. On the other hand, model-based design is widely accepted in developing complex media programs. The stream model, an emerging model-based programming method, shows surprising efficiency in many compute-intensive domains, especially media processing. On this basis, this paper proposes a set of streaming techniques for H.264 encoding, and then develops all of the code based on the X264 reference code. Our streaming H.264 encoder is a pure software implementation written entirely in a high-level language without special hardware or algorithm support. Real execution results show that our encoder achieves significant speedup over the original X264 encoder on various programmable architectures: 1.8x on the x86 Core™2 E8200, 3.7x on the MIPS 4KEc, 5.5x on the TMS320 C6416 DSP, and 6.1x on the stream processor STORM-SP16 G220. In particular, on the STORM processor, the streaming encoder achieves 30.6 frames per second for a 1080p HD sequence, satisfying the real-time requirement. These results indicate that streaming is extremely efficient for this kind of media workload. Our work is also applicable to other media processing applications, and provides architectural insights for dedicated ASIC or FPGA HD H.264 encoders.

international conference on parallel and distributed systems | 2012

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation

Nan Wu; Mei Wen; Huayou Su; Ju Ren; Chunyuan Zhang

Efficient mapping of a real-time HD video application to graphics hardware is challenging. Developers face the challenges of choosing the right parallelism model, balancing thread granularity across the massive computing resources on the GPU, and partitioning tasks between the CPU and GPU. This paper illustrates the mapping approaches through the case of an HD H.264 encoder based on the X264 reference code, and then evaluates it in depth on state-of-the-art CPUs and GPUs. We first split most of the computing tasks into Single-Instruction Multiple-Thread (SIMT) kernels, which are then chained into certain input/output data streams. We then implement a complete H.264 encoder on the compute unified device architecture (CUDA) platform. Finally, we present methods for exploiting multi-level parallelism and memory efficiency when mapping H.264 code, which we use to increase the efficiency of execution on GPUs. Our experimental results show that the GPU is used efficiently and that real-time encoding performance is achieved with CUDA.

international symposium on microarchitecture | 2008

On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator

Mei Wen; Nan Wu; Chunyuan Zhang; Qianming Yang; Jun Ren; Yi He; Wei Wu; Jun Chai; Maolin Guan; Changqing Xun

With the extension of application domains, hardware-managed memory structures such as caches are drawing attention for dealing with irregular stream applications. However, since a real application usually has both regular and irregular stream characteristics, conventional stream register files, caches, or combinations thereof have shortcomings. This article focuses on combining software- and hardware-managed memory structures and presents a new syncretic memory system based on the FT64 stream accelerator.

annual computer security applications conference | 2004

Multiple-dimension scalable Adaptive Stream Architecture

Mei Wen; Nan Wu; Haiyan Li; Chunyuan Zhang

Intensive processing applications, such as scientific computation, signal processing, and graphics rendering, motivate new processor architectures that place new burdens on the designer. These applications, called stream applications, demand very high arithmetic rates and data bandwidth, but lack data reuse. At present, modern VLSI technology makes arithmetic units relatively cheap. MASA (Multiple-dimension scalable Adaptive Stream Architecture), presented in this paper, is a prototype that operates on streams directly. It differs from DSPs and special high-performance single-chip architectures because it combines flexibility and high performance. It has the basic features of all stream processing: it provides a bandwidth hierarchy, keeps the ALU array executing at full load, and decomposes an application into a set of computation modules that execute with space-multiplexing or time-multiplexing. The multiple-dimension scalability of MASA, which spans the task, loop, instruction, and data levels, enables it to meet the demands of stream applications. This paper describes the MASA architecture and stream model in the first half, and explores the features and advantages of MASA through mapping stream applications to hardware in the second half.

The Journal of Supercomputing | 2013

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference

Jun Chai; Huayou Su; Mei Wen; Xing Cai; Nan Wu; Chunyuan Zhang

Bayesian inference is one of the most important methods for estimating phylogenetic trees in bioinformatics. Due to the potentially huge computational requirements, several parallel algorithms for Bayesian inference have been implemented to run on CPU-based clusters, multicore CPUs, or small clusters of CPUs and GPUs. To the best of our knowledge, however, none of the existing methods is able to simultaneously and fully utilize both CPUs and GPUs for the computations, leaving idle either the CPU part or the GPU part of modern heterogeneous supercomputers. Aiming at an optimized utilization of heterogeneous computing resources, which are a promising hardware architecture for future bioinformatics applications, we present a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, which combines MPI, OpenMP, and CUDA programming. The novelty of our algorithm, denoted as oMC3, is its ability to use CPU cores simultaneously with GPUs for the computations, while ensuring a fair work division between the two types of hardware components. We have implemented oMC3 based on MrBayes, which is one of the most popular software packages for Bayesian phylogenetic inference. Numerical experiments show that oMC3 obtains a 2.5× speedup over nMC3, which is a cutting-edge GPU implementation of MrBayes, on a single server consisting of two GPUs and sixteen CPU cores. Moreover, oMC3 scales nicely when 128 GPUs and 1536 CPU cores are in use.
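The abstract does not say how oMC3 divides work between CPU cores and GPUs. As a rough illustration of what a "fair work division" can mean, the sketch below splits a number of independent work items across devices in proportion to a measured throughput, using largest-remainder rounding so that every item is assigned. The device names, the rates, and the `split_work` helper are all hypothetical.

```python
def split_work(total_items, throughputs):
    """Split total_items across workers in proportion to their throughput."""
    total_rate = sum(throughputs.values())
    if total_rate <= 0:
        raise ValueError("at least one worker must have positive throughput")

    # Ideal (fractional) share of each worker.
    exact = {name: total_items * rate / total_rate for name, rate in throughputs.items()}
    # Round down first, then hand out the leftover items one by one
    # to the workers with the largest fractional remainders.
    shares = {name: int(value) for name, value in exact.items()}
    leftover = total_items - sum(shares.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - shares[name], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical relative throughputs (work items per second) measured in a warm-up run.
rates = {"gpu0": 40.0, "gpu1": 40.0, "cpu_cores": 20.0}
shares = split_work(64, rates)
print(shares)
```

With such a split, the slower device type receives fewer items, so all devices finish at roughly the same time instead of one side sitting idle.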

embedded systems for real-time multimedia | 2009

Software parallel CAVLC encoder based on stream processing

Ju Ren; Yi He; Wei Wu; Mei Wen; Nan Wu; Chunyuan Zhang

Real-time encoding of high-definition H.264 video is a challenge for current embedded programmable processors. Emerging stream processing methods supported by most GPUs and programmable processors provide a powerful mechanism for achieving surprisingly high performance in media and signal processing, which brings an opportunity to deal with this challenge. However, traditional serial CAVLC has highly input-dependent execution and precedence constraints, which make it a bottleneck in implementing an H.264 encoder efficiently. This paper presents a software parallel CAVLC encoder based on stream processing. Many approaches are explored to overcome the restrictions on parallelizing CAVLC caused by data dependency and branch/loop instructions. Experimental results show that our parallel CAVLC encoder achieves 3.03x and 2.08x speedup over the original serial CAVLC on two stream processing platforms, STORM and GPU, respectively. Finally, the proposed parallel CAVLC encoder coupled with a stream processor enables real-time encoding of 1080p H.264 video.
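One generic way to remove the serial dependency in variable-length coding, not necessarily the approach taken in this paper, is a two-pass scheme: encode every block independently, take a prefix sum of the code lengths to find each block's bit offset, then let every block write into its own disjoint slice of the output. The sketch below uses unsigned Exp-Golomb codes as a toy stand-in for the real CAVLC tables, and plain loops where a stream processor or GPU would run the per-block work in parallel.

```python
from itertools import accumulate


def exp_golomb(value):
    """Unsigned Exp-Golomb code word for a non-negative integer, as a bit string."""
    binary = bin(value + 1)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_block(symbols):
    """Encode one block on its own; no state is shared with other blocks."""
    return "".join(exp_golomb(s) for s in symbols)


def two_pass_encode(blocks):
    # Pass 1 (parallelizable): every block is encoded independently.
    local = [encode_block(block) for block in blocks]
    lengths = [len(bits) for bits in local]

    # Exclusive prefix sum: bit offset at which each block starts in the stream.
    offsets = list(accumulate(lengths, initial=0))[:-1]

    # Pass 2 (parallelizable): each block writes to a disjoint output slice.
    output = ["0"] * sum(lengths)
    for bits, start in zip(local, offsets):
        output[start:start + len(bits)] = list(bits)
    return "".join(output), offsets


blocks = [[0, 1], [3], [2, 0]]
stream, offsets = two_pass_encode(blocks)
print(stream, offsets)
```

The result is bit-identical to encoding the blocks one after another; only the prefix sum needs coordination between blocks.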

The Scientific World Journal | 2014

Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

Huayou Su; Mei Wen; Nan Wu; Ju Ren; Chunyuan Zhang

By reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for the H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a speedup of 20 times over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR ranges from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computation power of the GPU. However, the performance of the control-intensive parts (CAVLC) depends heavily on the memory bandwidth, which gives an insight for new architecture design.
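For readers unfamiliar with the quality metric quoted above, PSNR for 8-bit video is 10·log10(255² / MSE), and a "PSNR loss" is the difference between the PSNR of two reconstructions of the same source. The sketch below computes both on made-up sample values; the numbers are illustrative only and are unrelated to the paper's measurements.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit luma samples: an original row and two reconstructions of it.
original = [52, 55, 61, 66]
serial_recon = [52, 55, 60, 66]
parallel_recon = [54, 55, 60, 66]

psnr_loss = psnr(original, serial_recon) - psnr(original, parallel_recon)
print(f"PSNR loss: {psnr_loss:.2f} dB")
```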

international conference on image and graphics | 2011

A Multilevel Parallel Intra Coding for H.264/AVC Based on CUDA

Huayou Su; Nan Wu; Chunyuan Zhang; Mei Wen; Ju Ren

In this paper, we propose a multilevel parallel intra coding method for H.264/AVC based on the compute unified device architecture (CUDA). The proposed parallel algorithm improves the parallelism between 4x4 blocks within a macroblock (MB) by discarding some insignificant prediction modes. By partitioning a frame into multiple slices, the parallelism between MBs can be exploited. In addition, a scalable parallel method for kernels is introduced to improve the performance of the proposed intra coding. Experimental results show that a speedup of more than 20 times can be achieved with the assistance of the GPU. Moreover, the entire encoder can meet the real-time processing requirement for HDTV.

Concurrency and Computation: Practice and Experience | 2015

Communication-hiding programming for clusters with multi-coprocessor nodes

Xinnan Dong; Mei Wen; Jun Chai; Xing Cai; Mandan Zhao; Chunyuan Zhang

Future exascale systems are expected to adopt compute nodes that incorporate many accelerators. To shed some light on the upcoming software challenge, this paper investigates the particular topic of programming clusters that have multiple Xeon Phi coprocessors in each compute node. A new offload approach is considered for intra‐node communication, which combines Intel's APIs of coprocessor offload infrastructure (COI) and symmetric communication interface (SCIF) to achieve low latency. While the conventional pragma‐based offload approach allows simpler programming, the COI‐SCIF approach has three advantages: (1) lower overhead associated with launching offloaded code, (2) higher data transfer bandwidths, and (3) more advanced asynchrony between computation and data movement. The low‐level COI‐SCIF approach is also shown to have benefits over the MPI‐OpenMP counterpart, which belongs to the symmetric usage mode. Moreover, a hybrid programming strategy based on COI‐SCIF is presented for joining the computational force of all CPUs and coprocessors, while realizing communication hiding. All the programming approaches are tested with a real‐world 3D application, for which the COI‐SCIF‐based approach shows a performance advantage on Tianhe‐2.
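Communication hiding, in general terms, means starting the next data transfer before the current computation finishes so that the two overlap. The sketch below shows that pattern with a Python background thread standing in for an asynchronous transfer; it does not use or imitate the COI or SCIF APIs, and `fetch`, `compute`, and the chunk layout are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(chunk_id):
    """Placeholder for a data transfer; returns the data of one chunk."""
    return list(range(chunk_id * 4, chunk_id * 4 + 4))


def compute(data):
    """Placeholder for the computation performed on one chunk."""
    return sum(value * value for value in data)


def pipelined(num_chunks):
    """Process chunks while the next chunk's transfer runs in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)                # start the first transfer
        for chunk_id in range(num_chunks):
            data = pending.result()                    # wait for the current chunk
            if chunk_id + 1 < num_chunks:
                pending = pool.submit(fetch, chunk_id + 1)  # prefetch the next chunk
            results.append(compute(data))              # overlaps with the prefetch
    return results


print(pipelined(3))
```

The results match a plain sequential loop; the benefit of the pattern is only in elapsed time, and it shows up when transfers and computation take comparable time.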
