You-Ming Tsao | Researchain

Publication

Featured research published by You-Ming Tsao.

International Conference on Multimedia and Expo | 2007

Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU

Chuan-Yiu Lee; Yu-Cheng Lin; Chi-Ling Wu; Chin-Hsiang Chang; You-Ming Tsao; Shao-Yi Chien

In this paper, multi-pass and frame-parallel algorithms are proposed to accelerate various motion estimation (ME) tools in H.264 on the graphics processing unit (GPU). By using the multi-pass method to unroll and rearrange the multiple nested loops, integer-pel ME can be implemented as a two-pass process on the GPU. Fractional ME needs six passes for frame interpolation with the six-tap filter and for motion vector refinement. ME with multiple reference frames can be implemented as a two-pass process with a frame-level parallel scheme that uses the SIMD vector operations of the GPU. Experimental results show that, compared with CPU-only implementations, speed-ups of about 6 to 56 times can be achieved for different ME algorithms.
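As an illustration of the six-tap filter mentioned in the abstract above, the following is a minimal CPU sketch of H.264 luma half-sample interpolation along one row, using the standard taps (1, -5, 20, 20, -5, 1) with rounding and division by 32. The function names `clip8` and `half_pel_row` are hypothetical, clamping at the row ends is a simplifying assumption, and the code does not reproduce the paper's GPU pass structure.

```python
# Sketch only: 1-D half-sample interpolation with the H.264 luma six-tap filter.
# Names and the edge-clamping policy are assumptions made for illustration.

TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-sample filter taps (sum = 32)


def clip8(value):
    """Clamp a value to the 8-bit sample range [0, 255]."""
    return max(0, min(255, value))


def half_pel_row(row):
    """Return the half-sample values lying between row[i] and row[i + 1].

    Samples outside the row are replaced by the nearest edge sample.
    """
    n = len(row)
    out = []
    for i in range(n - 1):
        acc = 0
        for k, tap in enumerate(TAPS):
            j = min(max(i - 2 + k, 0), n - 1)  # clamp the index at the edges
            acc += tap * row[j]
        out.append(clip8((acc + 16) >> 5))  # round, then divide by 32
    return out
```

On a linear ramp the interior half-sample falls midway between its neighbours, which is a quick sanity check on the taps.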

International Symposium on Circuits and Systems | 2006

Multi-pass algorithm of motion estimation in video encoding for generic GPU

Yu-Cheng Lin; Pei-Lun Li; Chin-Hsiang Chang; Chi-Ling Wu; You-Ming Tsao; Shao-Yi Chien

Video encoding has grown rapidly in importance as the need for video data communication has become widespread. In this paper, we propose a multi-pass algorithm to accelerate motion estimation (ME), the dominant part of video encoding, on the graphics processing unit (GPU). By using the multi-pass method to unroll and rearrange the multiple nested loops, the complex ME can be implemented on the GPU. In addition, ME can be executed efficiently with the built-in parallel processing and texture filter of the GPU. Experimental results show that, by utilizing the computing power of the GPU, speed-ups of about two times and 14 times can be achieved for integer-pel ME and MPEG-1/2 half-pel ME, respectively.
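The nested loops referred to in the abstract above can be seen in a plain full-search integer-pel ME. The sketch below is a sequential CPU version that uses the sum of absolute differences (SAD) as the matching cost; it is not the paper's multi-pass GPU formulation. The function names `sad` and `full_search`, the frame layout (lists of rows), and the choice of SAD are assumptions made for illustration.

```python
# Sketch only: exhaustive integer-pel block matching on the CPU.
# Frames are lists of rows; all names here are hypothetical.


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, n, rng):
    """Try every displacement in [-rng, rng] x [-rng, rng] and return
    (best_motion_vector, best_sad). Candidates that leave the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width
                      and 0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The four loops (two over candidates, two over pixels) are independent across candidates and blocks, which is the kind of structure that lends itself to parallel evaluation.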

IEEE Transactions on Multimedia | 2009

High-Quality Mipmapping Texture Compression With Alpha Maps for Graphics Processing Units

Chih-Hao Sun; You-Ming Tsao; Shao-Yi Chien

Texture compression is an important technique in graphics processing units (GPUs) for saving memory bandwidth. This paper presents a high-quality mipmapping texture compression (MTC) system with alpha maps. Based upon the wavelet transform, a hierarchical approach is adopted for mipmapping textures in the YCbCr color space and alpha channel. By inspecting the similarity between the alpha and luminance channels, the two channels are efficiently encoded together with linear prediction in the differential mode. In addition, the split mode manages textures with no strong relationship between the alpha and luminance channels. A layer overlapping technique is also proposed to reduce the texture memory bandwidth. Simulation results show that MTC can reduce the texture access traffic by 80% to 90% and provides high image quality as well. Compared with DirectX texture compression (DXTC), the most well-known texture compression with alpha maps, MTC reduces the texture access bandwidth by 30% more. VLSI implementation results show that the hardware cost of MTC is similar to that of DXTC and that MTC is suitable for integration in GPUs to provide high-quality textures with low memory bandwidth requirements.

International Conference on Multimedia and Expo | 2009

Universal Rasterizer with edge equations and tile-scan triangle traversal algorithm for graphics processing units

Chih-Hao Sun; You-Ming Tsao; Ka-Hang Lok; Shao-Yi Chien

The rasterization stage in a graphics processing unit (GPU), which consists of triangle setup, rasterization, and parameter interpolation with plane equations, requires a very large number of operations and is usually the performance bottleneck. For real-time applications, a Universal Rasterizer (UR) with edge equations and a tile-scan triangle traversal algorithm are proposed for low-cost graphics rendering. In UR, the basic functions for parameter interpolation and rasterization can be executed on universal shared hardware to reduce the cost. The results show that it can minimize the processing time of triangle traversal and guarantee no reiteration during traversal. With hardware sharing and the architecture design techniques of pipelining and scheduling, it can meet the real-time requirements of graphics applications at a reasonable hardware cost.
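To make the idea of edge equations with tile-based traversal concrete, here is a minimal software sketch. It is not the paper's UR hardware or its specific tile-scan order. It assumes integer vertex coordinates, samples at integer pixel positions, treats all edges as inclusive (real rasterizers usually apply a fill rule such as top-left), and uses hypothetical names `edge` and `rasterize`.

```python
# Sketch only: edge-function coverage test with tile-by-tile traversal.
# Integer vertices, inclusive edges, and the 4 x 4 tile size are assumptions.


def edge(ax, ay, bx, by, px, py):
    """Edge function: positive on one side of the line A->B, negative on the
    other, zero on the line."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height, tile=4):
    """Return the list of (x, y) pixels covered by the triangle, visiting
    each tile, and each pixel inside it, at most once."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []  # degenerate triangle
    if area < 0:  # swap two vertices so that inside means all edges >= 0
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    edges = [(x0, y0, x1, y1), (x1, y1, x2, y2), (x2, y2, x0, y0)]

    # Bounding box clamped to the screen.
    min_x, max_x = max(min(x0, x1, x2), 0), min(max(x0, x1, x2), width - 1)
    min_y, max_y = max(min(y0, y1, y2), 0), min(max(y0, y1, y2), height - 1)

    covered = []
    for ty in range(min_y // tile * tile, max_y + 1, tile):
        for tx in range(min_x // tile * tile, max_x + 1, tile):
            tx1 = min(tx + tile - 1, width - 1)
            ty1 = min(ty + tile - 1, height - 1)
            corners = [(tx, ty), (tx1, ty), (tx, ty1), (tx1, ty1)]
            # Edge functions are linear, so a tile whose four corners are all
            # outside the same edge cannot contain any covered pixel.
            if any(all(edge(*e, cx, cy) < 0 for cx, cy in corners) for e in edges):
                continue
            for y in range(max(ty, min_y), min(ty1, max_y) + 1):
                for x in range(max(tx, min_x), min(tx1, max_x) + 1):
                    if all(edge(*e, x, y) >= 0 for e in edges):
                        covered.append((x, y))
    return covered
```

Because the tiles partition the bounding box, no pixel is tested twice, and the same edge-function evaluation could also supply barycentric weights for parameter interpolation.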

International SoC Design Conference | 2008

Energy-saving techniques for low-power graphics processing unit

Chia-Ming Chang; Shao-Yi Chien; You-Ming Tsao; Chih-Hao Sun; Ka-Hang Lok; Yu-Jung Cheng

This paper presents a graphics processing unit with energy-saving techniques. Several techniques and architectures are proposed to achieve high performance with low power consumption. First, a low-power core pipeline is designed with a 2-issue VLIW architecture to reduce power consumption while achieving a processing capability of 400 MFLOPS or 800 MOPS. In addition, an inter/intra adaptive multi-threading scheme increases performance by increasing hardware utilization, and the proposed configurable memory array architecture reduces the off-chip memory access frequency by caching both input data and output results. Furthermore, for graphics applications, a geometry-content-aware technique called early-rejection-after-transformation is proposed to remove redundant operations for invisible triangles. For circuit-level power reduction, power-aware frequency scaling is proposed to further reduce the power consumption.

IEEE Journal of Solid-State Circuits | 2008

An 8.6 mW 25 Mvertices/s 400-MFLOPS 800-MOPS 8.91 mm² Multimedia Stream Processor Core for Mobile Applications

Shao-Yi Chien; You-Ming Tsao; Chin-Hsiang Chang; Yu-Cheng Lin

To meet the demands of mobile multimedia applications, a stream processor core is designed with an area of 8.91 mm² in 0.18 μm CMOS technology at 50 MHz. Several techniques and architectures are proposed to achieve high performance with low power consumption. First, an optimized core pipeline is designed with a 2-issue VLIW architecture to achieve a processing capability of 400 MFLOPS or 800 MOPS. In addition, an adaptive multi-thread scheme increases performance by increasing hardware utilization, and the proposed configurable memory array architecture reduces the off-chip memory access frequency by caching both input data and output results. Furthermore, for graphics applications, a geometry-content-aware technique called early-rejection-after-transformation is proposed to remove redundant operations for invisible triangles. For video applications, the proposed video accelerating instruction set supports motion estimation for video coding. Experimental results show that an 86% power reduction and a speedup of more than ten times for the VLIW architecture can be achieved with the proposed techniques, providing a processing speed of 25 Mvertices/s at a power consumption of 8.6 mW. Moreover, CIF (352 × 288) 30 fps video encoding with a search range of {H[-24,24], V[-16,16]} is also supported by the proposed stream processor. By supporting both video and graphics functions, this highly efficient, high-performance, low-power processor core is applicable to multimedia mobile devices.

High Performance Graphics | 2009

CFU: multi-purpose configurable filtering unit for mobile multimedia applications on graphics hardware

Chih-Hao Sun; Ka-Hang Lok; You-Ming Tsao; Chia-Ming Chang; Shao-Yi Chien

To increase the capability of mobile GPUs in image/video processing, this paper proposes a multi-purpose configurable filtering unit (CFU), a new configurable unit for image filtering on a stream processing architecture. The CFU is located in the texture unit of a GPU and can efficiently execute many kinds of filtering operations by directly accessing the multi-bank texture cache through specially designed data paths. The proposed CFU supports the following kinds of programmability. First, different sampling point windows can be selected by programmers. Second, the arithmetic type of the filter can be chosen. In addition to the original texture filtering functions and finite impulse response (FIR) filters, morphological operations from computer vision are also embedded in the CFU. Furthermore, the weighting coefficients of FIR filters and morphological operations can be defined by programmers. Simulation results show that, on average, compared with a conventional texture unit, the CFU reduces processing time by 25.35% in H.264/AVC motion compensation and by 58.6% in video segmentation.

Symposium on VLSI Circuits | 2007

An 8.6mW 12.5Mvertices/s 800MOPS 8.91mm² Stream Processor Core for Mobile Graphics and Video Applications

You-Ming Tsao; Chin-Hsiang Chang; Yu-Cheng Lin; Shao-Yi Chien; Liang-Gee Chen

An 8.6 mW stream processor core for mobile applications is implemented with 8.91 mm² area in 0.18 μm CMOS technology at 50 MHz. An adaptive multi-thread architecture with a configurable memory array and a geometry-content-aware technique are proposed to reduce power consumption while achieving 12.5 Mvertices/s for 3D graphics and motion estimation with search range {H[-24,24),V[-16,16]} for CIF (352 × 288) 30 fps video encoding.

International Conference on Consumer Electronics | 2006

[Title missing in source]

You-Ming Tsao; Shao-Yi Chien; Chin-Hsiang Chang; Chung-Jr Lian; Liang-Gee Chen

In this paper, a low-power programmable vertex shader with video coding acceleration instructions is proposed. For mobile multimedia applications, supporting both video and graphics is a promising trend. The proposed programmable graphics engine features a unified architecture that can efficiently execute not only vertex shader operations for graphics but also the motion estimation of video coding algorithms. It can achieve a processing speed of 8.3M vertex geometry transformations per second and 6.25M polygons per second at a working frequency of 50 MHz and a power consumption of 20 mW. Furthermore, the floating/fixed-point data path, the reconfigurable memory, and special instructions are designed to accelerate the key operation in video coding, motion estimation. The execution of motion estimation on the proposed graphics engine is shown to be 80 times faster than on RISC-type processors and can meet real-time video coding requirements with the diamond search algorithm for 30 CIF (352 × 288) frames per second with a [-16, 15] search range. This powerful graphics and video dual-function programmable engine is shown to be a good solution for multimedia consumer products.

Picture Coding Symposium | 2009

[Title missing in source]

Yu-Chi Su; Sung-Fang Tsai; Tzu-Der Chuang; You-Ming Tsao; Liang-Gee Chen

Scalable Video Coding (SVC) is an advanced video compression technique that can provide temporal, spatial, and quality scalability to terminals with different network conditions. SVC adopts layered coding techniques to improve coding efficiency for spatial and quality scalability. Upsampling and inter-layer prediction are two important mechanisms for removing redundant information between different layers. However, upsampling, which occupies around 75% of the memory bandwidth of the SVC decoder, causes serious performance degradation, especially for high-resolution applications. Moreover, inter-layer prediction with complex scheduling makes it difficult to map the SVC decoder onto parallel hardware. In this paper, we propose a method to parallelize the SVC decoder on a multi-core stream processor platform with both efficiency and flexibility. We focus on the mapping issues of spatial scalability, supporting various resolutions of decoded frames. Experimental results show that the proposed SVC decoder design reduces the memory bandwidth of the upsampling module by 95% compared with JSVM running on a single general-purpose processor.
