Tu-Chih Wang
National Taiwan University
Publication
Featured research published by Tu-Chih Wang.
IEEE Transactions on Circuits and Systems | 2006
Ching-Yeh Chen; Shao-Yi Chien; Yu-Wen Huang; Tung-Chien Chen; Tu-Chih Wang; Liang-Gee Chen
Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead than previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design uses a two-dimensional distortion array and one adder tree with a reference buffer that maximizes data reuse between successive search candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing capability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, its huge memory bandwidth, is overcome: we save 99.9% of off-chip memory bandwidth and 99.22% of on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208 kbits of on-chip memory.
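The SAD reuse that makes VBSME affordable can be illustrated in software. The following is a minimal sketch, not the authors' architecture: it assumes a 16×16 macroblock, computes the sixteen 4×4 SADs once for a single search candidate, and sums neighbouring results to obtain the SADs of the larger H.264 partitions. The function names `sad_4x4_grid` and `vbsme_sads` are hypothetical.

```python
import numpy as np

def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 block."""
    diff = np.abs(cur.astype(np.int32) - ref.astype(np.int32))
    # Axes after reshape: (block_row, row_in_block, block_col, col_in_block).
    return diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))

def merge(grid, blocks_high, blocks_wide):
    """Sum groups of neighbouring 4x4 SADs into SADs of larger partitions."""
    rows, cols = grid.shape
    shaped = grid.reshape(rows // blocks_high, blocks_high,
                          cols // blocks_wide, blocks_wide)
    return shaped.sum(axis=(1, 3))

def vbsme_sads(cur, ref):
    """SADs for all seven partition sizes (named width x height) of one candidate."""
    s4 = sad_4x4_grid(cur, ref)
    return {
        "4x4": s4,
        "8x4": merge(s4, 1, 2),
        "4x8": merge(s4, 2, 1),
        "8x8": merge(s4, 2, 2),
        "16x8": merge(s4, 2, 4),
        "8x16": merge(s4, 4, 2),
        "16x16": merge(s4, 4, 4),
    }

rng = np.random.default_rng(0)
cur = rng.integers(0, 256, size=(16, 16))
ref = rng.integers(0, 256, size=(16, 16))
sads = vbsme_sads(cur, ref)
total_blocks = sum(v.size for v in sads.values())  # 16 + 8 + 8 + 4 + 2 + 2 + 1
```

Only the 4×4 level touches pixels; every larger partition is obtained with additions alone, which is the property an adder tree exploits in hardware.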
international solid-state circuits conference | 2005
Yu-Wen Huang; Tung-Chien Chen; Chen-Han Tsai; Ching-Yeh Chen; To-Wei Chen; Chi-Shi Chen; Chun-Fu Shen; Shyh-Yih Ma; Tu-Chih Wang; Bing-Yu Hsieh; Hung-Chi Fang; Liang-Gee Chen
An H.264/AVC encoder is implemented on a 31.72 mm² die with 0.18 μm CMOS technology. A four-stage macroblock pipelined architecture encodes 720p 30 frames/s HDTV videos in real time at 108 MHz. The encoded video quality is competitive with reference software requiring 3.6 TOPS on a general-purpose processor-based platform.
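The figures above imply an average cycle budget per macroblock, which can be checked with simple arithmetic. This sketch assumes that 720p means 1280×720 pixels and that each frame is divided into 16×16 macroblocks; the helper name `cycles_per_macroblock` is hypothetical, and the result is an average throughput figure, not a statement about how the pipeline stages are scheduled.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Average number of clock cycles available per macroblock."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return clock_hz / (mbs_per_frame * fps)

# 80 x 45 = 3600 macroblocks per frame, 108,000 macroblocks per second.
budget = cycles_per_macroblock(1280, 720, 30, 108_000_000)
print(budget)  # 1000.0
```

Under these assumptions each macroblock has about 1000 cycles of throughput budget at 108 MHz.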
international symposium on circuits and systems | 2003
Tu-Chih Wang; Yu-Wen Huang; Hung-Chi Fang; Liang-Gee Chen
Transform coding has been widely used in video coding standards. In this paper, a hardware architecture for accelerating transform coding operations in MPEG-4 AVC/H.264 is presented. The architecture processes four inputs in parallel using previously described fast algorithms. The transpose operations are implemented by a register array with directional transfers. The architecture has been mapped to a 4×4 multiple-transform unit and synthesized in TSMC 0.35 μm technology. The multiple-transform processor can process 320M pixels/s at 80 MHz for all 4×4 transforms used in MPEG-4 AVC/H.264.
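To make the row-transform, transpose, column-transform structure concrete, here is a minimal software sketch of the H.264 4×4 forward core transform using the standard add/shift butterfly. It is an illustration only: the abstract covers all 4×4 transforms and does not spell out its fast algorithms, so the butterfly below and the names `butterfly_1d` and `forward_4x4` are assumptions. Four inputs per cycle at 80 MHz is consistent with the quoted 320M pixels/s.

```python
import numpy as np

# H.264 forward core transform matrix (scaling is folded into quantization).
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def butterfly_1d(x):
    """1-D transform of four samples using only additions and doublings."""
    x0, x1, x2, x3 = x
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [s0 + s1, 2 * d0 + d1, s0 - s1, d0 - 2 * d1]

def forward_4x4(block):
    """2-D transform: rows first, transpose, columns, transpose back."""
    rows = [butterfly_1d(r) for r in block]
    cols = [butterfly_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

rng = np.random.default_rng(1)
block = rng.integers(-255, 256, size=(4, 4))
fast = np.array(forward_4x4(block.tolist()))
direct = CF @ block @ CF.T
```

The two `zip(*...)` calls play the role that the transpose register array plays in the hardware description: the same 1-D unit serves both passes.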
international symposium on circuits and systems | 2003
Yu-Wen Huang; Tu-Chih Wang; Bing-Yu Hsieh; Liang-Gee Chen
Variable block size motion estimation is adopted in the new video coding standard, MPEG-4 AVC/JVT/ITU-T H.264, due to its superior performance compared to the advanced prediction mode in MPEG-4 and H.263+. In this paper, we modify the reference software in a hardware-friendly way. Our main idea is to convert the sequential processing of each 8×8 sub-partition of a macroblock into parallel processing without sacrificing video quality. Based on our algorithm, we propose a new hardware architecture for variable block size motion estimation with full search at integer-pixel accuracy. The features of our design are a 2-D processing element array with 1-D data broadcasting and 1-D partial result reuse, a parallel adder tree, a memory interleaving scheme, and high utilization. Simulation shows that our chip can run in real time at an operating frequency of 64.11 MHz for 720×480 frames at 30 Hz with a search range of [-24, +23] in the horizontal direction and [-16, +15] in the vertical direction, which requires more than 50 GOPS of computation.
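The scale of this full search can be worked out from the quoted numbers. The sketch below assumes 16×16 macroblocks and one SAD evaluation per integer search position; the variable names are hypothetical, and the comparison between cycles and candidates is an observation drawn from the arithmetic, not a claim made in the abstract.

```python
def search_candidates(h_range, v_range):
    """Number of integer-pixel positions in an inclusive search window."""
    (h_lo, h_hi), (v_lo, v_hi) = h_range, v_range
    return (h_hi - h_lo + 1) * (v_hi - v_lo + 1)

candidates = search_candidates((-24, 23), (-16, 15))   # 48 * 32 positions
mbs_per_second = (720 // 16) * (480 // 16) * 30        # 45 * 30 macroblocks, 30 Hz
cycles_per_mb = 64.11e6 / mbs_per_second               # average cycle budget

# Absolute differences per second if every candidate compares all 256 pixels.
pixel_diffs_per_second = candidates * 256 * mbs_per_second

print(candidates, mbs_per_second, round(cycles_per_mb))
```

With 1536 candidates and a budget of roughly 1583 cycles per macroblock, the quoted clock rate leaves about one cycle per search candidate, which is why a highly parallel processing element array is needed.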
international conference on multimedia and expo | 2003
Yu-Wen Huang; To-Wei Chen; Bing-Yu Hsieh; Tu-Chih Wang; Te-Hao Chang; Liang-Gee Chen
This paper presents an efficient VLSI architecture for the deblocking filter in H.264/JVT/AVC. We use an array of 8×4 8-bit shift registers with reconfigurable data path to support both horizontal filtering and vertical filtering on the same circuit (a parallel-in parallel-out reconfigurable FIR filter). Two SRAM modules are carefully organized not only for the storage of current macroblock data and adjacent block data but also for the efficient access of pixels in different blocks. Simulation results show that under 0.25 μm technology, the synthesized logic gate count is only 19.1 K (not including a 96×32 SRAM and a 64×32 SRAM) when the maximum frequency is 100 MHz. Our architecture design can easily support real-time deblocking of 720p (1280×720) 30 Hz video. It is valuable for platform-based design of H.264 codec.
international conference on acoustics, speech, and signal processing | 2003
Yu-Wen Huang; Bing-Yu Hsieh; Tu-Chih Wang; Shao-Yi Chien; Shyh-Yih Ma; Chun-Fu Shen; Liang-Gee Chen
In the new video coding standard, MPEG-4 AVC/JVT/H.264, motion estimation is allowed to use multiple reference frames. The reference software adopts a full search scheme, and the increased computation is in proportion to the number of searched reference frames. However, the reduction of prediction residues is highly dependent on the nature of the sequences, not on the number of searched frames. We present a method to speed up the matching process for multiple reference frames. For each macroblock, we analyze the available information after intra prediction and motion estimation from the previous frame to determine whether it is necessary to search more frames. The information we use includes selected mode, inter prediction residues, intra prediction residues, and motion vectors. Simulation results show that the proposed algorithm can save up to 90% of unnecessary frames while keeping the average miss rate of optimal frames less than 4%.
IEEE Transactions on Circuits and Systems for Video Technology | 2001
Jun-Fu Shen; Tu-Chih Wang; Liang-Gee Chen
In this paper, a low-power full-search block matching (FSBM) motion-estimation design for the ITU-T recommendation H.263+ standard is proposed. New motion-estimation modes in H.263+ can be fully supported by our architecture. Unlike most previously presented motion-estimation chips, this design can deal with 8×8 and 16×16 block sizes with different searching ranges. The proposed architecture is composed of an integer-pixel unit with 64 processing elements, a half-pixel unit with interpolation, a control unit, and data registers. In order to minimize power consumption, gated clocks and dual supply voltages are used. This design has been realized in TSMC 0.6 μm SPTM CMOS technology. The power consumption is 423.8 mW at 60 MHz and the throughput is 36 fps in CIF format.
international symposium on circuits and systems | 2003
Hung-Chi Fang; Tu-Chih Wang; Chung-Jr Lian; Te-Hao Chang; Liang-Gee Chen
This paper presents a high-speed, memory-efficient architecture for embedded block coding with optimized truncation (EBCOT) tier-1 in JPEG 2000. By coding all the bitplanes in parallel, the state variable memory can be eliminated. The proposed architecture can process 50 M coefficients per second at 100 MHz, which is enough to encode the 720p HDTV picture format at 30 fps in real time.
international solid-state circuits conference | 2004
Hung-Chi Fang; Chao-Tsung Huang; Yu-Wei Chang; Tu-Chih Wang; Po-Chih Tseng; Chung-Jr Lian; Liang-Gee Chen
An 81 MS/s JPEG 2000 single-chip encoder is implemented on a 5.5 mm² die using 0.25 μm CMOS technology. This IC can encode HDTV 720p resolution at 30 frames/s in real time. The rate-distortion optimized chip encodes tile size of 128×128, code block size of 64×64, and image size up to 32K×32K.
IEEE Transactions on Circuits and Systems for Video Technology | 2005
Hung-Chi Fang; Yu-Wei Chang; Tu-Chih Wang; Chung-Jr Lian; Liang-Gee Chen
This paper presents a parallel architecture for the Embedded Block Coding (EBC) in JPEG 2000. The architecture is based on the proposed word-level EBC algorithm. By processing all the bit planes in parallel, the state variable memories for the context formation (CF) can be completely eliminated. The length of the FIFO (first-in first-out) buffer between the CF and the arithmetic encoder (AE) is optimized by a reconfigurable FIFO architecture. To reduce the hardware cost of the parallel architecture, we propose a folded AE architecture. The parallel EBC architecture can losslessly process 54 MSamples/s at 81 MHz, which can support HDTV 720p resolution at 30 frames/s.
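One way to see why a word-level view can remove state memory is that the significance a bit-plane coder normally carries from plane to plane is already encoded in the higher bits of each coefficient. The sketch below is a simplified illustration under that assumption, not the authors' algorithm: it ignores neighbour contexts and the coding passes within a plane, and the function names are hypothetical.

```python
def significant_before(magnitude, plane):
    """True if any bit above `plane` is set (plane 0 is the least significant)."""
    return (magnitude >> (plane + 1)) != 0

def sequential_significance(magnitudes, num_planes):
    """Reference model: scan planes from MSB to LSB while keeping stored state."""
    state = [False] * len(magnitudes)
    seen_at_start = {}
    for plane in range(num_planes - 1, -1, -1):
        seen_at_start[plane] = list(state)  # state visible when this plane begins
        for i, m in enumerate(magnitudes):
            if (m >> plane) & 1:
                state[i] = True
    return seen_at_start

magnitudes = [0, 1, 5, 12, 31, 8, 3, 16]
stored = sequential_significance(magnitudes, num_planes=5)
# The same information computed directly from each word, with no stored state.
derived = {p: [significant_before(m, p) for m in magnitudes] for p in range(5)}
```

Because `derived` needs only the coefficient word itself, every plane can be evaluated independently, which is the precondition for processing all planes at once.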