Shen-Fu Hsiao
National Sun Yat-sen University
Publication
Featured research published by Shen-Fu Hsiao.
international conference on consumer electronics | 2000
Shen-Fu Hsiao; Yor-Chin Tai; Kai-Hsiang Chang
A VLSI architecture for the embedded zerotree wavelet (EZW) algorithm is presented that enables real-time scalable image coding. A breadth-first, bottom-up search method is adopted for scanning the wavelet coefficients in the ancestor-descendant tree hierarchy, which makes the parent-children relationships easy to locate and increases the processing speed. The symbols generated in the significance mapping (SMAP) process and those in the successive approximation quantization (SAQ) process are encoded independently. Compared to previously proposed architectures, our design leads to fewer transmitted bits and thus alleviates the communication overhead without sacrificing peak signal-to-noise ratio (PSNR). In addition, a simple progressive digital watermarking scheme is included in the EZW coder for the purpose of copyright protection.
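The parent-children relationship and the bottom-up scan mentioned in this abstract can be illustrated with a minimal software sketch. It is not the paper's hardware design: it assumes a square coefficient array whose side is a power of two, decomposed all the way down to a single lowest-frequency coefficient at position (0, 0), and the function names `children`, `descendant_max`, and `is_zerotree_root` are hypothetical.

```python
def children(r, c, n):
    """Children of coefficient (r, c) in an n x n dyadic wavelet layout."""
    if 2 * r >= n or 2 * c >= n:
        return []  # finest-level coefficients have no children
    return [(rr, cc)
            for rr in (2 * r, 2 * r + 1)
            for cc in (2 * c, 2 * c + 1)
            if (rr, cc) != (r, c)]  # (0, 0) must not list itself

def descendant_max(coeffs):
    """Largest magnitude among all descendants of each coefficient."""
    n = len(coeffs)
    dmax = [[0] * n for _ in range(n)]
    # Reverse raster order visits every child before its parent (bottom-up).
    for r in range(n - 1, -1, -1):
        for c in range(n - 1, -1, -1):
            for rr, cc in children(r, c, n):
                dmax[r][c] = max(dmax[r][c], abs(coeffs[rr][cc]), dmax[rr][cc])
    return dmax

def is_zerotree_root(coeffs, dmax, r, c, threshold):
    """True if (r, c) and all of its descendants are below the threshold."""
    return abs(coeffs[r][c]) < threshold and dmax[r][c] < threshold

# Illustrative 4 x 4 coefficient array (values chosen for this example only).
coeffs = [[63, -34, 49, 10],
          [-31, 23, 14, -13],
          [15, 14, 3, -12],
          [-9, -7, -14, 8]]
dmax = descendant_max(coeffs)
```

With a threshold of 32, `is_zerotree_root(coeffs, dmax, 1, 0, 32)` holds because the coefficient -31 and all four of its children are below the threshold, while position (0, 1) fails the test because its own magnitude is 34.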
international conference on consumer electronics | 1999
Shen-Fu Hsiao; Wei-Ren Shiue; Jian-Ming Tseng
A novel low-cost and low-power linear array for computation of discrete cosine transform (DCT) and its inverse is derived from the heterogeneous dependence graphs representing the factorized coefficient matrices. Due to the novel algorithm and the corresponding efficient architectural design, the new DCT/IDCT processor is easily pipelined and the power consumption can be reduced significantly by turning off the operation of arithmetic units whenever possible.
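For reference, the transform pair that such a processor computes can be written directly from its definition. The sketch below is the textbook orthonormal DCT-II and its inverse in floating point, not the factorized linear-array algorithm of the paper; the function names `dct` and `idct` are chosen here for illustration.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for frequency index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II of a sequence, computed in O(n^2)."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]

samples = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
restored = idct(dct(samples))  # matches samples up to rounding error
```

A direct evaluation like this needs on the order of n² multiplications per block, which is the cost that factorized coefficient matrices are meant to reduce.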
international symposium on consumer electronics | 2007
Ruei-Ting Gu; Tse-Chen Yeh; Wei-Sheng Huang; Ting-Yun Huang; Chung-Hua Tsai; Chung-Nan Lee; Ming-Chao Chiang; Shen-Fu Hsiao; Yun-Nan Chang; Ing-Jer Huang
This paper presents a 3D graphics engine specifically designed to minimize hardware cost while providing sufficient computing capability for consumer electronics with small to medium screen sizes (up to 800×600), such as digital television. The 3D engine consists of a fixed full 3D graphics pipeline for both geometry and rendering operations. It provides a standard AHB interface that makes it easy to integrate into an AMBA-based SoC. The development of the 3D engine went through a rigorous design process: from system modeling (using SystemC), RTL implementation, hardware/software co-simulation, and FPGA verification to test chip fabrication. The engine delivers a maximum performance of 3.3 Mvertices/s and 278 Mpixels/s at 139 MHz using 0.18 µm silicon technology with 987 K gates, which is sufficient for most digital television applications. In addition, a complete OpenGL ES 1.1 API, windowing system, Linux operating system, device driver, and 3D performance monitoring tool have been developed for the engine. The performance monitoring tool provides run-time performance information including frame rate, triangle rate, pixel rate, the list of OpenGL functions involved, function counts, and memory utilization. Moreover, a built-in real-time AHB bus tracer monitors the bus activities of the 3D engine and other components on the system bus. The bus tracer captures on-chip bus signals at either the cycle-accurate or the transaction level and applies real-time compression to both levels of signals. With the performance monitoring tool and the bus tracer, the 3D application developer can easily analyze the communication among the components and fine-tune the 3D application to optimize the performance of the entire SoC and to satisfy the performance/cost constraints of consumer electronics. Both the hardware and the software have been carefully verified and demonstrated on an FPGA using the ARM Versatile SoC development board.
international symposium on vlsi design, automation and test | 2013
Shen-Fu Hsiao; Po-Han Wu; Chia-Sheng Wen; Li-Yao Chen
The recent OpenGL ES 2.0 API specification for graphics operations on embedded systems requires programmable vertex shaders to process vertex data. To facilitate 3D coordinate transformation and lighting operations, vertex shaders usually contain a single-instruction multiple-data (SIMD) datapath and a special function unit (SFU). In this paper, we present a new vertex shader processor design in which a recently proposed non-uniform segmentation is adopted in the special function unit in order to reduce the sizes of the lookup tables (LUTs). Both fixed-point and floating-point arithmetic are supported to satisfy the requirements of various precisions and ranges. Compared with recent similar implementations, the proposed design achieves satisfactory energy efficiency, measured as performance normalized by power consumption.
asia and south pacific design automation conference | 2003
Ming-Chih Chen; Shen-Fu Hsiao; Cheng-Hsien Yang
We design a dedicated Ethernet network interface card (NIC) that accelerates video delivery by offloading the overhead of protocol header identification/appending and CRC/checksum calculation, and by speeding up video bit streams with a dedicated video interface. Compared with the same operations on a 50 MHz ARM microcontroller, the NIC system saves 47,000 ns per frame. The NIC also supports the coexistence of the IPv4 and IPv6 standards for future extension. Both FPGA prototyping and a 0.35 µm cell-based design of the NIC system are given.
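The two kinds of calculation that such a NIC offloads can be sketched in software as follows. This is only an illustration of the arithmetic, not the paper's hardware: `internet_checksum` implements the 16-bit one's-complement checksum described in RFC 1071 (used by the IPv4 header, TCP, and UDP), and `frame_crc32` relies on `zlib.crc32`, which uses the same CRC-32 polynomial as the Ethernet frame check sequence. Both function names are chosen here for illustration.

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def frame_crc32(payload: bytes) -> int:
    """CRC-32 of a byte string as an unsigned 32-bit value."""
    return zlib.crc32(payload) & 0xFFFFFFFF

# A 20-byte IPv4 header with its checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
```

A receiver can verify a header by running the same routine over the header with the checksum filled in; a result of zero indicates that the header is intact.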
IEEE Transactions on Circuits and Systems II: Express Briefs | 2015
Shen-Fu Hsiao; Po-Han Wu; Chia-Sheng Wen; Pramod Kumar Meher
Table-lookup-and-addition methods provide multiplierless function evaluation using multiple lookup tables and a multioperand adder. In spite of their high-speed operation, they are only practical in low-precision applications due to the fast increase in table size with precision width. In this brief, we present two methods for table size reduction by decomposing the original table of initial values into two or three tables with fewer entries and/or smaller bit width. The proposed table decompositions do not incur any extra rounding errors so that the original table can be completely recovered. Experimental results demonstrate significant saving of table sizes compared with the best of the prior designs of the multipartite methods.
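The general idea of splitting one table into smaller tables without losing information can be shown with a simple base-plus-offset scheme. This is a generic illustration and not either of the two decompositions proposed in the brief; the table contents, the group size, and the names `decompose`, `lookup`, and `storage_bits` are all assumptions made for this example.

```python
import math

def decompose(table, group):
    """Split a table into one base per group and one small offset per entry."""
    bases = [min(table[i:i + group]) for i in range(0, len(table), group)]
    offsets = [value - bases[i // group] for i, value in enumerate(table)]
    return bases, offsets

def lookup(bases, offsets, group, index):
    """Recover the original entry exactly with one addition."""
    return bases[index // group] + offsets[index]

def storage_bits(values):
    """Bits needed to store non-negative values at a common width."""
    width = max(1, max(values).bit_length())
    return len(values) * width

# Hypothetical 64-entry table of 12-bit samples of a quarter sine wave.
table = [round(4096 * math.sin(math.pi / 2 * i / 64)) for i in range(64)]
bases, offsets = decompose(table, group=8)
original_size = storage_bits(table)
decomposed_size = storage_bits(bases) + storage_bits(offsets)
```

Because each offset is an exact integer difference, the decomposition adds no rounding error, and the combined size of the two tables is smaller than the original whenever the values within a group lie close together.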
international symposium on vlsi design, automation and test | 2008
Ruei-Ting Gu; Wei-Sheng Huang; Chien-Chou Wang; Wen-Chi Shiue; Tsung-Yu Ho; Chung-Hua Tsai; Tzu-Ching Tien; Da-Jing Zhang-Jian; Sheng-Yu Chiu; Ing-Jer Huang; Yun-Nan Chang; Shen-Fu Hsiao; Jin-Hua Hong; Chung-Nan Lee; Ming-Chao Chiang
A tile-based 3D graphics IP is designed to support OpenGL ES 1.0. The test chip runs at 139 MHz and achieves 8.69 Mvertices/s and 278 Mpixels/s with a die size of 15.7 mm². The IP includes embedded circuitry to monitor run-time 3DG characteristics, detect bus protocol errors and inefficiencies, and capture bus traces at various abstraction levels with a compression ratio of up to 98%.
international symposium on vlsi design, automation and test | 2015
Hsu-Kang Dow; Ching-Hua Huang; Chun-Hung Lai; Kai-Hsiang Tsao; Sheng-Chih Tseng; Kun-Yi Wu; Ting-Hsuan Wu; Ho-Chun Yang; Da-Jing Zhang-Jian; Yun-Nan Chang; Steve W. Haga; Shen-Fu Hsiao; Ing-Jer Huang; Shiann-Rong Kuang; Chung-Nan Lee
A multi-threaded programmable shader pipeline 3D graphics SoC with support for OpenGL ES 2.0 has been developed and fabricated. The sample chip is ARMv4T compatible, with a 3D processing capability of 14.9 Mvertices/s and 3.6 Mpixels/s at up to 4K resolution. The die size is 3.85×3.85 mm², with 2.96M gates in a TSMC 90 nm 1P9M CMOS process. The SoC includes software to support the OpenGL ES API libraries, GLSL compilation, and simulation. It also comes with various development tools, including GPU simulators for hardware validation, profile-assisted compiler optimization, and compiler verification. We also present a QEMU-based simulation platform and an SoC Performance Monitoring Tool Suite (PMTS) to assist developers in optimizing the system and detecting performance bottlenecks.
international symposium on vlsi design, automation and test | 2014
Shen-Fu Hsiao; Wen-Ling Wang; Po-Sheng Wu
Dynamic programming (DP)-based stereo matching consists of three major parts: matching cost computation (M.C.C.), minimum cost accumulation (M.C.A.), and disparity optimization (D.O.). This paper presents two implementation architectures: array-based and memory-based. The array-based implementation is a systolic-like design consisting of regularly connected processing elements (PEs). The memory-based design replaces most of the PEs with memory units in order to reduce area cost. Both architectures adopt double buffering in order to process contiguous images. Experimental results show that the proposed design can achieve real-time processing speed at reasonable area cost.
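The three parts named in this abstract can be sketched in software for a single pair of scanlines. The version below is a minimal illustration under stated assumptions rather than the paper's architecture: it uses absolute intensity difference as the matching cost and a linear penalty on disparity changes between neighboring pixels, and the function name `dp_scanline_disparity` and its `penalty` parameter are hypothetical.

```python
def dp_scanline_disparity(left, right, max_disp, penalty=1.0):
    """Disparity per pixel of one scanline via dynamic programming."""
    n = len(left)
    inf = float("inf")
    disps = range(max_disp + 1)

    # 1) Matching cost computation: |left[x] - right[x - d]|, or infinity
    #    when the candidate match falls outside the right scanline.
    cost = [[abs(left[x] - right[x - d]) if x - d >= 0 else inf for d in disps]
            for x in range(n)]

    # 2) Minimum cost accumulation, left to right, remembering the best
    #    predecessor disparity for every (x, d).
    acc = [row[:] for row in cost]
    back = [[0] * (max_disp + 1) for _ in range(n)]
    for x in range(1, n):
        for d in disps:
            prev = min(disps, key=lambda p: acc[x - 1][p] + penalty * abs(d - p))
            back[x][d] = prev
            acc[x][d] = cost[x][d] + acc[x - 1][prev] + penalty * abs(d - prev)

    # 3) Disparity optimization: pick the cheapest end state and backtrack.
    d = min(disps, key=lambda k: acc[n - 1][k])
    result = [0] * n
    for x in range(n - 1, -1, -1):
        result[x] = d
        d = back[x][d]
    return result

# The left scanline is the right scanline shifted by two pixels.
right = [10, 50, 20, 80, 30, 60, 40, 90]
left = [0, 0, 10, 50, 20, 80, 30, 60]
disparity = dp_scanline_disparity(left, right, max_disp=3)
```

The accumulation step dominates the work, since every pixel and disparity pair examines all predecessor disparities; this regular inner loop is the kind of computation that maps naturally onto an array of processing elements.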
digital systems design | 2014
Shen-Fu Hsiao; Chia-Sheng Wen; Po-Han Wu
Function evaluation is an important arithmetic computation in many signal processing applications, such as the special function unit in modern graphics processing units (GPUs). In function evaluation using piecewise polynomial approximation, the lookup table (LUT) usually takes up a significant portion of the total area. Many papers have proposed approaches to reduce the table size without sacrificing the required precision. This paper presents new LUT compression methods that introduce no extra errors yet further reduce the table size in piecewise polynomial approximation with uniform or non-uniform segmentation.
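To show where the lookup table enters a piecewise polynomial approximation, here is a minimal sketch using degree-1 polynomials and uniform segmentation. It works in floating point rather than the fixed-point formats a hardware unit would use, it does not include the compression methods of the paper, and the names `build_lut` and `evaluate` as well as the choice of 2^x as the target function are assumptions made for this example.

```python
import math

def build_lut(f, segments):
    """One (c0, c1) pair per uniform segment of [0, 1)."""
    h = 1.0 / segments
    lut = []
    for s in range(segments):
        x0 = s * h
        c0 = f(x0)                      # value at the segment start
        c1 = (f(x0 + h) - f(x0)) / h    # slope across the segment
        lut.append((c0, c1))
    return lut

def evaluate(lut, x):
    """Approximate f(x) for 0 <= x < 1 with one table read, one multiply, one add."""
    segments = len(lut)
    s = min(int(x * segments), segments - 1)  # leading bits select the segment
    c0, c1 = lut[s]
    return c0 + c1 * (x - s / segments)

lut = build_lut(lambda x: 2.0 ** x, segments=16)
max_error = max(abs(evaluate(lut, i / 1000) - 2.0 ** (i / 1000)) for i in range(1000))
```

Halving the segment width cuts the interpolation error of a linear fit by roughly a factor of four but doubles the number of table entries, which is why table size grows quickly with the target precision.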