Li-Hsun Chen
National Chung Cheng University
Publication
Featured research published by Li-Hsun Chen.
International Symposium on Circuits and Systems | 2005
Li-Hsun Chen; Oscal T.-C. Chen; Teng-Yi Wang; Yung-Cheng Ma
A low-power multiplication-accumulation computation (MAC) unit using the radix-4 Booth algorithm is proposed; its architectural complexity is reduced and its switching activity is minimized. To maintain high performance, the critical delays and hardware complexities of MAC units are explored to derive a MAC unit with high performance and low hardware complexity. A carry-save addition operation with optimized compressors is proposed that omits half adders to reduce the hardware complexity further. A scheme to reduce switching activity is also proposed to lower the power consumption of the MAC unit. In performing a MAC for X×Y+Z, the effective dynamic ranges of X and Y are detected; the one with the smaller effective dynamic range is processed for Booth decoding so as to increase the probability of the partial products being zero, and thus the switching activity is reduced. Also, the effective dynamic range of the result from this multiplication is estimated and compared with the effective dynamic range of the datum Z. The larger of the two effective dynamic ranges is taken as the effective word length for the addition operation. Pipelined latches are used to make the non-effective part of the operation retain the status of the previous operation so as to reduce the switching activity of the addition performed in the MAC. After the addition operation, sign extension is performed on the result by copying the effective sign bit to the non-effective bits to arrive at a correct output datum. Compared with conventional MAC units, the proposed one reduces power consumption by 21.09% to 43.74%. Additionally, the proposed MAC unit outperforms conventional ones in terms of the product of critical delay, area, and power consumption.
International Conference on Multimedia and Expo | 2002
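The operand-selection idea in the abstract above can be illustrated in software. The following is a minimal behavioral sketch, not the authors' hardware design: the function names `effective_bits`, `booth_radix4_digits`, and `mac` are hypothetical, the 16-bit default width is an assumption, and counting zero Booth digits is used only as a rough proxy for the zero partial products the abstract mentions.

```python
def effective_bits(v: int) -> int:
    """Smallest two's-complement width (sign bit included) that holds v."""
    magnitude = v if v >= 0 else ~v
    return magnitude.bit_length() + 1


def booth_radix4_digits(y: int, width: int = 16) -> list[int]:
    """Recode a signed integer into radix-4 Booth digits in {-2,...,2}, LSB first."""
    if width % 2:
        width += 1  # radix-4 recoding consumes bits in pairs
    assert -(1 << (width - 1)) <= y < (1 << (width - 1)), "y does not fit in width"
    u = y & ((1 << width) - 1)  # two's-complement bit pattern
    # bits[0] is the implicit 0 to the right of the LSB
    bits = [0] + [(u >> i) & 1 for i in range(width)]
    digits = []
    for i in range(0, width, 2):
        # Overlapping 3-bit window: digit = b(i-1) + b(i) - 2*b(i+1)
        digits.append(bits[i] + bits[i + 1] - 2 * bits[i + 2])
    return digits


def mac(x: int, y: int, z: int, width: int = 16) -> tuple[int, int]:
    """Compute x*y + z; return (result, number of zero Booth digits)."""
    # Booth-decode whichever operand has the smaller effective dynamic range.
    if effective_bits(x) < effective_bits(y):
        x, y = y, x
    digits = booth_radix4_digits(y, width)
    acc = z
    for k, d in enumerate(digits):
        acc += (d * x) << (2 * k)  # partial product weighted by 4**k
    return acc, digits.count(0)
```

With these assumptions, `mac(1234, 3, 10)` and `mac(3, 1234, 10)` both decode the operand 3 and return `(3712, 6)`, whereas Booth-decoding 1234 directly would leave only two zero digits out of eight.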
Li-Hsun Chen; Wei-Lung Liu; Oscal T.-C. Chen; Ruey-Liang Ma
In this work, instruction-level and function-level profile analyses of an MPEG-4 video encoder are performed to design a reconfigurable digital signal processor (DSP) architecture. According to the result of the instruction-level profile analysis, the proposed DSP architecture is equipped with 5 arithmetic logic units (ALUs), 1 multiplier, and 2 load/store units. This line-up of computation units gives the proposed DSP architecture a better parallel processing capability and a higher hardware usage rate in realizing the MPEG-4 video encoder. The result of the function-level profile analysis reveals that motion estimation requires the most computation power. Hence, the proposed DSP architecture reconfigures 4 ALUs and a multiplier into a functional unit for highly parallel processing of motion estimation. This hardware design of motion estimation depends primarily on the adders and multiplier of the proposed DSP architecture, plus a few control circuits to convert the computation units. Such an arrangement has a lower hardware cost than conventional video processors with specialized functional units for motion estimation. Lastly, a benchmark analysis and comparison are carried out between the proposed DSP architecture and the TI TMS320C64x architecture. In processing the MPEG-4 video encoder, the proposed DSP architecture is as much as 80% more efficient in computation than the TI TMS320C64x architecture.
IEEE Transactions on Circuits and Systems for Video Technology | 2007
Oscal T.-C. Chen; Li-Hsun Chen; Nai-Wei Lin; Chih-Chang Chen
A novel mechanism that flexibly adapts data flows and configures computational units is proposed to establish an application-specific data path in the digital signal processor (DSP) that can efficiently perform multistandard video codecs. Based on this mechanism, the proposed application-specific data path, using the very long instruction word (VLIW) architecture with eight computational units of five arithmetic logic units (ALUs), one multiplier and two load/store units, is designed to perform five adaptive operations according to the characteristics of the low-level functions of MPEG-2, MPEG-4 and H.264/AVC video codecs. Using these adaptive operations, the proposed application-specific data path reduces the number of clock cycles required by the TI TMS320C64x data path to perform the low-level functions of the MPEG-2 video encoder and the H.264/AVC video decoder by 23.10% and 28.43%, respectively, for 30 352×288-pixel Foreman frames. Additionally, considering the operating frequency, the proposed application-specific data path reduces the computation time required by the TI TMS320C64x data path to realize the abovementioned encoder and decoder by 19.86% and 25.41%, respectively. Based on the TSMC 0.18-μm CMOS cell library, the proposed application-specific data path is implemented, and exhibits the highest ratio of computational power to hardware cost among all of the data paths associated with the conventional DSPs in implementing the low-level functions of video codecs.
International Symposium on Circuits and Systems | 2005
Li-Hsun Chen; Oscal T.-C. Chen
In this work, by using input-data and tap folding, a hardware-efficient FIR architecture is developed with a high folding number. To further minimize the hardware complexity of an FIR architecture using the radix-4 Booth algorithm, 2-bit sub-data are utilized to replace the conventional 3-bit sub-data in order to reduce the number of input sub-data latches required in the input-data flow. Additionally, a tree accumulation with simplified carry-in bit processing is designed to reduce the hardware complexity of the accumulation path. With folding in the input data and tap number of the architecture, and a reduction in hardware complexity for the input sub-data latches and accumulation path, the proposed FIR architecture demonstrates a low hardware complexity. Using the TSMC 0.25 μm CMOS technology, the proposed radix-4 Booth algorithm FIR architecture with 10-bit input data and filter coefficients, accomplishing 128-tap filter operations, not only satisfies the throughput-rate demand of the conventional architectures but also saves 65% to 73% of the area occupied.
International Symposium on Circuits and Systems | 2004
Li-Hsun Chen; Oscal T.-C. Chen; Teng-Yi Wang; Chi-Lung Wang
An adaptive digital signal processor (DSP) is proposed to realize the MPEG-4 video encoder at a high ratio of computation power to hardware cost. First, software analyses are performed on the MPEG-4 video encoder at the function and instruction levels. Results from the function-level analysis show that the motion estimation inside the MPEG-4 video encoder has high computational complexity. As for the instruction-level analysis of the MPEG-4 video encoder without motion estimation, results reveal that a DSP equipped with 5 arithmetic logic units (ALUs), 1 multiplier and 2 load/store units has a higher computation performance than the other architectures with 8 functional units. Furthermore, an adaptive mechanism can be incorporated into these functional units. The suggestion is to group 4 ALUs and a multiplier into a special functional unit that exclusively processes motion estimation. Because a multiplier can be constructed from multiple adders, the adaptable structure is also designed so that the numbers of adders and multipliers can be changed dynamically to increase the parallelism of local instructions, especially in the discrete cosine transform and inverse discrete cosine transform. Compared with conventional DSPs, the proposed adaptive DSP shows the best ratio of computation power to hardware cost in realizing the MPEG-4 video encoder.
EURASIP Journal on Advances in Signal Processing | 2007
Oscal T.-C. Chen; Li-Hsun Chen
Advances in nanoelectronic fabrication have enabled integrated circuits to operate at a high frequency. The finite impulse response (FIR) filter needs only to meet real-time demand. Accordingly, increasing the FIR architecture's folding number can take advantage of the high-frequency operation and reduce the hardware complexity, while continuing to allow applications to operate in real time. In this work, a folding scheme that integrates input-data and tap folding is proposed to develop a hardware-efficient programmable FIR architecture. With the use of the radix-4 Booth algorithm, the 2-bit input subdata approach replaces the conventional 3-bit input subdata approach to reduce the number of latches required to store input subdata in the proposed FIR architecture. Additionally, the tree accumulation approach with simplified carry-in bit processing is developed to minimize the hardware complexity of the accumulation path. With folding in input data and taps, and a reduction in hardware complexity of the input subdata latches and accumulation path, the proposed FIR architecture is demonstrated to have a low hardware complexity. Using the TSMC 0.18 μm CMOS technology, the proposed FIR processor with 10-bit input data and filter coefficients implements a 128-tap FIR filter, which takes an area of 0.45, and yields a throughput rate of 20 M samples per second at 200 MHz. Compared with the conventional FIR processors, the proposed programmable FIR processor not only meets the throughput-rate demand but also has the lowest area occupied per tap.
International Symposium on Circuits and Systems | 2003
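The trade-off described above, running the clock faster than the sample rate so that fewer hardware units can be time-shared, can be sketched numerically. This is a generic tap-folding illustration and not the combined input-data and tap folding scheme of the paper: `folded_fir_output` and `throughput` are hypothetical names, and the folding number of 10 is an assumption inferred from the quoted 200 MHz clock and 20 M samples per second.

```python
import math


def folded_fir_output(window, coeffs, folding):
    """Compute one FIR output with time-shared multipliers.

    window[k] holds x[n-k]. Only ceil(taps / folding) multipliers are
    assumed to exist; they are reused once per clock cycle.
    Returns (y[n], cycles used, multipliers used).
    """
    taps = len(coeffs)
    units = math.ceil(taps / folding)  # multipliers available per cycle
    acc = 0
    cycles = 0
    for start in range(0, taps, units):
        # One clock cycle: each multiplier handles one tap of this group.
        for k in range(start, min(start + units, taps)):
            acc += coeffs[k] * window[k]
        cycles += 1
    return acc, cycles, units


def throughput(clock_hz, folding):
    """Output samples per second when one output takes `folding` cycles."""
    return clock_hz / folding
```

Under these assumptions, a 128-tap filter folded by 10 would need 13 time-shared multipliers instead of 128 and would finish one output in 10 cycles, and `throughput(200e6, 10)` gives 20 million samples per second, which is consistent with the figures quoted in the abstract.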
Li-Hsun Chen; Oscal T.-C. Chen; Ruey-Ling Ma
In this work, a high-efficiency reconfigurable digital signal processor (DSP) that consists of two arithmetic logic units and a reconfigurable computation unit is designed. The design methodology for the reconfigurable computation unit is explored based on the intermediate-grain framework. The proposed reconfigurable computation unit includes an 8×8 array of processing elements and interconnection paths, where each processing element is based on two 8-bit ripple adders and simple logic gates. This reconfigurable computation unit can be configured to perform special operations such as two 16×16-bit multiplications, sixteen 32-bit additions/subtractions, one 16-bit dot product and sixteen 8-bit absolute operations, which use these 64 processing elements in different connection topologies to increase their usage rates. In the benchmark analyses, the 8×8-pixel motion estimation and 8×8-pixel discrete cosine transform are realized in the proposed reconfigurable DSP, the TI TMS320C64 and MorphoSys. Additionally, a comparison of computation performance and hardware cost shows that the proposed reconfigurable DSP achieves a higher computation performance at a low hardware cost. Therefore, the reconfigurable DSP proposed herein can achieve high-efficiency computing for various multimedia applications.
International Symposium on Circuits and Systems | 2001
Li-Hsun Chen; Oscal T.-C. Chen
A low-complexity and high-speed transposed direct-form finite-impulse-response (FIR) architecture is developed based on the radix-4 Booth algorithm. It includes a pre-processing unit, input sub-data latches, a control unit, Booth decoders, filter coefficient registers, an accumulation path and a post-processing unit. To decrease hardware complexity, the pre-processing unit, input sub-data latches and Booth decoders are designed around a 2-bit word length for the sub-data latches instead of the conventional 3-bit one. In addition, the accumulation path, built from carry-save adders, uses an addition delay scheme to minimize the numbers of half adders and latches compared with the conventional Booth-algorithm FIR architecture. The proposed FIR architecture saves more hardware complexity as the word length of the input data and the tap number of the FIR increase. For example, for an 8-tap FIR with 8-bit input data and a 256-tap FIR with 16-bit input data, the proposed FIR architecture saves about 11% and 25% of hardware complexity, respectively.
Midwest Symposium on Circuits and Systems | 2007
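For readers unfamiliar with the transposed direct form named above, the following behavioral sketch shows its data flow: every input sample is multiplied by all coefficients at once, and partial sums travel through a register chain toward the output. It models only the generic filter structure, not the Booth decoders, sub-data latches, or carry-save accumulation path of the proposed architecture; `fir_transposed` is a hypothetical name.

```python
def fir_transposed(samples, coeffs):
    """Transposed direct-form FIR: y[n] = sum_k coeffs[k] * x[n-k]."""
    n = len(coeffs)
    state = [0] * (n - 1)  # registers sitting between the adders
    out = []
    for x in samples:
        # The same input sample is broadcast to every coefficient multiplier.
        products = [c * x for c in coeffs]
        out.append(products[0] + (state[0] if state else 0))
        # Shift partial sums one register closer to the output.
        # Ascending order reads state[i + 1] before it is overwritten.
        for i in range(n - 2):
            state[i] = products[i + 1] + state[i + 1]
        if state:
            state[-1] = products[-1]
    return out
```

Feeding a unit impulse, for example `fir_transposed([1, 0, 0, 0], [1, 2, 3])`, returns the coefficients followed by zero, `[1, 2, 3, 0]`, which is a quick way to check the register chain.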
Li-Hsun Chen; Oscal T.-C. Chen
This work proposes an architecture for the H.264/AVC video decoder in which each functional unit is modularly pipelined and optimized to reduce its hardware complexity. The local buffers are allocated to expedite data communication and to minimize data access from external memory, thereby raising computation efficiency and lowering power consumption. Using the cell library of the TSMC 0.25 μm CMOS technology, the proposed hardware core of the H.264/AVC video decoder with a die size of 12.86 μm² consumes 217.2 mW at 2.5 V and 27 MHz to yield a decoding throughput rate of 30 CIF frames per second. Compared with the conventional H.264/AVC video decoder, the proposed video decoder requires less power and hardware cost.
Signal Processing Systems | 2006
Li-Hsun Chen; Oscal T.-C. Chen
Based on the radix-4 Booth algorithm, a scheme that integrates tap folding and coefficient folding is proposed to design a programmable finite impulse response (FIR) architecture with low power dissipation. In addition, without increasing hardware complexity or degrading computational performance, effective selection of the input data is realized to lower the operating frequencies of the latches and multiplexers involved with the input data. By reducing the frequency at which input data are selected for the Booth decoders, the power consumed in the Booth decoders is also minimized. The proposed and conventional FIR architectures are implemented using the TSMC 0.18 μm CMOS technology. The areas and power consumption of these architectures are analyzed and compared. Under the same specifications and throughput rate, the results reveal that, in comparison with the conventional architectures, the proposed FIR architecture not only saves about 18.18% to 39.19% of the area occupied but also reduces power consumption by 14.23% to 25.56%.