Lien-Fei Chen
National Chung Hsing University
Publication
Featured research published by Lien-Fei Chen.
IEEE Transactions on Consumer Electronics | 2009
Yeong-Kang Lai; Lien-Fei Chen; Yui-Chih Shih
In this paper, we present a high-performance, memory-efficient pipelined architecture with a parallel scanning method for the 2-D lifting-based DWT in JPEG2000 applications. The proposed 2-D DWT architecture is composed of two 1-D DWT cores and a 2×2 transposing register array. The proposed 1-D DWT core consumes two input samples and produces two output coefficients per cycle, and its critical path is only one multiplier delay. Moreover, we use the parallel scanning method instead of the line-based scanning method to reduce the internal buffer size. For an N×N tile image with one-level 2-D DWT decomposition, the 9/7 filter requires only 4N temporal memory and the 2×2 register array to store the intermediate coefficients in the column 1-D DWT core, and the column-processed data can be rearranged in the transposing array. According to the comparison results, the hardware cost of the 1-D DWT core and the internal memory requirements of the proposed 2-D DWT architecture are smaller than those of other familiar architectures at the same throughput rate. The implementation results show that the proposed 2-D DWT architecture can process 1080p HDTV pictures with five-level decomposition at 30 frames/sec.
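As a rough software illustration of the lifting scheme this abstract builds on, the sketch below runs one level of the 1-D 9/7 lifting DWT on an even-length signal. It is not the authors' pipelined architecture. The function names, the mirrored boundary handling, and the final scaling convention (low-pass multiplied by K, high-pass divided by K) are assumptions made for the example; the lifting constants are the commonly quoted values for the 9/7 filter.

```python
# Commonly quoted lifting constants for the 9/7 filter.
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.149604398  # scaling convention is an assumption of this sketch


def _predict(s, d, c):
    """d[i] += c * (s[i] + s[i+1]); the right neighbour is mirrored at the edge."""
    n = len(d)
    return [d[i] + c * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]


def _update(s, d, c):
    """s[i] += c * (d[i-1] + d[i]); the left neighbour is mirrored at the edge."""
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]


def dwt97_forward(x):
    """One decomposition level; returns (low-pass, high-pass) coefficient lists."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("this sketch expects an even-length signal")
    s, d = list(x[0::2]), list(x[1::2])  # split into even and odd samples
    d = _predict(s, d, ALPHA)
    s = _update(s, d, BETA)
    d = _predict(s, d, GAMMA)
    s = _update(s, d, DELTA)
    return [v * K for v in s], [v / K for v in d]


def dwt97_inverse(low, high):
    """Undo dwt97_forward by running the lifting steps backwards with negated constants."""
    s = [v / K for v in low]
    d = [v * K for v in high]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2], x[1::2] = s, d  # interleave even and odd samples again
    return x
```

Each lifting step multiplies a sum of two neighbours by a single constant, which is the property a hardware design can exploit to keep the critical path short. A 2-D transform would apply this routine along rows and then along columns.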
International Symposium on Circuits and Systems | 2008
Chong-Yu Huang; Lien-Fei Chen; Yeong-Kang Lai
In this paper, a high-speed two-dimensional (2-D) transform architecture with a unique kernel for multi-standard video applications is proposed. On the basis of the new distributed arithmetic algorithm (NEDA), we use the recursive discrete cosine transform (DCT) algorithm to reduce the computational complexity of NEDA and propose the unique kernel framework to unify the adder matrix design for different coefficient requirements. Owing to the proposed unique kernel framework, the adder matrices, which have only 13 adders, are independent of the coefficients of the 2-D transform. Therefore, many 2-D transform matrices can be easily realized via the proposed unique kernel and an efficient routing network to meet multi-standard video coding requirements.
IEEE Transactions on Consumer Electronics | 2010
Yeong-Kang Lai; Lien-Fei Chen; Shien-Yu Huang
In this paper, a hybrid parallel motion estimation architecture based on the fast top-winners algorithm is proposed. First, the fast top-winners search algorithm is discussed; it uses the pel-subsampling technique to reduce the computational cost of the sum of absolute differences (SAD). Moreover, four-parallel spiral scanning (4PSP) with the partial distortion elimination (PDE) mechanism is used to terminate unnecessary SAD computations early. Therefore, the proposed fast algorithm not only avoids being trapped in a local minimum but also saves computational operations with little performance degradation. Based on the proposed algorithm, a 4×4 processing element (PE) array and a dual-mode SAD tree are proposed to efficiently compute the SAD and the Sub-SAD, which is accumulated from the pel-subsampled pixels. To reduce the system memory bandwidth and the frequency of memory accesses, a local memory configuration and a novel memory interleaving organization are proposed to arrange the current data and reference pixels easily, to access the image pixels efficiently, and to achieve the Level C (Lv. C) data reuse scheme.
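The two cost-saving ideas named in this abstract, pel-subsampling and partial distortion elimination, can be sketched in a few lines of software. This is only an illustration under assumed conventions: blocks are lists of rows, the subsampling pattern is a regular grid with a hypothetical `step` parameter, and the function names are invented for the example. The top-winners search and the 4PSP scan order are not modelled.

```python
def sub_sad(cur, ref, step=2):
    """SAD over a regular subsampled grid (every `step`-th pixel in x and y)."""
    return sum(
        abs(cur[y][x] - ref[y][x])
        for y in range(0, len(cur), step)
        for x in range(0, len(cur[0]), step)
    )


def sad_with_pde(cur, ref, best_so_far):
    """Row-by-row SAD with partial distortion elimination.

    Returns the full SAD, or None as soon as the partial sum can no longer
    beat `best_so_far`, so the remaining rows are never evaluated.
    """
    total = 0
    for cur_row, ref_row in zip(cur, ref):
        total += sum(abs(c - r) for c, r in zip(cur_row, ref_row))
        if total >= best_so_far:
            return None  # candidate eliminated early
    return total
```

A search loop would typically rank candidates by the cheap `sub_sad` first and then evaluate only the most promising ones with `sad_with_pde`, passing the best cost found so far.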
International Symposium on Circuits and Systems | 2004
Yeong-Kang Lai; Li-Chung Chang; Lien-Fei Chen; Chih-Chung Chou; Chun-Wei Chiu
In this paper, we present a novel fast S-box algorithm that does not use a lookup table, and a novel fast optional hardware architecture for the MixColumn and Inverse MixColumn module with a delay of only 5 XOR gates. We use an on-the-fly key schedule architecture for both encryption and decryption. Furthermore, we implement a memoryless AES cipher with the proposed S-box architecture and the fast MixColumn and Inverse MixColumn module, adopting a pipeline method to obtain a throughput as high as 1.454 Gbits/sec at 125 MHz using 0.25 μm CMOS technology; the hardware cost is about 80 K gate counts. To our knowledge, our hardware architecture is the first memoryless AES cipher that includes both encryption and decryption functions.
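To make "S-box without a lookup table" concrete, the sketch below computes the AES S-box arithmetically (inversion in GF(2^8) followed by the affine transform) and applies MixColumn to one column. It follows the textbook definition of AES, not the fast algorithm or the 5-XOR-delay circuit of the paper; the helper names and the slow repeated-multiplication inverse are choices made for readability.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse as a^254; maps 0 to 0, as AES requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(a):
    """AES S-box computed on the fly: field inversion, then the affine map."""
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def mix_column(col):
    """AES MixColumn on one 4-byte column (multiplication by the 2-3-1-1 matrix)."""
    return [
        gf_mul(col[i], 2) ^ gf_mul(col[(i + 1) % 4], 3)
        ^ col[(i + 2) % 4] ^ col[(i + 3) % 4]
        for i in range(4)
    ]
```

For instance, `sbox(0x53)` evaluates to `0xED`, and `mix_column([0xDB, 0x13, 0x53, 0x45])` yields `[0x8E, 0x4D, 0xA1, 0xBC]`. A hardware design would replace the 254-step inverse with a much shallower circuit; the loop here only demonstrates that no table is needed.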
International Symposium on Circuits and Systems | 2004
Lien-Fei Chen; Yeong-Kang Lai
In this paper, a novel reconfigurable computing engine for digital signal processing applications is proposed. The kernel component of the reconfigurable computing (RC) engine is the general-purpose processing cluster (GPPC) array, which is built from GPPCs and operates as an MIMD model to achieve high flexibility in mapping applications and algorithms to the RC engine. Each GPPC performs data-parallel operations efficiently using SIMD instructions. Therefore, a GPPC can not only execute 32-bit operations but also perform 4-way 8-bit operations or 2-way 16-bit operations simultaneously. For efficient connectivity, an inter-GPPC-row reconfigurable network is also proposed to meet the requirements of high flexibility, low complexity, small area, and short network delay.
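The idea of one 32-bit datapath serving as four 8-bit lanes or two 16-bit lanes can be illustrated with a packed addition in plain software. The sketch is an assumption-laden illustration, not the GPPC instruction set: the function names are invented, and the lane-wise wraparound behaviour is simply one plausible choice.

```python
MASK32 = 0xFFFFFFFF


def _packed_add(a, b, high_bits):
    """Lane-wise add of two 32-bit words; `high_bits` marks each lane's top bit.

    The low bits of every lane are added normally. The top bit of each lane is
    then combined with XOR, so a carry never crosses into the neighbouring lane.
    """
    low_bits = ~high_bits & MASK32
    partial = (a & low_bits) + (b & low_bits)
    return (partial ^ ((a ^ b) & high_bits)) & MASK32


def add8x4(a, b):
    """Four independent 8-bit additions packed in one 32-bit word (wraps mod 256)."""
    return _packed_add(a, b, 0x80808080)


def add16x2(a, b):
    """Two independent 16-bit additions packed in one 32-bit word (wraps mod 65536)."""
    return _packed_add(a, b, 0x80008000)
```

For example, `add8x4(0x01FF7F10, 0x01010101)` gives `0x02008011`: the `0xFF` lane wraps to `0x00` without disturbing the lane above it.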
International Symposium on Circuits and Systems | 2004
Yeong-Kang Lai; Lien-Fei Chen
A performance-driven configurable motion estimator for the full search block-matching algorithm (FSBMA) is described in this paper. Because different video applications have different performance requirements, we design a scalable and configurable architecture that meets the system requirements for function, throughput rate, and external memory bandwidth. The proposed architecture achieves the performance-driven requirements by configuring the following parameters: current block size (N), search range (p), and multiple-slice processing (m).
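For reference, the full search block-matching algorithm itself is simple to state in software. The sketch below is parameterised by the block size `n` and search range `p` from the abstract; the multiple-slice parameter m is a hardware scheduling concern and is not modelled. The function names, the frame layout (lists of rows), and the rule of skipping candidates that fall outside the reference frame are assumptions of this example.

```python
def block_sad(cur, ref, cx, cy, rx, ry, n):
    """SAD between the n x n block of `cur` at (cx, cy) and of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, p):
    """Exhaustively test every displacement in [-p, p] x [-p, p].

    Returns (best_sad, dx, dy) for the block whose top-left corner is (bx, by),
    or None if no candidate lies inside the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block leaves the reference frame
            cost = block_sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The loop evaluates up to (2p + 1)^2 candidates of n^2 pixel differences each, which is why the throughput and memory bandwidth of a hardware estimator depend so directly on N and p.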
IEEE Transactions on Consumer Electronics | 2010
Yeong-Kang Lai; Lien-Fei Chen; Wei-Che Chiou
In this paper, a memory interleaving and interlacing VLSI architecture for the deblocking filter in H.264/AVC is proposed. The literature and the results of chip implementations show that the memory organization dominates the hardware cost, the throughput rate, and the external memory bandwidth of the deblocking filter. Hence, we also discuss three different levels of data-reuse scheme for the deblocking filter in this paper. To increase the throughput, we propose memory interleaving techniques that arrange data in the on-chip memory so that both the horizontal and vertical filters can access it efficiently. We also use a hybrid schedule for the 2-D processing order and a memory interlacing configuration to reduce the total on-chip memory size and to accomplish the Level B data-reuse scheme. With the proposed memory interleaving organization, memory interlacing configuration, and hybrid schedule of the 2-D processing order, our architecture uses only half of the traditional memory size while boosting the throughput and reducing the bus memory bandwidth.
International Conference on Acoustics, Speech, and Signal Processing | 2009
Lien-Fei Chen; Shin-Ping Yang; Yeong-Kang Lai
Inter prediction is the most power-consuming component in an H.264/AVC encoder. For a power-aware design, it is necessary to reduce the power consumption by using fast inter prediction techniques. In this paper, a system-level power-aware algorithm based on an early termination scheme for H.264/AVC inter prediction is proposed. The proposed early termination scheme for H.264 motion estimation is based on statistical modeling of the motion-compensated residual data. We develop a power-aware adaptive mechanism with multiple thresholds derived from the statistical model to terminate the motion estimation (ME) early. According to the experimental results, our early termination scheme for H.264/AVC inter prediction not only preserves fine RD performance but also eliminates unnecessary operations in both the IME stage and the FME stage to realize a power-aware H.264 encoder system.
International Conference on Multimedia and Expo | 2008
Lien-Fei Chen; Kun-Hsing Li; Chong-Yu Huang; Yeong-Kang Lai
Intra mode decision is a significant technique for improving the coding performance of the H.264/AVC intra frame coder. Many intra frame coding architectures have been proposed that meet high performance requirements by using the sum of absolute integer transformed differences (SAITD) mode decision instead of the sum of absolute transformed differences (SATD) mode decision. In this paper, we analyze the RD performance of the SAITD model and the SATD model, and validate that the SATD-based mode decision is better than the SAITD model. In light of this result, an efficient multi-transform architecture with a unique kernel is proposed to derive the coefficients of the integer transform and the Hadamard transform simultaneously. Moreover, the proposed architecture can be configured to perform all 4×4 transforms and the 2×2 transform for luma and chroma coding. By using our multi-transform architecture, an H.264/AVC intra frame coder with the interleaved I4MB/I16MB prediction schedule can be realized based on the SATD model.
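The SATD cost compared in this abstract can be written down directly: take the 4×4 residual, apply a 4×4 Hadamard transform on both sides, and sum the absolute values. The sketch below does exactly that and nothing more. The function names are invented, the result is left unnormalised (reference encoders commonly divide by 2), and the SAITD variant and the multi-transform hardware are not modelled.

```python
# 4x4 Hadamard matrix; every entry is +1 or -1, so no multipliers are needed.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose4(m):
    """Transpose of a 4x4 matrix."""
    return [[m[j][i] for j in range(4)] for i in range(4)]


def satd4x4(cur, pred):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[cur[y][x] - pred[y][x] for x in range(4)] for y in range(4)]
    coeffs = matmul4(matmul4(H4, diff), transpose4(H4))
    return sum(abs(c) for row in coeffs for c in row)
```

A mode decision loop would call `satd4x4` once per candidate prediction mode and keep the mode with the smallest cost.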
International Symposium on Circuits and Systems | 2009
Lien-Fei Chen; Shien-Yu Huang; Chi-Yao Liao; Yeong-Kang Lai
In this paper, a hardware-efficient coarse-to-fine fast algorithm for H.264 motion estimation is proposed. We present a hardware-friendly two-step search flow to obtain the 41 MVs of variable block size motion estimation (VBSME) efficiently. In the first step, the candidate block down-sampling technique and the multi-level successive elimination algorithm (MSEA) with fixed 16×16 block matching are adopted to rapidly find the possible regions. Then, the local full search method with VBSME is applied to these possible regions to calculate the minimum SADs of the 41 MVs in parallel. According to the analysis, the proposed fast algorithm not only has 5% of the computational complexity of the full search block-matching algorithm (FSBMA), but also preserves fine RD performance. According to our hardware evaluation, the proposed fast algorithm can easily achieve the real-time HDTV video coding requirement with a 64 processing element (PE) architecture.
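The elimination test behind MSEA rests on a simple inequality: the sum of absolute differences between sub-block sums is a lower bound on the SAD, and the bound tightens as the sub-blocks get smaller. The sketch below computes that bound for square blocks whose side is a power of two. The function names and the `level` parameter (the block is split into 2^level by 2^level sub-blocks) are assumptions of this example; the down-sampling step and the VBSME stage of the paper are not modelled.

```python
def region_sum(block, x0, y0, size):
    """Sum of the size x size region of `block` with top-left corner (x0, y0)."""
    return sum(block[y0 + j][x0 + i] for j in range(size) for i in range(size))


def msea_bound(cur, cand, level):
    """Lower bound on SAD(cur, cand) using 2^level x 2^level sub-blocks."""
    n = len(cur)
    parts = 2 ** level
    size = n // parts
    return sum(
        abs(region_sum(cur, px * size, py * size, size)
            - region_sum(cand, px * size, py * size, size))
        for py in range(parts)
        for px in range(parts)
    )


def full_sad(cur, cand):
    """Ordinary SAD, for comparison with the bound."""
    return sum(abs(c - r) for cr, rr in zip(cur, cand) for c, r in zip(cr, rr))


def can_skip(cur, cand, best_sad, level):
    """True if the candidate cannot beat `best_sad`, so its SAD need not be computed."""
    return msea_bound(cur, cand, level) >= best_sad
```

In practice the sub-block sums of the reference frame are precomputed once, so evaluating the bound is far cheaper than evaluating the SAD for every candidate.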