Nelson Yen-Chung Chang
National Chiao Tung University
Publication
Featured research published by Nelson Yen-Chung Chang.
IEEE Transactions on Circuits and Systems for Video Technology | 2010
Nelson Yen-Chung Chang; Tsung-Hsien Tsai; Bo-Hsiung Hsu; Yi-Chun Chen; Tian-Sheuan Chang
A high-performance real-time stereo vision system is crucial to various stereo vision applications, such as robotics, autonomous vehicles, multiview video coding, freeview TV, and 3-D video conferencing. In this paper, we proposed a high-performance, hardware-friendly disparity estimation algorithm called mini-census adaptive support weight (MCADSW), together with its corresponding real-time very large scale integration (VLSI) architecture. To make the MCADSW algorithm hardware-friendly, we proposed simplification techniques such as using mini-census, removing the proximity weight, using YUV color representation, using Manhattan color distance, and using scale-and-truncate weight approximation. After applying these simplifications, the MCADSW algorithm was not only hardware-friendly but also 1.63 times faster. In the corresponding real-time VLSI architecture, we proposed partial column reuse and access reduction with an expanded window to significantly reduce the bandwidth requirement. The proposed architecture was implemented using United Microelectronics Corporation (UMC) 90 nm complementary metal-oxide-semiconductor technology and achieves a disparity estimation frame rate of 42 frames/s for common intermediate format (CIF) size images when clocked at 95 MHz. The synthesized gate count and memory size are 563 k and 21.3 kB, respectively.
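As a rough illustration of the census-style matching cost that the abstract builds on, the sketch below encodes a few neighbours of each pixel as a bit vector and compares two pixels by Hamming distance. The sampling pattern `OFFSETS` and all function names are hypothetical; the abstract does not give the actual mini-census pattern, and the adaptive support weight aggregation is not shown.

```python
# Hypothetical sparse sampling pattern: (row, column) offsets from the centre pixel.
# This is an assumption for illustration, not the pattern used in the paper.
OFFSETS = [(-2, 0), (-1, -1), (-1, 1), (1, -1), (1, 1), (2, 0)]


def census_bits(img, r, c, offsets=OFFSETS):
    """Pack one bit per sampled neighbour: 1 if the neighbour is darker than the centre."""
    centre = img[r][c]
    bits = 0
    for dr, dc in offsets:
        bits = (bits << 1) | (1 if img[r + dr][c + dc] < centre else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two census vectors differ."""
    return bin(a ^ b).count("1")


def matching_cost(left, right, r, c, d):
    """Cost of matching left pixel (r, c) against right pixel (r, c - d)."""
    return hamming(census_bits(left, r, c), census_bits(right, r, c - d))
```

Images here are plain lists of rows of luma values, and the caller is expected to keep `(r, c)` far enough from the border that every offset stays inside the image. A disparity search would evaluate `matching_cost` for each candidate `d` and keep the smallest aggregated cost.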
International Conference on Multimedia and Expo | 2007
Nelson Yen-Chung Chang; Ting-Min Lin; Tsung-Hsien Tsai; Yu-Cheng Tseng; Tian-Sheuan Chang
A real-time DSP stereo matching solution is important to various applications relying on stereo vision. We proposed a 4×5 jigsaw matching template and a dual-block parallel processing technique to enhance the performance of VLIW DSP stereo matchers. The 4×5 jigsaw template improves the matching quality by 1% compared with a regular 4×5 block template while consuming the same amount of memory access bandwidth. Along with the benefit of the jigsaw template, the dual-block parallel processing technique, which doubles the throughput, can be implemented on the DSP. Together with instruction scheduling and operation pipelining, our DSP stereo matcher can achieve 50 FPS with 16 disparity levels for a 384×288 stereo image pair. Both quantitative and qualitative stereo matching results are provided at the end of this work.
International Conference on Multimedia and Expo | 2007
Yu-Cheng Tseng; Nelson Yen-Chung Chang; Tian-Sheuan Chang
Typical belief propagation has good accuracy for stereo correspondence but suffers from a large run-time memory cost. In this paper, we propose a block-based belief propagation algorithm for stereo correspondence that partitions an image into regular blocks for optimization. With independently partitioned blocks and a 32×32 block size, the required memory size can be reduced by 99%, with slightly degraded performance compared with the original algorithm. In addition, such blocks are suitable for parallel hardware implementation. Experimental results on the Middlebury stereo test bed demonstrate the performance of the proposed method.
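A back-of-the-envelope calculation shows why processing one block at a time can cut message memory by roughly 99%. The sketch assumes message storage scales with pixels × disparity levels × directions, and that only one block's messages are resident at a time. The 384×288 image size, the 16 disparity levels, and the function name are illustrative assumptions, not figures from the abstract.

```python
def bp_message_memory(width, height, disparities, directions=4, bytes_per_value=1):
    """Bytes needed to hold one message per direction, per disparity level, per pixel."""
    return width * height * disparities * directions * bytes_per_value


# Assumed image size and disparity range (hypothetical, for illustration only).
full = bp_message_memory(384, 288, 16)

# One 32x32 block optimised independently, so only its messages are stored.
block = bp_message_memory(32, 32, 16)

# Fraction of message memory saved by the block-based scheme under these assumptions.
reduction = 1 - block / full
print(f"full image: {full} bytes, one block: {block} bytes, saved: {reduction:.2%}")
```

Under these assumptions the ratio depends only on the pixel counts (1024 versus 110592), so the saving is the same for any disparity range or message precision.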
IEEE Transactions on Circuits and Systems for Video Technology | 2008
Hui-Cheng Hsu; Kun-Bin Lee; Nelson Yen-Chung Chang; Tian-Sheuan Chang
This paper presents efficient VLSI architectures of the shape-adaptive discrete cosine transform (SA-DCT) and its inverse transform (SA-IDCT) for MPEG-4. Two of the challenges encountered in the search for more efficient SA-DCT and SA-IDCT architectures are addressed. One challenge is to handle the architectural irregularity caused by the shape-adaptive nature of the transforms. The other is to provide acceptable throughput using minimal hardware. In the algorithm-level optimization, this work exploits the numerical properties found in the transform matrices of various lengths and derives a fine-grained zero-skipping scheme for the IDCT that can perform 22.6% more zero-skipping than the common vector-based coarse-grained zero-skipping scheme. In the architecture-level design, 1-D variable-length DCT/IDCT architectures based on these numerical properties are proposed. An auto-aligned transpose memory that aligns data of different lengths is also incorporated. In addition, a zero-index table is included in the transpose memory to support the fine-grained zero-skipping in the SA-IDCT. The synthesized designs of the SA-DCT and SA-IDCT are implemented using UMC 0.18-μm technology. The SA-DCT architecture has 26 635 gates, and its average cycle-throughput is 0.66 pixels/cycle, which is comparable to other proposed architectures. The SA-IDCT architecture has 29 960 gates, and its cycle-throughput is 6.42 pixels/cycle. While decoding CIF at 30 FPS, the SA-IDCT is clocked at 0.7 MHz, and the power consumption is 0.14 mW. Both the throughput and power consumption of the proposed SA-IDCT architecture are an order of magnitude better than those of the existing SA-IDCT architectures.
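The idea of skipping work for individual zero coefficients in a variable-length inverse DCT can be sketched in software as follows. This is a generic, direct-form orthonormal inverse DCT-II written for illustration. It is not the paper's architecture or its exact zero-skipping scheme, and the function name and operation counter are assumptions.

```python
import math


def idct_zero_skip(coeffs):
    """Variable-length 1-D inverse DCT (inverse of the orthonormal DCT-II).

    Every zero coefficient is skipped individually. Returns the reconstructed
    samples and the number of multiply-accumulate steps actually performed.
    """
    n = len(coeffs)
    out = [0.0] * n
    macs = 0
    for k, ck in enumerate(coeffs):
        if ck == 0:
            continue  # fine-grained skip: a zero coefficient contributes nothing
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        for i in range(n):
            out[i] += scale * ck * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            macs += 1
    return out, macs


# A length-5 vector with only the DC coefficient set needs 5 steps instead of 25.
samples, macs = idct_zero_skip([4.0, 0, 0, 0, 0])
```

A coarse-grained scheme would skip only when an entire coefficient vector is zero, so the per-coefficient test above finds additional skippable work whenever a vector is partly zero.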
International Conference on Consumer Electronics | 2003
Kun-Bin Lee; Hao-Yun Chin; Nelson Yen-Chung Chang; Hui-Cheng Hsu; Chein-Wei Jen
An optimal frame memory and data transfer scheme is proposed for MPEG-4 shape coding in embedded systems. The proposed alpha frame buffer scheme contains two approaches. First, a distributed tile-based memory organization is used to efficiently support the time-varying size of the alpha plane. Second, a compression scheme is used to reduce both the number of accesses to the alpha frame memory and its size. Under the criteria of the MPEG-4 standard, the size of the alpha frame memory can be reduced to 50% by introducing a small index table (2.73%-5.08% of the original frame memory size). A coarse assessment shows that the number of memory references can be reduced to 56.25%. In addition, the proposed data transfer scheme combines run-length coding and addressing mode to reduce the average data transfer time to 9.39%. The shared system bus can therefore be kept as free as possible, which in turn leaves more room for improving system performance. Furthermore, this data transfer scheme also helps accelerate the processing of shape coding.
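Run-length coding suits binary alpha data because rows tend to consist of long runs of transparent or opaque pixels. The sketch below is a plain run-length encoder and decoder for one row. It illustrates the general technique only, not the paper's transfer scheme or its addressing mode, and the function names are hypothetical.

```python
def rle_encode(row):
    """Collapse a row of pixel values into (value, run_length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original row."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row


# A 16-pixel alpha row with one opaque span becomes three runs.
alpha_row = [0] * 5 + [255] * 8 + [0] * 3
encoded = rle_encode(alpha_row)
```

Here 16 pixel values are transferred as three pairs, and `rle_decode(encoded)` restores the row exactly. The actual saving depends on how fragmented the object boundary is.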
IET Computers and Digital Techniques | 2009
Nelson Yen-Chung Chang; Ying-Ze Liao; Tian-Sheuan Chang
Shared-link AXI provides decent communication performance and requires half the cost of its crossbar counterpart. The authors analysed the performance impact of several factors in a shared-link AXI system: interface buffer size, arbitration combination, and task access setting (transfer mode mapping). A hybrid data locked transfer mode was also proposed to improve performance, which suffers from AXI's extra transition cycle. The analysis was carried out by simulating a multi-core platform with a shared-link AXI backbone running a video phone application. The performance was evaluated in terms of bandwidth utilisation, average transaction latency, and system task completion time. The analysis showed that channel-independent arbitration could contribute up to a 23.2% difference in bandwidth utilisation and completion time. Moreover, the analysis suggests that the proposed hybrid data locked mode should be used only by devices with long access latency. Such a setting resulted in up to a 21.1% reduction in completion time compared with the setting without the hybrid data locked mode. The design options in a shared-link AXI bus are also discussed.
Asia Pacific Conference on Circuits and Systems | 2008
Nelson Yen-Chung Chang; Yu-Cheng Tseng; Tian Sheuan Chang
The impact of color space and similarity measure on the complexity, speed, and performance of stereo matching is especially important to applications adopting stereo vision. This work analyzes the complexity of several of the most commonly considered color spaces and similarity measures. In addition, the execution speed and performance of combinations of color space and similarity measure are compared on the same basis. The comparison suggests that the Y-only rank combination provides the best trade-off between speed and performance.
IEEE Transactions on Circuits and Systems for Video Technology | 2006
Nelson Yen-Chung Chang; Tian-Sheuan Chang
The frame memory has long been the dominant component in a video decoder in terms of energy, area, and latency. We proposed a combined frame memory motion compensation (CFMMC) for video decoding, which exploits the characteristics of the perfect-matched macroblock (MB) to avoid unnecessary memory access and to save energy. The statistical results confirm that, in some sequences, more than 70% of MBs are perfect-matched MBs. The CFMMC hardware architecture is further evaluated for latency, area, and energy. The evaluation shows that with SRAM-based frame memory, the equivalent gate count can be reduced by 37.7%, and the energy consumption and latency may also be improved for sequences with a sufficient percentage of perfect-matched MBs. Since the benefit of the CFMMC is highly dependent on the percentage of perfect-matched MBs, it is best suited for applications with a large portion of static background, such as video surveillance, video telephony, and video conferencing.
International Symposium on Circuits and Systems | 2008
Tsung Hsien Tsai; Nelson Yen-Chung Chang; Tian Sheuan Chang
External memory bandwidth and internal memory size have been major bottlenecks in designing VLSI architectures for real-time stereo matching hardware because of the large amount of pixel data and the large disparity range. To address these bottlenecks, this work explores the impact of data reuse in disparity order and pixel order, along with the partial column reuse (PCR) and vertically expanded row reuse (VERR) techniques we proposed. The analysis suggests that disparity-order reuse with both PCR and VERR is suitable for designs with low memory cost and low external bandwidth, whereas pixel-order reuse with both techniques is more suitable for designs with low computation resource requirements.
International Symposium on Circuits and Systems | 2004
Nelson Yen-Chung Chang; Kun-Bin Lee; Chien-Wei Jen
High-level performance estimation can help efficiently assess performance in the early development stage of embedded systems with real-time constraints; however, the estimation accuracy has long been an issue. In this paper, a performance estimation method called trace-path analysis is proposed. It is based on high-level execution path tracing and takes the effect of multiple execution paths into consideration. Multiple execution path resolutions and a linear model for non-deterministic node cost are incorporated to yield better estimation accuracy. The estimation of error is also covered so that the degree of estimation accuracy can be known. Experiments taking MPEG-4 shape coding as an example show that the proposed approach can achieve an average estimation error of 1.88% per QCIF frame, which is better than the 12.38% of the bitstream analysis approach.