Publication


Featured research published by An-Yeu Wu.


IEEE Journal of Solid-state Circuits | 2008

An 8.29 mm² 52 mW Multi-Mode LDPC Decoder Design for Mobile WiMAX System in 0.13 μm CMOS Process

Xin-Yu Shih; Cheng-Zhou Zhan; Cheng-Hung Lin; An-Yeu Wu

This paper presents a multi-mode decoder design for quasi-cyclic LDPC codes for the Mobile WiMAX system. The chip can operate in the 19 modes specified in the Mobile WiMAX system, with block sizes of 576, ..., 2304. Four design techniques are proposed: reordering of the base matrix, overlapped operation of the main computational units, an early termination strategy, and a multi-mode design strategy. With the overlapped decoding mechanism, the decoding latency is reduced to 68.75% of that of the non-overlapped method, and the hardware utilization ratio is raised from 50% to 75%. In addition, the proposed early termination strategy dynamically adjusts the number of iterations for communication channels with different SNR values. The proposed multi-mode LDPC decoder is implemented and fabricated in TSMC 0.13 μm 1.2 V 1P8M CMOS technology. The measured maximum operating frequency is 83.3 MHz, and the corresponding power dissipation is 52 mW. The core size is 4.45 mm², and the die area is only 8.29 mm².


networks on chips | 2010

Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems

Chih-Hao Chao; Kai-Yuan Jheng; Hao-Yu Wang; Jia-Cheng Wu; An-Yeu Wu

Three-dimensional network-on-chip (3D NoC), the combination of NoC and die-stacking 3D IC technology, is motivated by the goals of lower latency, lower power consumption, and higher network bandwidth. However, the length of the heat conduction path and the power density per unit area increase as more dies are stacked vertically. The routers of an NoC have a thermal impact comparable to that of processors and contribute significantly to the overall chip temperature. High temperature makes the system more vulnerable in terms of performance, power, reliability, and cost. To ensure thermal safety while limiting the performance impact of temperature regulation, we propose a traffic- and thermal-aware run-time thermal management (RTM) scheme. The scheme is composed of proactive downward routing and reactive vertical throttling. Based on a validated traffic-thermal mutual-coupling co-simulator, our experiments show that the proposed scheme is effective. The proposed RTM can be combined with thermal-aware mapping techniques and has the potential to provide higher run-time thermal safety.


networks on chips | 2007

A New Binomial Mapping and Optimization Algorithm for Reduced-Complexity Mesh-Based On-Chip Network

Wein-Tsung Shen; Chih-Hao Chao; Yu-Kuang Lien; An-Yeu Wu

This paper presents an efficient binomial IP mapping and optimization algorithm (BMAP) to reduce the hardware cost of on-chip network (OCN) infrastructure. The complexity of BMAP is O(N² log(N)). Based on our OCN system synthesis flow, the proposed algorithm provides a more economical mapping of network components than the traditional OCN mapping algorithm. The experimental results show that the total traffic on the network is reduced by 37% and the average network hop count is reduced by 46%. With further optimization, the hardware efficiency is enhanced, and the total hardware cost of the network infrastructure is therefore reduced to 51%~85%.
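The hop-count metric quoted in this abstract can be illustrated with a small sketch. This is not BMAP itself: it only assumes a 2D mesh with minimal (for example XY) routing, so the hop count between two tiles is their Manhattan distance. The function name, the traffic table, and both placements are hypothetical.

```python
def average_hop_count(traffic, placement):
    """Traffic-weighted average hop count on a 2D mesh.

    traffic:   {(src, dst): volume}, a hypothetical communication table
    placement: {core: (x, y)}, tile coordinates of each IP core
    Assumes minimal routing, so hops = Manhattan distance between tiles.
    """
    total_volume = sum(traffic.values())
    weighted_hops = 0
    for (src, dst), volume in traffic.items():
        (x1, y1), (x2, y2) = placement[src], placement[dst]
        weighted_hops += volume * (abs(x1 - x2) + abs(y1 - y2))
    return weighted_hops / total_volume


# Hypothetical example: core "a" talks mostly to "b", so "b" should sit next to "a".
traffic = {("a", "b"): 3, ("a", "c"): 1}
far = {"a": (0, 0), "b": (1, 1), "c": (0, 1)}
near = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
print(average_hop_count(traffic, far), average_hop_count(traffic, near))
```

Placing heavily communicating cores on adjacent tiles lowers the weighted average, which is the kind of quantity a mapping algorithm tries to minimize.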


IEEE Transactions on Circuits and Systems Ii: Analog and Digital Signal Processing | 2001

Modified vector rotational CORDIC (MVR-CORDIC) algorithm and architecture

Cheng-Shing Wu; An-Yeu Wu

The CORDIC algorithm is a well-known iterative method for the computation of vector rotation. However, its major disadvantage is its relatively slow computational speed. For applications that require forward rotation (or vector rotation) only, we propose a new scheme, the modified vector rotational CORDIC (MVR-CORDIC) algorithm, to improve the speed of the CORDIC algorithm. The basic idea of the proposed scheme is to reduce the number of iterations directly while maintaining the SQNR performance. This is achieved by modifying the basic microrotation procedure of the CORDIC algorithm. Three searching algorithms are suggested to find the corresponding directional and rotational sequences that give the best SQNR performance. Three SQNR refinement schemes are also suggested in this paper: the selective prerotation scheme, the selective scaling scheme, and the iteration-tradeoff scheme. They reduce and balance the quantization errors encountered in both the microrotation and scaling phases so as to further improve the overall SQNR performance. By combining these three refinement schemes, we provide a systematic design flow as well as an optimization procedure for applying the MVR-CORDIC algorithm. Finally, we present two VLSI architectures for the MVR-CORDIC algorithm. The results show that, compared with the conventional CORDIC scheme, the proposed MVR-CORDIC algorithm saves 50% of the execution time in the iterative CORDIC structure, or 50% of the hardware complexity in the parallel CORDIC structure.
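For readers unfamiliar with the baseline that this abstract improves on, here is a minimal floating-point sketch of conventional rotation-mode CORDIC. It is the textbook algorithm, not MVR-CORDIC; the function name and the default iteration count are assumptions, and a hardware version would use fixed-point shifts and adds instead of floating-point multiplications.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians with conventional CORDIC.

    Converges for |angle| up to about 1.74 rad (the sum of all elementary angles).
    """
    # Elementary angles atan(2^-i), one per microrotation.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Every microrotation stretches the vector; accumulate the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    # Scaling phase: undo the accumulated gain.
    return x / gain, y / gain


print(cordic_rotate(1.0, 0.0, math.pi / 6, iterations=24))
```

Each iteration uses every elementary angle exactly once, which is why the iteration count equals the number of angles; schemes such as the one in the abstract aim to reach a comparable accuracy with fewer microrotations.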


IEEE Transactions on Circuits and Systems | 2005

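Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications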

Chih-Hsiu Lin; An-Yeu Wu

The coordinate rotational digital computer (CORDIC) algorithm is a well-known iterative arithmetic for performing vector rotations in many digital signal processing (DSP) applications. However, the large number of iterations is a major disadvantage of this algorithm for its speed performance. Many researchers have proposed schemes to reduce the number of iterations. Nevertheless, in the existing CORDIC algorithms, the norm of the vector is usually enlarged, so extra scaling operations are required to deliver the normalized output. In this paper, we merge the two operation phases (the microrotation and scaling phases) and propose a new vector rotational scheme called the mixed-scaling-rotation coordinate rotational digital computer (MSR-CORDIC) algorithm. It eliminates the overhead of the scaling operations that are inevitable in existing CORDIC algorithms; hence, it can significantly reduce the total number of iterations and so improve the speed performance. The proposed MSR-CORDIC can be applied to DSP applications in which the rotational angles are known in advance [e.g., the twiddle factors in fast Fourier transform (FFT) processor designs]. Moreover, most CORDIC algorithms suffer from roundoff noise in fixed-wordlength implementations. We also propose two schemes to control and reduce this impairment. Our simulation results show that the MSR-CORDIC algorithm can enhance the signal-to-quantization-noise ratio (SQNR) performance by controlling the internal dynamic range. We also investigate the first- and second-order statistical properties, namely the mean and variance of the SQNR. Simulation results show that the MSR-CORDIC can enhance the SQNR performance in terms of both first- and second-order statistical properties. At the VLSI architecture level, we propose a generalized MSR-CORDIC engine for the tradeoff between hardware complexity and quantization error performance. It can further reduce the hardware complexity when compared with the recently proposed extended elementary angle set CORDIC algorithm. The MSR-CORDIC scheme has been applied to a variable-length FFT processor design and results in a significant hardware reduction in implementing the twiddle factor operations.


IEEE Transactions on Circuits and Systems Ii: Analog and Digital Signal Processing | 2003

A high-performance/low-latency vector rotational CORDIC architecture based on extended elementary angle set and trellis-based searching schemes

Cheng-Shing Wu; An-Yeu Wu; Chih-Hsiu Lin

The coordinate rotational digital computer (CORDIC) algorithm is a well-known iterative method for the computation of vector rotation. For applications that require forward rotation (or vector rotation) only, the angle recoding (AR) technique provides a relaxed approach to speed up the operation of the CORDIC algorithm. In this paper, we further apply the concept of the AR technique to extend the elementary angle set in the microrotation phase. This technique is called the extended elementary-angle set (EEAS) scheme. The proposed EEAS scheme provides a more flexible way of decomposing the target rotation angle in the CORDIC operation, and its quantization error performance is better than that of the AR technique. To solve the optimization problem encountered in the EEAS scheme, we also propose a novel search algorithm, called the trellis-based searching (TBS) algorithm. Compared with the greedy algorithm used in the conventional AR technique, the proposed TBS algorithm yields a clear signal-to-quantization-noise ratio (SQNR) improvement. Moreover, in the scaling phase of the EEAS-based CORDIC algorithm, we suggest a novel scaling operation, called the Extended Type-II (ET-II) scaling operation. The ET-II scaling operation applies the same design concepts as the EEAS scheme. It results in a much smaller quantization error than the conventional Type-I scaling operation in the numerical approximation of the scaling factor. By combining the aforementioned new schemes, the proposed EEAS-based CORDIC algorithm can improve the overall SQNR performance by up to 25 dB compared with previous works. Also, given the same target SQNR performance, it requires only about 66% of the iterations in the iterative CORDIC structure, or 66% of the hardware complexity in the parallel CORDIC structure, compared with the conventional AR technique. Hence, high-performance/low-latency CORDIC very large-scale integration architectures can be achieved without degrading the SQNR performance.
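The greedy angle recoding that this abstract uses as its point of comparison can be sketched roughly as follows. This is an illustrative approximation of greedy AR in general, not the authors' EEAS scheme or TBS algorithm; the function name, the wordlength, and the step budget are assumptions.

```python
import math


def greedy_angle_recoding(target, wordlength=16, max_steps=8):
    """Decompose `target` (radians) into a few signed elementary angles atan(2^-i).

    Returns (steps, residual), where steps is a list of (direction, index) pairs
    and residual is the part of the angle left unapproximated.
    """
    elementary = [math.atan(2.0 ** -i) for i in range(wordlength)]
    residual = target
    steps = []
    for _ in range(max_steps):
        # Greedy choice: the elementary angle closest to the remaining magnitude.
        i = min(range(wordlength), key=lambda k: abs(abs(residual) - elementary[k]))
        # Stop when even the best angle would not shrink the residual.
        if abs(abs(residual) - elementary[i]) >= abs(residual):
            break
        d = 1 if residual >= 0 else -1
        steps.append((d, i))
        residual -= d * elementary[i]
    return steps, residual


steps, residual = greedy_angle_recoding(math.pi / 6)
print(len(steps), residual)
```

Unlike conventional CORDIC, elementary angles may be skipped here, so fewer microrotations are needed; because the choice is made one step at a time, the result is not guaranteed to be the best decomposition for a given step budget, which is the gap a search over whole sequences targets.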


EURASIP Journal on Advances in Signal Processing | 2003

VLSI Design of a Variable-Length FFT/IFFT Processor for OFDM-Based Communication Systems

Jen-Chih Kuo; Ching-Hua Wen; Chih-Hsiu Lin; An-Yeu Wu

The technique of orthogonal frequency division multiplexing (OFDM) is well known for its robustness against frequency-selective fading channels. This technique has been widely used in many wired and wireless communication systems. In general, the fast Fourier transform (FFT) and inverse FFT (IFFT) operations are used as the modulation/demodulation kernel in OFDM systems, and the sizes of the FFT/IFFT operations vary across different OFDM applications. In this paper, we design and implement a variable-length prototype FFT/IFFT processor to cover the different specifications of OFDM applications. We adopt the cached-memory FFT architecture as the VLSI system architecture of the prototype FFT/IFFT processor for low power consumption. We also implement the twiddle factor butterfly processing element (PE) based on the coordinate rotation digital computer (CORDIC) algorithm, which avoids the use of a conventional multiplication-and-accumulation unit and evaluates the trigonometric functions using only add-and-shift operations. Finally, we implement a variable-length prototype FFT/IFFT processor with TSMCm 1P4M CMOS technology. The simulation results show that the chip can perform (-)-point FFT/IFFT operations up to operating frequency which can meet the speed requirement of most OFDM standards such as WLAN, ADSL, VDSL (), DAB, and-mode DVB.


Proceedings of the IEEE | 1998

Algorithm-based low-power and high-performance multimedia signal processing

K.J.R. Liu; An-Yeu Wu; A. Raghupathy; Jie Chen

Low power and high performance are the two most important criteria for many signal-processing system designs, particularly in real-time multimedia applications. There have been many approaches to achieve these two design goals at many different implementation levels ranging from very-large-scale-integration fabrication technology to system design. We review the works that have been done at various levels and focus on the algorithm-based approaches for low-power and high-performance design of signal processing systems. We present the concept of multirate computing that originates from filterbank design, then show how to employ it along with the other algorithmic methods to develop low-power and high-performance signal processing systems. The proposed multirate design methodology is systematic and applicable to many problems. We demonstrate that multirate computing is a powerful tool at the algorithmic level that enables designers to achieve either significant power reduction or high throughput depending on their choice. Design examples on basic multimedia processing blocks such as filtering, source coding, and channel coding are given. A digital signal-processing engine that is an adaptive reconfigurable architecture is also derived from the common features of our approach. Such an architecture forms a new generation of high-performance embedded signal processor based on the adaptive computing model. The goal of this paper is to demonstrate the flexibility and effectiveness of algorithm-based approaches and to show that the multirate approach is an effective and systematic design methodology to achieve low-power and high throughput signal processing at the algorithmic and architectural level.


international symposium on vlsi design, automation and test | 2010

Traffic-thermal mutual-coupling co-simulation platform for three-dimensional network-on-chip

Kai-Yuan Jheng; Chih-Hao Chao; Hao-Yu Wang; An-Yeu Wu

Thermal issues are among the major challenges in the research field of three-dimensional (3D) ICs. Network-on-Chip (NoC) has been viewed as a practical communication infrastructure for 3D ICs. To facilitate such research, an accurate and non-proprietary environment for simulating NoC traffic and temperature is necessary. In this paper, we present a traffic-thermal mutual-coupling co-simulation platform for 3D NoC. The translation error is eliminated, and therefore our co-simulation has no accuracy loss in the mutual coupling. Our simulation results, validated with a commercial tool, show that the temperature error of our platform is between −1 and 4 K. The simulation results also show the thermal profile of 3D NoC, in which the temperature is imbalanced even under balanced traffic. Hence, the proposed platform can be used for 3D thermal-aware design, 3D dynamic thermal management technology, and other related research in the future.


international symposium on vlsi design, automation and test | 2008

Overview of ITRI PAC project - from VLIW DSP processor to multicore computing platform

Tay-Jyi Lin; Chun-Nan Liu; Shau-Yin Tseng; Yuan-Hua Chu; An-Yeu Wu

The Industrial Technology Research Institute (ITRI) PAC (parallel architecture core) project was initiated in 2003. Its target is to develop a low-power, high-performance programmable SoC platform for multimedia applications. In the first phase of the PAC project (2004-2006), a 5-way VLIW DSP processor (PACDSP) was developed with our patented distributed and ping-pong register file and variable-length VLIW encoding techniques. A dual-core PAC SoC, composed of a PACDSP core and an ARM9 core, was also designed and fabricated in TSMC 0.13 μm technology to demonstrate its performance and energy efficiency for multimedia processing such as a real-time H.264 codec. This paper summarizes the technical contents of the PACDSP, the DVFS (dynamic voltage and frequency scaling)-enabled PAC SoC, and the energy-aware multimedia codec. The research directions of our second-phase PAC project (PAC II), including multicore architectures, ESL (electronic system-level) technology, and a low-power multimedia framework, are also addressed in this paper.

Collaboration

Top Co-Authors

Cheng-Zhou Zhan, National Taiwan University
Kai-Yuan Jheng, National Taiwan University
Chih-Hao Chao, National Taiwan University
Cheng-Shing Wu, National Central University
En-Jui Chang, National Taiwan University
Chun-Yuan Chu, National Taiwan University
Yen-Liang Chen, National Taiwan University