Juan A. Michell
University of Cantabria
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Juan A. Michell.
IEEE Transactions on Signal Processing | 1999
J. Garcia; Juan A. Michell; Angel M. Buron
This paper presents a full custom one-bit slice delay commutator for a pipeline split radix FFT (SRFFT) architecture, implemented using the true single-phase-clock (TSPC) circuit technique and a 1.0-/spl mu/m CMOS technology. This circuit can be configured or all intermediate SRFFT computation levels for transforms of lengths up to N=2048, where N is power of two. The circuit has been tested up to 200 MHz, having a power consumption of 1.1 W at 5 V of power supply.
Proceedings of SPIE | 2007
Jesús García; Juan A. Michell; Gustavo A. Ruiz; Angel M. Buron
Applications based on Fast Fourier Transform (FFT) such as signal and image processing require high computational power, plus the ability to choose the algorithm and architecture to implement it. This paper explains the realization of a Split Radix FFT (SRFFT) processor based on a pipeline architecture reported before by the same authors. This architecture has as basic building blocks a Complex Butterfly and a Delay Commutator. The main advantages of this architecture are: * To combine the higher parallelism of the 4r-FFTs and the possibility of processing sequences having length of any power of two. * The simultaneous operation of multipliers and adder-subtracters implicit in the SRFFT, which leads to faster operation at the same degree of pipeline. The implementation has been made on a Field Programmable Gate Array (FPGA) as a way of obtaining high performance at economical price and a short time of realization. The Delay Commutator has been designed to be customized for even and odd SRFFT computation levels. It can be used with segmented arithmetic of any level of pipeline in order to speed up the operating frequency. The processor has been simulated up to 350 MHz, with an EP2S15F672C3 Altera Stratix II as a target device, for a transform length of 256 complex points.
Signal Processing-image Communication | 2011
Gustavo A. Ruiz; Juan A. Michell
Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts the variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but the high computational complexity makes design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H264/AVC based on the full-search block-matching algorithm (FSBMA). It uses architecture with a configurable 2D systolic array to obtain a high data reuse of search area. This systolic array supports a three-direction scan format in which only one row of pixels is changed between the two adjacent subblocks, thus reducing the memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SAD of basic 4x4 subblocks and a modified Lagrangian cost is used as matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18@mm technology resulting in a circuit with only 32.3k gates and 6 RAMs (total 59kBits on-chip memory). In typical working conditions (25^oC, 1.8V), a clock frequency of 300MHz can be estimated with a processing capacity for HDTV (1920x1088 @ 30fps) and a search range of 32x32.
IEEE Transactions on Signal Processing | 2005
Gustavo A. Ruiz; Juan A. Michell; Angel M. Buron
The Integer Cosine Transform (ICT) presents a performance close to Discrete Cosine Transform (DCT) with a reduced computational complexity. The ICT kernel is integer-based, so computation only requires adding and shifting operations. This work presents a parallel-pipelined architecture of an 8/spl times/8 forward two-dimensional (2-D) ICT(10,9,6,2,3,1) processor for image encoding. A fully pipelined row-column decomposition method based on two one-dimensional (1-D) ICTs and a transpose buffer based on D-type flip-flops is used. The main characteristics of 1-D ICT architecture are high throughput, parallel processing, reduced internal storage, and 100% efficiency in computational elements. The arithmetic units are distributed and are made up of adders/subtractors operating at half the frequency of the input data rate. In this transform, the truncation and rounding errors are only introduced at the final normalization stage. The normalization coefficient word length of 18-bit (13-bit effective) has been established using the requirements of IEEE standard 1180-1990 as a reference. The processor has been implemented using standard cell design methodology in 0.35-/spl mu/m CMOS technology, measures 9.3 mm/sup 2/, and contains 12.4 k gates. The maximum frequency is 300 MHz with a latency of 214 cycles (260 cycles with normalization).
IEEE Transactions on Signal Processing | 1998
Gustavo A. Ruiz; Juan A. Michell
In this correspondence, a processor chip programmable between N=8 and N=1024 for the unidimensional inverse Haar transform (1-D-IFHT) is presented. The processor uses a low latency data-flow with an architecture that minimizes the internal memory and an adder/subtracter as the only computing element. The control logic has a single and modular structure and can be easily extended to longer transforms. A prototype of the 1-D-IFHT processor has been implemented using a standard-cell design methodology and a 1.0-/spl mu/m CMOS process on a 11.7 mm/sup 2/ die. The maximum data rate is close to 60 MHz.
Proceedings of SPIE | 2007
Gustavo A. Ruiz; Juan A. Michell
The H.264/AVC (Advanced Video Codec) is the latest standard for video coding. It assumes a scalar forward quantizer performed at the encoder which can be implemented directly in integer arithmetic. An efficient architecture for the computation of forward quantization of H.264/AVC is presented in this paper. It uses a modification of the quantization operation which reduces the arithmetic operations, and a truncated Booth multiplier based on adaptative statistical approach, which reduces the hardware. The JM reference softwares C code has been re-written to analyze the effect of new algorithm and of truncated Booth multiplier. Simulations made up over popular test sequences used in video standardization show the validity of this approach. These results demonstrate that, at low QP, the PSNR is improved between a maximum of +0.81db and a minimum of 0.31db, with a slight increase in the Bit Rate being around 0.8%. Finally, a suitable architecture for VLSI implementation is presented, which reduces in a 26% the area, 32% the power and 21% the critical path delay in comparison with classical implementation. Moreover, it also reduces the area and increase the speed in comparison with architectures presented in references.
Signal Processing-image Communication | 2011
Juan A. Michell; José M. Solana; Gustavo A. Ruiz
In July 2004, a new amendment called Fidelity Range Extensions (FRExt) was added to the H.264/AVC as a standardization initiative motivated by the rapidly growing demands when coding higher-fidelity video material. One improvement present in the FRExt is the inclusion of a new 8x8 integer transform that only makes use of additions and shifters to avoid mismatches between encoders and decoders. This paper presents a processor with pipeline architecture for real-time implementation of the complete process for the 8x8 Transform Coding in H.264: forward 8x8 integer transform, quantization and scaling, re-scaling, inverse 8x8 integer transform and reconstruction of the image block. This architecture has been conceived with the aim of achieving a high operation frequency and high throughput without increasing the hardware complexity. In order to achieve an efficient implementation, hardware solutions have been developed for the different circuit modules. 8x8 forward and inverse transforms are calculated using the separability property with architecture more suitable for pipeline schemes made up of two 1D processors and a transpose register array. New expressions for forward quantization and scaling are presented allowing efficient hardware implementation by avoiding the sign conversion. The inverse quantization has also been optimized in terms of hardware complexity by minimizing the involved arithmetic operations. Furthermore, an exhaustive analysis in the dynamic range of the datapath is made to fix the optimum bus widths with the aim of reducing the size of the circuit while avoiding overflow. Finally, the critical paths of the various computing units have been carefully analyzed and balanced using a pipeline scheme in order to maximize the operation frequency without introducing an excessive latency. A prototype with the proposed architecture has been synthesized in a 130nm HCMOS technology process, which achieves a maximum speed of 330MHz with a throughput of 2640Mpixels/s.
international conference on image processing | 2003
Juan A. Michell; Gustavo A. Ruiz; Angel M. Buron
The integer cosine transform (ICT) has been shown to be an alternative to the DCT for image processing. This paper presents a parallel-pipelined architecture of an 8/spl times/8 ICT(I0, 9, 6, 2, 3, 1) processor for image compression. The main characteristics of this architecture are: high throughput, low latency, reduced internal storage and 100% efficiency in all computational elements. The processor has been designed in 0.35-/spl mu/m CMOS technology with an operational frequency of 300 MHz.
Proceedings of SPIE | 2003
Gustavo A. Ruiz; Juan A. Michell; Angel M. Buron; José M. Solana; Miguel A. Manzano; J. Diaz
The Discrete Cosine Transform (DCT) is the most widely used transform for image compression. The Integer Cosine Transform denoted ICT (10, 9, 6, 2, 3, 1) has been shown to be a promising alternative to the DCT due to its implementation simplicity, similar performance and compatibility with the DCT. This paper describes the design and implementation of a 8×8 2-D ICT processor for image compression, that meets the numerical characteristic of the IEEE std. 1180-1990. This processor uses a low latency data flow that minimizes the internal memory and a parallel pipelined architecture, based on a numerical strength reduction Integer Cosine Transform (10, 9, 6, 2, 3, 1) algorithm, in order to attain high throughput and continuous data flow. A prototype of the 8×8 ICT processor has been implemented using a standard cell design methodology and a 0.35-μm CMOS CSD 3M/2P 3.3V process on a 10 mm2 die. Pipeline circuit techniques have been used to attain the maximum frequency of operation allowed by the technology, attaining a critical path of 1.8ns, which should be increased by a 20% to allow for line delays, placing the estimated operational frequency at 500Mhz. The circuit includes 12446 cells, being flip-flops 6757 of them. Two clock signals have been distributed, an external one (fs) and an internal one (fs/2). The high number of flip-flops has forced the use of a strategy to minimize clock-skew, combining big sized buffers on the periphery and using wide metal lines (clock-trunks) to distribute the signals.
international conference on image processing | 2005
Gustavo A. Ruiz; Juan A. Michell; Angel M. Buron
This paper describes the architecture of an 8/spl times/8 2-D DCT/IDCT processor with high throughput, reduced hardware, and a parallel-pipeline scheme. This architecture allows the processing elements and arithmetic units to work in parallel at half the frequency of the data input rate. A fully pipelined row-column decomposition method based on two 1-D DCTs and a transpose buffer based on D-type flip-flops are used. The processor has been implemented in a 0.35-/spl mu/m CMOS process with a core area of 3mm/sup 2/ and 11.7k gates. It meets the requirements of IEEE Std. 1180-1990. The data input rate frequency is 300 MHz with a latency of 172 cycles for 2-D DCT and 178 cycles for 2-D IDCT. The proposed design is compact and suitable for HDTV applications.