Gustavo A. Ruiz | Researchain

Publications

Featured research published by Gustavo A. Ruiz.

IEEE Journal of Solid-State Circuits | 1998

Evaluation of three 32-bit CMOS adders in DCVS logic for self-timed circuits

Gustavo A. Ruiz

The efficient implementation of adders in differential logic can be carried out using a new generate signal (N) presented in this paper. This signal enables iterative shared-transistor structures to be built with better speed/area performance than a conventional implementation. It also allows adders developed in domino logic to be easily adapted to differential logic. Based on this signal, three 32-bit adders in differential cascode voltage switch (DCVS) logic with a completion circuit for self-timed applications have been fabricated in a standard 1.0-μm two-level-metal CMOS technology. The adders are a ripple-carry (RC) adder, a carry look-ahead (CLA) adder, and a binary carry look-ahead (BCL) adder. The RC adder performs best for random input data, but its delay depends strongly on the length of the carry propagation path, so it is not recommended for circuits with nonrandom input operands. The BCL adder is the fastest but has a high cost in chip area. The CLA adder provides an intermediate option, with an area 20% greater than that of the RC adder. Its average delay is slightly greater than that of the other two adders, but its addition time increases only slowly with the carry propagation length, even for adders with a large number of bits.
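The abstract's point about the self-timed RC adder is that its completion time tracks the longest carry propagation path, which is short for random operands but spans the whole word in the worst case. The sketch below measures that path length using textbook generate/propagate/kill classification; it does not model the paper's N signal or any DCVS circuit, and the function names are hypothetical.

```python
import random


def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Return the longest carry propagation path (in bit positions) for a + b,
    assuming carry-in 0 and unsigned width-bit operands."""
    longest = 0
    current = 0  # length of the carry chain that is currently active
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:
            # Generate: this position produces a carry, so a new chain starts here.
            current = 1
        elif ai ^ bi:
            # Propagate: an incoming carry (if any) travels one more position.
            current = current + 1 if current else 0
        else:
            # Kill: any incoming carry is absorbed and the chain ends.
            current = 0
        longest = max(longest, current)
    return longest


def mean_chain_length(width: int = 32, trials: int = 10_000, seed: int = 0) -> float:
    """Average longest chain over uniformly random operands."""
    rng = random.Random(seed)
    total = sum(
        longest_carry_chain(rng.getrandbits(width), rng.getrandbits(width), width)
        for _ in range(trials)
    )
    return total / trials


print(mean_chain_length())                  # a handful of positions for random data
print(longest_carry_chain(1, 0xFFFFFFFF))  # 32: the carry ripples through every bit
```

A self-timed adder with completion detection finishes when the longest chain settles, which is why its average delay looks good on random data while a nonrandom operand pattern such as the second call drives it to the full word length.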

Microelectronics Journal | 2004

An area-efficient static CMOS carry-select adder based on a compact carry look-ahead unit

Gustavo A. Ruiz; Mercedes Granda

This paper presents a highly area-efficient CMOS carry-select adder (CSA) with a regular, iterative shared-transistor structure that is well suited to VLSI implementation. The adder is based on a static, compact multi-output carry look-ahead (CLA) circuit and a very simple select circuit. Comparisons with other representative 32-bit CSAs show that the proposed adder reduces the area by 16% to 25%, the number of transistors by 30% to 43%, and the dynamic power consumption by 16% to 35%, while maintaining high speed.
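The carry-select principle the abstract builds on can be sketched at the bit level: each block precomputes its result for both possible carry-ins, and the real carry only drives a multiplexer. This is a generic illustration assuming fixed 4-bit blocks; it does not reproduce the paper's compact multi-output CLA circuit, and `carry_select_add` is a hypothetical name.

```python
def carry_select_add(a: int, b: int, width: int = 32, block: int = 4) -> tuple[int, int]:
    """Add two unsigned width-bit integers with a carry-select scheme.

    Each block precomputes its sum for carry-in 0 and carry-in 1; the actual
    carry arriving from the previous block only drives a selection.
    Returns (sum modulo 2**width, carry-out).
    """
    if width % block:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for lo in range(0, width, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # In hardware both candidates are computed in parallel, before the carry arrives.
        sum_if_c0 = a_blk + b_blk
        sum_if_c1 = a_blk + b_blk + 1
        chosen = sum_if_c1 if carry else sum_if_c0  # the select (multiplexer) step
        result |= (chosen & mask) << lo
        carry = chosen >> block  # block carry-out, forwarded to the next select
    return result, carry


print(carry_select_add(0xFFFFFFFF, 1))  # (0, 1)
```

The area cost of a carry-select adder comes from the duplicated per-block sums; sharing logic between the two candidates, as the abstract describes for the CLA unit, is what reduces it.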

Microelectronics Journal | 2011

Efficient canonic signed digit recoding

Gustavo A. Ruiz; Mercedes Granda

In this work, novel and efficient implementations for converting a two's complement binary number into its canonic signed digit (CSD) representation are presented. In these CSD recoding circuits, two signals, H and K, that are functionally equivalent to two carries are described. They are computed in parallel, which shortens the critical path, and they have properties that simplify the algebraic expressions and minimize the overall hardware implementation. As a result, the proposed circuits are highly efficient in terms of speed and area compared with previous counterpart architectures. Simulations of different configurations on standard-cell implementations show average reductions of about 55% in delay and 29% in area for a ripple-carry scheme, 47% in delay and 17% in area for a carry look-ahead scheme, and 36% in delay and 31% in area for a parallel prefix scheme.
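For reference, the classic serial CSD recoding (Reitwiesner-style) uses a single recoding carry; the paper's contribution is to replace that carry chain with the parallel H and K signals, which are not reproduced here. The sketch below only shows the conventional baseline the abstract compares against, with a hypothetical function name.

```python
def csd_recode(x: int, n: int) -> list[int]:
    """Recode an n-bit two's complement integer into CSD digits d_0..d_{n-1}
    (LSB first, each in {-1, 0, 1}) such that sum(d_i * 2**i) == x and no two
    adjacent digits are nonzero (classic Reitwiesner-style recoding)."""
    if not -(1 << (n - 1)) <= x < (1 << (n - 1)):
        raise ValueError("x does not fit in n-bit two's complement")
    # Python's >> on negative ints is arithmetic, so this yields two's complement bits.
    bits = [(x >> i) & 1 for i in range(n)]
    bits.append(bits[-1])  # sign extension: x_n = x_{n-1}
    digits = []
    c = 0  # recoding carry
    for i in range(n):
        theta = bits[i] ^ c  # 1 where the recoded digit is nonzero
        digits.append((1 - 2 * bits[i + 1]) * theta)  # sign taken from the next bit
        # Carry-out is the majority of x_i, x_{i+1} and c_i.
        c = (bits[i] & bits[i + 1]) | (bits[i] & c) | (bits[i + 1] & c)
    return digits


print(csd_recode(7, 4))   # [-1, 0, 0, 1]  ->  -1 + 8 = 7
print(csd_recode(-3, 4))  # [1, 0, -1, 0]  ->   1 - 4 = -3
```

The carry `c` makes each digit depend on all lower bits, which is exactly the serial dependency that ripple-carry, carry look-ahead, and parallel prefix organizations of the recoder try to shorten.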

Proceedings of SPIE | 2007

FPGA realization of a Split Radix FFT processor

Jesús García; Juan A. Michell; Gustavo A. Ruiz; Angel M. Buron

Applications based on the Fast Fourier Transform (FFT), such as signal and image processing, require high computational power as well as the ability to choose the algorithm and the architecture used to implement it. This paper describes the realization of a Split Radix FFT (SRFFT) processor based on a pipeline architecture previously reported by the same authors. The basic building blocks of this architecture are a Complex Butterfly and a Delay Commutator. The main advantages of this architecture are:
* It combines the higher parallelism of radix-4 FFTs with the ability to process sequences whose length is any power of two.
* Multipliers and adder-subtracters operate simultaneously, as is implicit in the SRFFT, which leads to faster operation for the same degree of pipelining.
The implementation targets a Field Programmable Gate Array (FPGA) as a way of obtaining high performance at low cost and with a short development time. The Delay Commutator has been designed to be customizable for even and odd SRFFT computation levels. It can be used with segmented arithmetic at any pipeline depth to increase the operating frequency. The processor has been simulated at up to 350 MHz for a 256-point complex transform, with an Altera Stratix II EP2S15F672C3 as the target device.
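The split-radix decomposition behind the processor can be written as a short recursive function: one half-length DFT over even samples plus two quarter-length DFTs over samples 1 mod 4 and 3 mod 4. This is the textbook recursion, checked against a direct DFT; it does not model the authors' pipeline, Complex Butterfly, or Delay Commutator, and the function names are illustrative.

```python
import cmath
import random


def split_radix_fft(x: list[complex]) -> list[complex]:
    """Recursive split-radix DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])   # N/2-point DFT of even samples (radix-2 part)
    z = split_radix_fft(x[1::4])   # N/4-point DFT of samples 1, 5, 9, ...
    zp = split_radix_fft(x[3::4])  # N/4-point DFT of samples 3, 7, 11, ...
    out = [0j] * n
    q = n // 4
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # twiddle W_N^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # twiddle W_N^{3k}
        s = w1 * z[k] + w3 * zp[k]
        d = w1 * z[k] - w3 * zp[k]
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] - 1j * d
        out[k + 3 * q] = u[k + q] + 1j * d
    return out


def naive_dft(x: list[complex]) -> list[complex]:
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)) for k in range(n)]


rng = random.Random(1)
signal = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(256)]
error = max(abs(a - b) for a, b in zip(split_radix_fft(signal), naive_dft(signal)))
print(error)  # close to floating-point round-off
```

The mix of an N/2 branch and two N/4 branches is what lets an SRFFT handle any power-of-two length while keeping radix-4-like parallelism in the odd-indexed part.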

Signal Processing: Image Communication | 2011

An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC

Gustavo A. Ruiz; Juan A. Michell

Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but its high computational complexity makes the design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H.264/AVC based on the full-search block-matching algorithm (FSBMA). It uses an architecture with a configurable 2-D systolic array to achieve high data reuse of the search area. This systolic array supports a three-direction scan format in which only one row of pixels changes between two adjacent subblocks, reducing memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SAD of the basic 4×4 subblocks, and a modified Lagrangian cost is used as the matching criterion to find the best match for each of the 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses a serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18-μm technology, resulting in a circuit with only 32.3k gates and 6 RAMs (59 kbit of on-chip memory in total). Under typical working conditions (25 °C, 1.8 V), a clock frequency of 300 MHz is estimated, giving a processing capacity for HDTV (1920×1088 @ 30 fps) with a 32×32 search range.
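The "41 variable-size blocks" follow from the H.264 partitions of a 16×16 macroblock: 16 (4×4) + 8 (8×4) + 8 (4×8) + 4 (8×8) + 2 (16×8) + 2 (8×16) + 1 (16×16). The sketch below, under stated assumptions, computes only the sixteen 4×4 SADs per candidate and builds the other 25 by addition, then runs a full search. It uses plain SAD rather than the paper's modified Lagrangian cost, ignores the systolic dataflow, and all names are hypothetical.

```python
import random

Block = list[list[int]]


def sad_4x4_grid(cur: Block, ref: Block, oy: int, ox: int) -> list[list[int]]:
    """SADs of the sixteen 4x4 subblocks of a 16x16 macroblock `cur`,
    matched against `ref` at offset (oy, ox)."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[oy + y][ox + x])
    return grid


def merge_sads(g: list[list[int]]) -> dict:
    """Build all 41 partition SADs (keys use width x height) by summing 4x4 SADs."""
    s = {("4x4", r, c): g[r][c] for r in range(4) for c in range(4)}
    s.update({("8x4", r, c): g[r][2 * c] + g[r][2 * c + 1] for r in range(4) for c in range(2)})
    s.update({("4x8", r, c): g[2 * r][c] + g[2 * r + 1][c] for r in range(2) for c in range(4)})
    s.update({("8x8", r, c): s[("8x4", 2 * r, c)] + s[("8x4", 2 * r + 1, c)]
              for r in range(2) for c in range(2)})
    s.update({("16x8", r, 0): s[("8x8", r, 0)] + s[("8x8", r, 1)] for r in range(2)})
    s.update({("8x16", 0, c): s[("8x8", 0, c)] + s[("8x8", 1, c)] for c in range(2)})
    s[("16x16", 0, 0)] = s[("16x8", 0, 0)] + s[("16x8", 1, 0)]
    return s


def full_search(cur: Block, ref: Block, y0: int, x0: int, r: int) -> dict:
    """Exhaustive search over motion vectors in [-r, r). Returns
    {partition: (best SAD, (dy, dx))} for all 41 partitions at once."""
    best = {}
    for dy in range(-r, r):
        for dx in range(-r, r):
            sads = merge_sads(sad_4x4_grid(cur, ref, y0 + dy, x0 + dx))
            for key, cost in sads.items():
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, (dy, dx))
    return best


random.seed(0)
ref = [[random.randrange(256) for _ in range(48)] for _ in range(48)]
true_mv = (2, -3)
cur = [[ref[16 + y + true_mv[0]][16 + x + true_mv[1]] for x in range(16)] for y in range(16)]
best = full_search(cur, ref, 16, 16, 8)
print(len(best), best[("16x16", 0, 0)])  # 41 (0, (2, -3))
```

Because every larger partition is a sum of 4×4 SADs, a single pass over the search window serves all 41 blocks, which is the data-reuse idea behind an array that computes only 4×4 SADs.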

IEEE Transactions on Signal Processing | 2005

Parallel-pipeline 8×8 forward 2-D ICT processor chip for image coding

Gustavo A. Ruiz; Juan A. Michell; Angel M. Buron

The Integer Cosine Transform (ICT) performs close to the Discrete Cosine Transform (DCT) with reduced computational complexity. The ICT kernel is integer-based, so computation requires only addition and shift operations. This work presents a parallel-pipelined architecture of an 8×8 forward two-dimensional (2-D) ICT(10,9,6,2,3,1) processor for image encoding. A fully pipelined row-column decomposition method based on two one-dimensional (1-D) ICTs and a transpose buffer built from D-type flip-flops is used. The main characteristics of the 1-D ICT architecture are high throughput, parallel processing, reduced internal storage, and 100% efficiency in the computational elements. The arithmetic units are distributed and consist of adders/subtractors operating at half the input data rate. In this transform, truncation and rounding errors are introduced only at the final normalization stage. The normalization coefficient word length of 18 bits (13 bits effective) was established using the requirements of IEEE Standard 1180-1990 as a reference. The processor has been implemented with a standard-cell design methodology in 0.35-μm CMOS technology; it measures 9.3 mm² and contains 12.4k gates. The maximum frequency is 300 MHz, with a latency of 214 cycles (260 cycles with normalization).
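To make the "ICT(10,9,6,2,3,1)" notation concrete, the sketch below builds an 8-point integer kernel that follows the DCT-II sign pattern, with (a, b, c, d) = (10, 9, 6, 2) in the odd rows and (e, f) = (3, 1) in rows 2 and 6. The row layout and the use of 1 for rows 0 and 4 are assumptions for illustration, not taken from the paper. It checks that the rows are orthogonal but unequal in norm, which is why a separate normalization stage is needed.

```python
import math


def ict8_matrix(a, b, c, d, e, f):
    """8-point ICT kernel with the DCT-II sign pattern; rows 0 and 4 use 1 (assumed)."""
    return [
        [1, 1, 1, 1, 1, 1, 1, 1],
        [a, b, c, d, -d, -c, -b, -a],
        [e, f, -f, -e, -e, -f, f, e],
        [b, -d, -a, -c, c, a, d, -b],
        [1, -1, -1, 1, 1, -1, -1, 1],
        [c, -a, d, b, -b, -d, a, -c],
        [f, -e, e, -f, -f, e, -e, f],
        [d, -c, b, -a, a, -b, c, -d],
    ]


T = ict8_matrix(10, 9, 6, 2, 3, 1)
# The odd rows are mutually orthogonal only if a*b == a*c + b*d + c*d (90 == 60 + 18 + 12).
gram = [[sum(r[k] * s[k] for k in range(8)) for s in T] for r in T]
norms = [gram[i][i] for i in range(8)]
print(norms)  # [8, 442, 40, 442, 8, 442, 40, 442]

# Normalization: scaling row i by 1/sqrt(norms[i]) gives an orthonormal matrix.
T_norm = [[v / math.sqrt(n) for v in row] for row, n in zip(T, norms)]
dct = [[(math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)) * math.cos((2 * m + 1) * k * math.pi / 16)
        for m in range(8)] for k in range(8)]
max_dev = max(abs(T_norm[k][m] - dct[k][m]) for k in range(8) for m in range(8))
print(max_dev)  # roughly 0.03: the normalized ICT stays close to the DCT
```

Since all kernel entries are small integers (10 = 8 + 2, 9 = 8 + 1, 6 = 4 + 2, 3 = 2 + 1), each product reduces to shifts and additions, and the irrational scale factors are deferred to the final normalization, consistent with the abstract.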

IEEE Transactions on Signal Processing | 1998

Memory efficient programmable processor chip for inverse Haar transform

Gustavo A. Ruiz; Juan A. Michell

In this correspondence, a processor chip programmable from N=8 to N=1024 for the one-dimensional inverse Haar transform (1-D-IFHT) is presented. The processor uses a low-latency data flow with an architecture that minimizes the internal memory and uses an adder/subtracter as its only computing element. The control logic has a single, modular structure and can be easily extended to longer transforms. A prototype of the 1-D-IFHT processor has been implemented using a standard-cell design methodology and a 1.0-μm CMOS process on an 11.7 mm² die. The maximum data rate is close to 60 MHz.
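The claim that an adder/subtracter is the only computing element needed is easy to see with one common unnormalized Haar convention, assumed here: the forward transform stores pairwise averages and half-differences, so the inverse rebuilds each pair as s + d and s − d. The paper's actual coefficient scaling, ordering, and dataflow are not given, and `Fraction` is used only to keep this sketch exact in place of fixed-point hardware.

```python
from fractions import Fraction


def haar_forward(x: list[int]) -> list[Fraction]:
    """Unnormalized 1-D Haar analysis (assumed convention): each pair (a, b)
    becomes the average (a+b)/2 and the half-difference (a-b)/2.
    Output order: [overall average, coarsest detail, ..., finest details]."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two >= 2")
    c = [Fraction(v) for v in x]
    length = n
    while length > 1:
        half = length // 2
        avg = [(c[2 * i] + c[2 * i + 1]) / 2 for i in range(half)]
        diff = [(c[2 * i] - c[2 * i + 1]) / 2 for i in range(half)]
        c[:length] = avg + diff
        length = half
    return c


def haar_inverse(c: list) -> list:
    """Inverse Haar synthesis using only additions and subtractions:
    a = s + d and b = s - d at every level."""
    out = list(c)
    length = 1
    while length < len(out):
        merged = []
        for s, d in zip(out[:length], out[length:2 * length]):
            merged += [s + d, s - d]  # one add and one subtract per output pair
        out[:2 * length] = merged
        length *= 2
    return out


x = [3, 1, 4, 1, 5, 9, 2, 6]
print([int(v) for v in haar_inverse(haar_forward(x))])  # [3, 1, 4, 1, 5, 9, 2, 6]
```

Each synthesis level doubles the number of reconstructed samples, so an N-point inverse takes log2(N) levels; supporting N from 8 to 1024 amounts to running 3 to 10 levels with the same add/subtract step.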

Proceedings of SPIE | 2007

Low-cost VLSI architecture design for forward quantization of H.264/AVC

Gustavo A. Ruiz; Juan A. Michell

H.264/AVC (Advanced Video Coding) is the latest standard for video coding. It assumes a scalar forward quantizer in the encoder that can be implemented directly in integer arithmetic. An efficient architecture for the computation of forward quantization in H.264/AVC is presented in this paper. It uses a modification of the quantization operation that reduces the number of arithmetic operations, and a truncated Booth multiplier based on an adaptive statistical approach that reduces the hardware. The C code of the JM reference software has been rewritten to analyze the effect of the new algorithm and of the truncated Booth multiplier. Simulations on popular test sequences used in video standardization show the validity of this approach. The results demonstrate that, at low QP, the PSNR improves by between 0.31 dB and 0.81 dB, with a slight increase in bit rate of around 0.8%. Finally, an architecture suitable for VLSI implementation is presented that reduces the area by 26%, the power by 32%, and the critical path delay by 21% compared with a classical implementation. It also reduces the area and increases the speed compared with previously published architectures.
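As a baseline for the "classical implementation" the abstract compares against, the widely published integer forward quantizer for 4×4 H.264 coefficients is Z = sign(W)·((|W|·MF + f) >> qbits), with qbits = 15 + floor(QP/6). The sketch below uses the multiplication factors commonly tabulated for the JM reference encoder; it does not implement the paper's modified operation or its truncated Booth multiplier, and `quantize` is a hypothetical name.

```python
# Multiplication factors MF indexed by QP % 6; columns are the coefficient-position
# classes (both indices even, both odd, mixed) for a 4x4 block.
MF = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def quantize(w: int, i: int, j: int, qp: int, intra: bool = True) -> int:
    """Forward-quantize transform coefficient w at position (i, j) of a 4x4 block."""
    qbits = 15 + qp // 6
    if i % 2 == 0 and j % 2 == 0:
        mf = MF[qp % 6][0]
    elif i % 2 == 1 and j % 2 == 1:
        mf = MF[qp % 6][1]
    else:
        mf = MF[qp % 6][2]
    f = (1 << qbits) // (3 if intra else 6)  # rounding offset: 2^qbits/3 intra, /6 inter
    level = (abs(w) * mf + f) >> qbits       # the |W| * MF product is the multiplier's job
    return -level if w < 0 else level        # sign restored after quantizing the magnitude


print(quantize(100, 0, 0, 0))  # 40
print(quantize(100, 0, 0, 6))  # 20: each +6 in QP halves the quantized level
```

The |W|·MF multiplication dominates the hardware, which is why a truncated multiplier (dropping low-order partial products) is an attractive place to save area, at the cost of the small PSNR and bit-rate effects the abstract reports.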

Signal Processing: Image Communication | 2011

A high-throughput ASIC processor for 8×8 transform coding in H.264/AVC

Juan A. Michell; José M. Solana; Gustavo A. Ruiz

In July 2004, a new amendment called the Fidelity Range Extensions (FRExt) was added to H.264/AVC, a standardization initiative motivated by the rapidly growing demand for coding higher-fidelity video material. One improvement in FRExt is a new 8×8 integer transform that uses only additions and shifts, avoiding mismatches between encoders and decoders. This paper presents a processor with a pipelined architecture for real-time implementation of the complete 8×8 transform coding process in H.264: forward 8×8 integer transform, quantization and scaling, rescaling, inverse 8×8 integer transform, and reconstruction of the image block. The architecture was conceived to achieve a high operating frequency and high throughput without increasing hardware complexity. To achieve an efficient implementation, hardware solutions have been developed for the different circuit modules. The 8×8 forward and inverse transforms are calculated using the separability property, with an architecture suited to pipeline schemes made up of two 1-D processors and a transpose register array. New expressions for forward quantization and scaling are presented that allow an efficient hardware implementation by avoiding sign conversion. The inverse quantization has also been optimized in terms of hardware complexity by minimizing the arithmetic operations involved. Furthermore, an exhaustive analysis of the dynamic range of the datapath is carried out to fix the optimum bus widths, reducing the size of the circuit while avoiding overflow. Finally, the critical paths of the various computing units have been carefully analyzed and balanced with a pipeline scheme to maximize the operating frequency without introducing excessive latency.
A prototype of the proposed architecture has been synthesized in a 130 nm HCMOS technology process; it achieves a maximum speed of 330 MHz with a throughput of 2640 Mpixels/s.
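The reported figures imply 2640 / 330 = 8 pixels per clock cycle. The sketch below shows, under stated assumptions, how a 1-D 8-point forward transform reduces to additions and shifts through an even/odd butterfly, and how the 2-D transform follows from separability (row pass, transpose, column pass). The kernel `T8` is the commonly cited FRExt 8×8 forward core matrix scaled by 8 and should be checked against the standard; the function names are hypothetical and the paper's pipeline is not modeled.

```python
# Commonly cited FRExt 8x8 forward core transform kernel (entries scaled by 8).
T8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]


def fwd8_1d(x: list[int]) -> list[int]:
    """Compute T8 @ x with additions and shifts only (even/odd butterfly)."""
    s = [x[i] + x[7 - i] for i in range(4)]  # symmetric sums feed the even rows
    d = [x[i] - x[7 - i] for i in range(4)]  # antisymmetric differences feed the odd rows
    e0, e1 = s[0] + s[3], s[1] + s[2]
    o0, o1 = s[0] - s[3], s[1] - s[2]
    y = [0] * 8
    y[0] = (e0 + e1) << 3
    y[4] = (e0 - e1) << 3
    y[2] = (o0 << 3) + (o1 << 2)
    y[6] = (o0 << 2) - (o1 << 3)
    # Odd-row constants as shift-add: 12 = 8+4, 10 = 8+2, 6 = 4+2, 3 = 2+1.
    m12 = [(v << 3) + (v << 2) for v in d]
    m10 = [(v << 3) + (v << 1) for v in d]
    m6 = [(v << 2) + (v << 1) for v in d]
    m3 = [(v << 1) + v for v in d]
    y[1] = m12[0] + m10[1] + m6[2] + m3[3]
    y[3] = m10[0] - m3[1] - m12[2] - m6[3]
    y[5] = m6[0] - m12[1] + m3[2] + m10[3]
    y[7] = m3[0] - m6[1] + m10[2] - m12[3]
    return y


def fwd8_2d(block: list[list[int]]) -> list[list[int]]:
    """Separable 2-D transform T8 @ X @ T8^T: row pass, transpose, column pass."""
    rows = [fwd8_1d(r) for r in block]             # X @ T8^T
    transposed = [list(col) for col in zip(*rows)]  # role of the transpose register array
    cols = [fwd8_1d(c) for c in transposed]         # (T8 @ X @ T8^T)^T
    return [list(col) for col in zip(*cols)]


flat = [[1] * 8 for _ in range(8)]
print(fwd8_2d(flat)[0][0])  # 4096: a flat block has only a DC coefficient
```

Splitting the input into sums and differences first halves the work of each row, and the shift-add constants avoid general multipliers, which is the property FRExt relies on to keep encoder and decoder bit-exact.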

International Conference on Image Processing | 2003

Parallel-pipelined architecture for 2-D ICT VLSI implementation

Juan A. Michell; Gustavo A. Ruiz; Angel M. Buron

The integer cosine transform (ICT) has been shown to be an alternative to the DCT for image processing. This paper presents a parallel-pipelined architecture of an 8×8 ICT(10, 9, 6, 2, 3, 1) processor for image compression. The main characteristics of this architecture are high throughput, low latency, reduced internal storage, and 100% efficiency in all computational elements. The processor has been designed in 0.35-μm CMOS technology with an operating frequency of 300 MHz.
