Javier D. Bruguera
University of Santiago de Compostela
Publication
Featured research published by Javier D. Bruguera.
IEEE Transactions on Computers | 2005
José-Alejandro Piñeiro; Stuart F. Oberman; Jean-Michel Muller; Javier D. Bruguera
A table-based method for high-speed function approximation in single-precision floating-point format is presented in this paper. Our focus is the approximation of reciprocal, square root, square root reciprocal, exponentials, logarithms, trigonometric functions, powering (with a fixed exponent p), or special functions. The algorithm presented here combines table look-up, an enhanced minimax quadratic approximation, and an efficient evaluation of the second-degree polynomial (using a specialized squaring unit, redundant arithmetic, and multioperand addition). The execution times and area costs of an architecture implementing our method are estimated, showing the achievement of the fast execution times of linear approximation methods and the reduced area requirements of other second-degree interpolation algorithms. Moreover, the use of an enhanced minimax approximation which, through an iterative process, takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making our method very suitable for the implementation of an elementary function generator in state-of-the-art DSPs or graphics processing units (GPUs).
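The combination of table look-up and second-degree polynomial evaluation described above can be illustrated with a small software sketch. This is not the paper's method: the coefficients below come from plain interpolation at Chebyshev nodes rather than the enhanced minimax fit, they are kept in full double precision rather than rounded to a finite size, and the index width `M_BITS` and the function names are hypothetical choices made for illustration.

```python
import math

M_BITS = 6  # hypothetical: leading fraction bits used as the table index


def build_table(f, m_bits=M_BITS):
    """Return one (c0, c1, c2) triple per subinterval of [1, 2)."""
    n = 1 << m_bits
    width = 1.0 / n
    table = []
    for i in range(n):
        lo = 1.0 + i * width
        # Three Chebyshev nodes in the local coordinate h in (0, width).
        h = [0.5 * width * (1.0 + math.cos((2 * k + 1) * math.pi / 6))
             for k in range(3)]
        y = [f(lo + hk) for hk in h]
        # Newton divided differences, then expansion to monomial form.
        d01 = (y[1] - y[0]) / (h[1] - h[0])
        d12 = (y[2] - y[1]) / (h[2] - h[1])
        d012 = (d12 - d01) / (h[2] - h[0])
        c2 = d012
        c1 = d01 - d012 * (h[0] + h[1])
        c0 = y[0] - d01 * h[0] + d012 * h[0] * h[1]
        table.append((c0, c1, c2))
    return table


def approx(x, table, m_bits=M_BITS):
    """Approximate f(x) for x in [1, 2): look up coefficients, evaluate quadratic."""
    n = 1 << m_bits
    i = int((x - 1.0) * n)        # upper bits of the fraction select the entry
    h = x - (1.0 + i / n)         # lower bits form the polynomial argument
    c0, c1, c2 = table[i]
    return c0 + h * (c1 + c2 * h)


RECIP_TABLE = build_table(lambda v: 1.0 / v)
```

With 64 table entries, the reciprocal approximation in this sketch stays within roughly single-precision accuracy over [1, 2); a hardware design would additionally bound the word length of each coefficient, which is the point the abstract makes about rounding-aware minimax fitting.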
IEEE Transactions on Computers | 1999
Javier D. Bruguera; Tomás Lang
This paper describes the design of a leading-one prediction (LOP) logic for floating-point addition with an exact determination of the shift amount for normalization of the adder result. Leading-one prediction is a technique to calculate the number of leading zeros of the result in parallel with the addition. However, the prediction might be in error by one bit and previous schemes to correct this error result in a delay increase. The design presented here incorporates a concurrent position correction logic, operating in parallel with the LOP, to detect the presence of that error and produce the correct shift amount. We describe the error detection as part of the overall LOP, perform estimates of its delay and complexity, and compare with previous schemes.
IEEE Transactions on Computers | 1997
Elisardo Antelo; Julio Villalba; Javier D. Bruguera; Emilio L. Zapata
Traditionally, CORDIC algorithms have employed radix-2 in the first n/2 microrotations (n is the precision in bits) in order to preserve a constant scale factor. The authors present a full radix-4 CORDIC algorithm in rotation mode and circular coordinates and its corresponding selection function, and propose an efficient technique for the compensation of the nonconstant scale factor. Three radix-4 CORDIC architectures are implemented: 1) a word serial architecture based on the zero skipping technique, 2) a pipelined architecture, and 3) an application specific architecture (the angles are known beforehand). The first two are general purpose implementations where redundant (carry-save) or nonredundant arithmetic can be used, whereas the last one is a simplification of the first two. The proposed architectures present a good trade-off between latency and hardware complexity when compared with existing CORDIC architectures.
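For context, the conventional radix-2 rotation-mode CORDIC in circular coordinates, which the abstract takes as its starting point, can be sketched as follows. This is the textbook radix-2 recurrence with its constant scale factor, not the radix-4 algorithm or the scale-factor compensation proposed in the paper; floating-point multiplications by powers of two stand in for hardware shifts, and the function name and iteration count are illustrative assumptions.

```python
import math


def cordic_rotate(x, y, angle, n=32):
    """Rotate (x, y) by `angle` radians using n radix-2 microrotations.

    Converges for |angle| up to about 1.74 rad (the sum of all atan(2**-i)).
    """
    atans = [math.atan(2.0 ** -i) for i in range(n)]
    # With radix-2, every microrotation is always performed, so the scale
    # factor is a constant that can be precomputed.
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0          # selection: sign of residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return x / gain, y / gain                # compensate the constant gain
```

Calling `cordic_rotate(1.0, 0.0, t)` returns approximations of `(cos t, sin t)`. In a radix-4 scheme the digit set includes zero, so the scale factor depends on the angle, which is why the paper needs a compensation technique.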
Symposium on Computer Arithmetic | 2001
José-Alejandro Piñeiro; Javier D. Bruguera; Jean-Michel Muller
A method for the calculation of faithfully rounded single-precision floating-point powering (X^p) is proposed in this paper. This method employs table look-up and a second-degree minimax approximation, which allows the use of reduced-size tables to store the coefficients of the polynomial approximation. A specialized squaring unit and a fused accumulation tree carry out the computation of the quadratic polynomial. Both unfolded and pipelined architectures are presented, and the results of a pre-layout synthesis performed using CMOS 0.35 μm technology are shown, achieving a 50% area reduction from linear approximation methods, and with improved speed over other second-degree approximation based algorithms. The pipelined architecture has a latency of three cycles and a throughput of one result per cycle.
IEEE Transactions on Computers | 2002
José-Alejandro Piñeiro; Javier D. Bruguera
A new method for the high-speed computation of double-precision floating-point reciprocal, division, square root, and inverse square root operations is presented in this paper. This method employs a second-degree minimax polynomial approximation to obtain an accurate initial estimate of the reciprocal and the inverse square root values, and then performs a modified Goldschmidt iteration. The high accuracy of the initial approximation allows us to obtain double-precision results by computing a single Goldschmidt iteration, significantly reducing the latency of the algorithm. Two unfolded architectures are proposed: the first one computing only reciprocal and division operations, and the second one also including the computation of square root and inverse square root. The execution times and area costs for both architectures are estimated, and a comparison with other multiplicative-based methods is presented. The results of this comparison show the achievement of a lower latency than these methods, with similar hardware requirements.
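The role of the initial estimate can be seen in a small sketch of the standard Goldschmidt division iteration. This is the textbook form, not the modified iteration of the paper, and the `seed` argument simply stands in for the polynomial-based initial approximation of 1/b; all names are illustrative.

```python
def goldschmidt_divide(a, b, seed, iterations=1):
    """Approximate a / b, given `seed` as an estimate of 1 / b."""
    n = a * seed          # numerator, converges to a / b
    d = b * seed          # denominator, converges to 1
    for _ in range(iterations):
        f = 2.0 - d       # correction factor
        n *= f
        d *= f
    return n
```

If the seed leaves `d = 1 - e`, one iteration leaves a relative error of `e**2`, so the error is squared at every step. For example, `goldschmidt_divide(1.0, 1.5, 0.66)` starts with `e = 0.01` and reaches a relative error of about `1e-4` after a single iteration, which is why a sufficiently accurate initial estimate lets one iteration suffice.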
IEEE Transactions on Circuits and Systems for Video Technology | 2006
Roberto R. Osorio; Javier D. Bruguera
New image and video coding standards have pushed the limits of compression by introducing new techniques with high computational demands. The Advanced Video Coder (ITU-T H.264, AVC MPEG-4 Part 10) is the latest international standard, which introduces new enhanced features that require new levels of performance. Among the new tools present in AVC, the context-based binary arithmetic coder (CABAC) offers a significant compression advantage over baseline entropy coders. CABAC is meant to be used in AVC's Main and High Profiles, which target broadcast and video storage and distribution of standard and high-definition contents. In these applications, hardware acceleration is needed as the computational load of CABAC is high, challenging programmable processors. Moreover, rate-distortion optimization (RDO) increases CABAC's load by two orders of magnitude. In this paper, we present a fast and new architecture for arithmetic coding adapted to the characteristics of CABAC, including optimized use of memory and context managing and fast processing able to encode more than two symbols per cycle. A maximum processing speed of 185 MHz has been obtained for 0.35 μm, able to encode high quality video in real time. Some of the proposed optimizations may also be applied to software implementations, obtaining significant improvements.
IEEE Transactions on Computers | 2004
Tomás Lang; Javier D. Bruguera
We propose an architecture for the computation of the double-precision floating-point multiply-add-fused (MAF) operation A + (B × C). This architecture is based on the combined addition and rounding (using a dual adder) and on the anticipation of the normalization step before the addition. Because the normalization is performed before the addition, it is not possible to overlap the leading-zero-anticipator with the adder. Consequently, to avoid the increase in delay, we modify the design of the LZA so that the leading bits of its output are produced first and can be used to begin the normalization. Moreover, parts of the addition are also anticipated. We have estimated the delay of the resulting architecture considering the load introduced by long connections, and we estimate a delay reduction of between 15 percent and 20 percent, with respect to previous implementations.
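What distinguishes the fused operation from a separate multiply and add is that A + (B × C) is rounded only once. The sketch below emulates that behaviour in software with exact rational arithmetic; it says nothing about the proposed architecture, and the helper name `fused_multiply_add` is a hypothetical one chosen for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Return a + b * c with a single rounding to double precision."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)   # no intermediate rounding
    return float(exact)                               # one rounding at the end


# Operands chosen so that the product b * c is not exactly representable.
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
unfused = -1.0 + b * c                  # product rounds to 1.0 first
fused = fused_multiply_add(-1.0, b, c)  # keeps the low-order term
```

Here the exact product is 1 - 2^-60, which rounds to 1.0 in double precision, so the unfused result is 0.0 while the fused result is -2^-60.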
IEEE Transactions on Computers | 2004
José-Alejandro Piñeiro; Milos D. Ercegovac; Javier D. Bruguera
An architecture for the computation of logarithm, exponential, and powering operations is presented in this paper, based on a high-radix composite algorithm for the computation of the powering function (X^Y). The algorithm consists of a sequence of overlapped operations: 1) digit-recurrence logarithm, 2) left-to-right carry-free (LRCF) multiplication, and 3) online exponential. A redundant number system is used and the selection in 1) and 3) is done by rounding, except in the first iteration, when selection by table look-up is necessary to guarantee the convergence of the recurrences. A sequential implementation of the algorithm, with a control unit which allows the independent computation of logarithm and exponential, is proposed and the execution times and hardware requirements are estimated for single and double-precision floating-point computations. These estimates are obtained for radices from r=8 to r=1,024, according to an approximate model for the delay and area of the main logic blocks, and help determine the radix values which lead to the most efficient implementations: r=32 and r=128.
Digital Systems Design | 2004
Roberto R. Osorio; Javier D. Bruguera
In this paper we propose an efficient implementation of CABAC's binary arithmetic coder and context management system. CABAC is the context adaptive binary arithmetic coder used in the new H.264/AVC video standard. Arithmetic coding allows a significant enhancement in compression. However, implementation complexity is a drawback due to hardware cost and slowness. In this paper we show the need for a hardware implementation of arithmetic coding in current video compression systems. We propose a fast and efficient implementation of the encoding algorithm. We prove that memory accesses constitute a bottleneck and propose solutions that apply to the encoding algorithm and context management system. As a result, a fast architecture is presented, able to process one symbol per cycle.
IEEE Transactions on Communications | 1997
Montserrat Bóo; Francisco Argüello; Javier D. Bruguera; Ramón Doallo; Emilio L. Zapata
The Viterbi (1967) algorithm (VA) is known to be an efficient method for the realization of maximum-likelihood (ML) decoding of convolutional codes. The VA is characterized by a graph, called a trellis, which defines the transitions between states. To define an area efficient architecture for the VA is equivalent to obtaining an efficient mapping of the trellis. We present a methodology that permits the efficient hardware mapping of the VA onto a processor network of arbitrary size. This formal model is employed for the partitioning of the computations among an arbitrary number of processors in such a way that the data are recirculated, optimizing the use of the PEs and the communications. Therefore, the algorithm is mapped onto a column of processing elements and an optimal design solution is obtained for a particular set of area and/or speed constraints. Furthermore, the management of the surviving path memory for its mapping and distribution among the processors was studied. As a result, we obtain a regular and modular design appropriate for its VLSI implementation in which the only necessary communications between processors are the data recirculations between stages.
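The trellis, the add-compare-select step, and the surviving path memory mentioned above can be made concrete with a small software sketch of hard-decision Viterbi decoding. It does not reflect the paper's partitioning onto processing elements; the rate-1/2, constraint-length-3 code with generators (7, 5) octal is an assumed example, and all names are illustrative.

```python
K = 3                     # assumed constraint length
G = (0b111, 0b101)        # assumed generator polynomials (7, 5 octal)
N_STATES = 1 << (K - 1)


def parity(v):
    return bin(v).count("1") & 1


def step(state, bit):
    """One trellis transition: return (next_state, output_bits)."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, o = step(state, b)
        out.extend(o)
    return out


def viterbi_decode(received):
    """Maximum-likelihood decoding using the Hamming distance as metric."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)      # encoder starts in state 0
    paths = [[] for _ in range(N_STATES)]       # surviving path per state
    for t in range(0, len(received), len(G)):
        r = received[t:t + len(G)]
        new_metrics = [inf] * N_STATES
        new_paths = [None] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == inf:
                continue
            for b in (0, 1):
                ns, o = step(s, b)
                m = metrics[s] + sum(x != y for x, y in zip(o, r))  # add
                if m < new_metrics[ns]:                             # compare-select
                    new_metrics[ns] = m
                    new_paths[ns] = paths[s] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return paths[best]
```

As a usage example, encoding a message that ends in a few zero bits, flipping one coded bit near the middle, and passing the result to `viterbi_decode` recovers the original message. Storing a full path per state is the simplest form of survivor memory; hardware designs typically use more compact organizations.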