Elisardo Antelo
University of Santiago de Compostela
Publication
Featured research published by Elisardo Antelo.
symposium on computer arithmetic | 2007
Alvaro Vazquez; Elisardo Antelo; Paolo Montuschi
This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry-save multioperand addition that uses a novel BCD-4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We also present three schemes for fast and efficient generation of partial products in parallel. The recoding of the BCD-8421 multiplier operand into minimally redundant signed-digit radix-10, radix-4 and radix-5 representations using new recoders reduces the complexity of partial product generation. In addition, SD radix-4 and radix-5 recodings allow the reuse of a conventional parallel binary radix-4 multiplier to perform combined binary/decimal multiplications. Evaluation results show that the proposed architectures have interesting area-delay figures compared to conventional Booth radix-4 and radix-8 parallel binary multipliers and other representative alternatives for decimal multiplication.
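One reason a 4221 weighting suits carry-save addition is that every 4-bit pattern denotes a digit in 0..9, so plain binary full adders applied bit by bit never produce an invalid digit. The Python sketch below illustrates only that property. The helper names (`value_4221`, `encode_4221`, `csa_3_to_2`) are hypothetical, and the multiplication of the carry digit by 2 that a complete decimal adder needs is not modeled, so this should not be read as the paper's architecture.

```python
# Illustrative sketch only; names are hypothetical and not taken from the paper.
WEIGHTS_4221 = (4, 2, 2, 1)


def value_4221(bits):
    """Decimal value of a 4-bit tuple under the weights (4, 2, 2, 1)."""
    return sum(w * b for w, b in zip(WEIGHTS_4221, bits))


def encode_4221(digit):
    """Return one valid 4221 code for a digit in 0..9 (the coding is redundant)."""
    for n in range(16):
        bits = tuple((n >> shift) & 1 for shift in (3, 2, 1, 0))
        if value_4221(bits) == digit:
            return bits
    raise ValueError("digit must be in 0..9")


def csa_3_to_2(a, b, c):
    """Apply a binary full adder to each bit position of three 4221 digits.

    Returns (s, h) such that
    value(a) + value(b) + value(c) == value(s) + 2 * value(h).
    """
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h
```

Because the largest 4221 pattern is 4 + 2 + 2 + 1 = 9, both output digits stay in 0..9 for any inputs, which is what lets a reduction tree reuse binary 3:2 compressors.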
IEEE Transactions on Computers | 1997
Elisardo Antelo; Julio Villalba; Javier D. Bruguera; Emilio L. Zapata
Traditionally, CORDIC algorithms have employed radix-2 in the first n/2 microrotations (n is the precision in bits) in order to preserve a constant scale factor. The authors present a full radix-4 CORDIC algorithm in rotation mode and circular coordinates and its corresponding selection function, and propose an efficient technique for the compensation of the nonconstant scale factor. Three radix-4 CORDIC architectures are implemented: 1) a word serial architecture based on the zero skipping technique, 2) a pipelined architecture, and 3) an application specific architecture (the angles are known beforehand). The first two are general purpose implementations where redundant (carry-save) or nonredundant arithmetic can be used, whereas the last one is a simplification of the first two. The proposed architectures present a good trade-off between latency and hardware complexity when compared with existing CORDIC architectures.
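For context, a conventional radix-2 CORDIC rotation can be sketched as follows. This is the textbook radix-2 scheme, not the radix-4 algorithm of the paper: with digits restricted to {-1, +1} every microrotation is always performed, so the scale factor is a constant that can be divided out once. In a radix-4 scheme the digit set includes 0 and ±2, so the factor depends on the digits actually selected, which is why compensation becomes an issue. The function name `cordic_rotate` and the use of floating point are assumptions made for illustration.

```python
import math


def cordic_rotate(x, y, angle, n=32):
    """Rotate (x, y) by `angle` radians using n radix-2 CORDIC microrotations.

    Converges for |angle| up to about 1.74 rad. Floating point stands in for
    the fixed-point shifts and adds a hardware unit would use.
    """
    z = angle
    k = 1.0  # accumulated scale factor, identical for every input angle
    for i in range(n):
        sigma = 1.0 if z >= 0 else -1.0      # digit in {-1, +1}
        t = 2.0 ** -i                        # a right shift by i in hardware
        x, y = x - sigma * y * t, y + sigma * x * t
        z -= sigma * math.atan(t)            # table lookup in hardware
        k *= math.sqrt(1.0 + t * t)
    return x / k, y / k                      # compensate the constant scale


# Usage: rotating the unit vector approximates (cos(angle), sin(angle)).
c, s = cordic_rotate(1.0, 0.0, 0.5)
```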
IEEE Transactions on Computers | 2010
Alvaro Vazquez; Elisardo Antelo; Paolo Montuschi
The new generation of high-performance decimal floating-point units (DFUs) is demanding efficient implementations of parallel decimal multipliers. In this paper, we describe the architectures of two parallel decimal multipliers. The parallel generation of partial products is performed using signed-digit radix-10 or radix-5 recodings of the multiplier and a simplified set of multiplicand multiples. The reduction of partial products is implemented in a tree structure based on a decimal multioperand carry-save addition algorithm that uses unconventional (non-BCD) decimal-coded number systems. We further detail these techniques and present the new improvements to reduce the latency of the previous designs, which include: optimized digit recoders for the generation of 2n-tuples (and 5-tuples), decimal carry-save adders (CSAs) combining different decimal-coded operands, and carry-free adders implemented by specially designed bit counters. Moreover, we detail a design methodology that combines all these techniques to obtain efficient reduction trees with different area and delay trade-offs for any number of partial products generated. Evaluation results for 16-digit operands show that the proposed architectures have interesting area-delay figures compared to conventional Booth radix-4 and radix-8 parallel binary multipliers and outperform the figures of previous alternatives for decimal multiplication.
IEEE Transactions on Computers | 2005
Tomás Lang; Elisardo Antelo
Graphics processors require strong arithmetic support to perform computational kernels over data streams. Because of the current implementation using the basic arithmetic operations, the algorithms are given in algebraic terms. However, since the operations are really of a geometric nature, it seems to us that more flexibility in the implementation is obtained if the description is given in a high-level geometrical form. As a consequence of this line of thought, this paper is an attempt to reconsider some kernels in a graphics processor to obtain implementations that are potentially more scalable than just replicating the modules used in conventional implementations. We present the formulation of representative 3D computer graphics operations in terms of CORDIC-type primitives. Then, we briefly outline a stream processor based on CORDIC-type modules to efficiently implement these graphic operations. We perform a rough comparison with current implementations and conclude that the CORDIC-based alternative might be attractive.
international conference on application specific array processors | 1995
Julio Villalba; J. A. Hidalgo; E.L. Zapata; Elisardo Antelo; Javier D. Bruguera
The compensation of the scale factor imposes significant computation overhead on the CORDIC algorithm. In this paper we propose two algorithms and architectures that perform the compensation of the scale factor in parallel with the computation of the CORDIC iterations. This way it is not necessary to carry out the final multiplication or add scaling iterations in order to achieve the compensation. With the architectures we propose, the dependence on n of the compensation of the scale factor disappears, and this considerably reduces the latency of the system. The architectures developed are optimized solutions for the different operating modes of the CORDIC both in conventional and in redundant arithmetic.
IEEE Transactions on Computers | 2011
Fabrizio Lamberti; Nikolaos Andrikos; Elisardo Antelo; Paolo Montuschi
Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. This technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to any size square and m × n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results based on a rough theoretical analysis and on logic synthesis showed its efficiency in terms of both area and delay.
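As background, standard radix-4 modified Booth recoding turns an n-bit two's complement multiplier into n/2 signed digits in {-2, ..., 2}, one per partial product row. The Python sketch below shows only this arithmetic view, with hypothetical function names. In hardware, negating a multiple typically requires an extra correction bit, which is commonly what adds one more row to the array; neither that detail nor the paper's row-reduction technique is modeled here.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement integer y (n even) into n/2 digits.

    Digits lie in {-2, -1, 0, 1, 2} and are returned least significant
    first, so that y == sum(d * 4**i for i, d in enumerate(digits)).
    """
    if n % 2 or not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("need even n and y representable in n bits")
    pattern = y & ((1 << n) - 1)  # raw n-bit two's complement pattern

    def bit(k):
        # Bit k of the pattern; the position below bit 0 is defined as 0.
        return 0 if k < 0 else (pattern >> k) & 1

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]


def booth_multiply(x, y, n):
    """Multiply x by the n-bit value y; return (product, number of rows)."""
    rows = [d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y, n))]
    return sum(rows), len(rows)
```

Each row is a multiple of x from the set {-2x, -x, 0, x, 2x}, all obtainable with shifts and negation, which is why the recoding halves the number of rows relative to a bit-by-bit array.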
IEEE Transactions on Computers | 2005
Elisardo Antelo; Tomás Lang; Paolo Montuschi; Alberto Nannarelli
In this paper, we propose a class of division algorithms with the aim of reducing the delay of the selection of the quotient digit by introducing more concurrency and flexibility in its computation. From the proposed class of algorithms, we select one that moves part of the selection function out of the critical path, with a corresponding reduction in the critical path compared with existing alternatives: we present the algorithm and describe the architectures for radix 4 and for radix 16. For radix 16, we use the scheme of overlapping two radix-4 stages. In both cases, radix 4 and radix 16, we show that our algorithms allow the design of units with well-balanced critical paths with consequent decreases of the cycle times. Moreover, in the radix-16 case, we include some additional speculation techniques. To estimate the speedup, we used a rough timing model based on logical effort. For both radices, we estimate a speedup of about 25 percent with respect to previous implementations. In the radix-4 case, this is achieved by using roughly the same area, while, in the radix-16 case, the area is increased by about 30 percent. We verified our estimations by performing a synthesis of the radix-4 units.
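To make the role of quotient-digit selection concrete, here is a minimal Python sketch of a generic radix-4 digit recurrence with digit set {-2, ..., 2}. It is not the algorithm proposed in the paper: the digit is chosen by exact rounding of the full shifted residual, which is the slow full-precision step that hardware selection functions replace with short estimates. The function name, the operand ranges and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction


def radix4_divide(x, d, steps=16):
    """Approximate x / d with a radix-4 digit recurrence.

    Assumes 0 <= x < 1 and 1/2 <= d < 1. Returns (quotient, digits, residual).
    """
    x, d = Fraction(x), Fraction(d)
    if not (Fraction(1, 2) <= d < 1 and 0 <= x < 1):
        raise ValueError("expected 0 <= x < 1 and 1/2 <= d < 1")
    w = x / 4                  # initial residual, within (2/3) * d
    q = Fraction(0)
    digits = []
    for j in range(1, steps + 1):
        shifted = 4 * w
        # Exact selection: nearest integer to shifted / d, limited to -2..2.
        digit = max(-2, min(2, round(shifted / d)))
        w = shifted - digit * d            # residual recurrence
        q += Fraction(digit, 4 ** j)
        digits.append(digit)
    return 4 * q, digits, w                # undo the initial scaling by 1/4
```

The recurrence keeps the residual within (2/3)·d at every step, so after `steps` iterations the returned quotient differs from x / d by at most (8/3)·4^(-steps).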
IEEE Transactions on Computers | 1998
Elisardo Antelo; Tomás Lang; Javier D. Bruguera
A very-high radix digit-recurrence algorithm for the operation √(x/d) is developed, with residual scaling and digit selection by rounding. This is an extension of the division and square-root algorithms presented previously, and for which a combined unit was shown to provide a fast execution of these operations. The architecture of a combined unit to execute division, square-root, and √(x/d) is described, with inverse square-root as a special case. A comparison with the corresponding combined division and square-root unit shows a similar cycle time and an increase of one cycle for the extended operation with respect to square-root. To obtain an exactly rounded result for the extended operation a datapath of about 2n bits is needed. An alternative is proposed which requires approximately the same width as for square-root, but produces a result with an error of less than one ulp. The area increase with respect to the division and square root unit should be no greater than 15 percent. Consequently, whenever a very high radix unit for division and square-root seems suitable, it might be profitable to implement the extended unit instead.
international conference on computer design | 2007
Alvaro Vazquez; Elisardo Antelo; Paolo Montuschi
In this paper we present the algorithm and architecture of a radix-10 floating-point divider based on an SRT non-restoring digit-by-digit algorithm. The algorithm uses conventional techniques developed to speed up radix-2^k division such as signed-digit (SD) redundant quotient and digit selection by constant comparison using a carry-save estimate of the partial remainder. To optimize area and latency for decimal, we include novel features such as the use of alternative BCD codings to represent decimal operands, estimates by truncation at any binary position inside a decimal digit, a single customized fast carry propagate decimal adder for partial remainder computation, initial odd multiple generation and final normalization with rounding, and register placement to exploit advanced high fanin mux-latch circuits. The rough area-delay estimations performed show that the proposed divider has a similar latency but less hardware complexity (1.3 area ratio) than a recently published high performance digit-by-digit implementation.
signal processing systems | 2000
Elisardo Antelo; Tomás Lang; Javier D. Bruguera
A very-high radix algorithm and implementation for CORDIC rotation in circular and hyperbolic coordinates is presented. The selection function consists of rounding the residual. It is shown that this assures convergence from the second iteration on. For the first iteration, the selection is done by table, using a lower radix than for the remaining iterations. The compensation of the variable scale factor is done by computing the logarithm of the scale factor and performing the compensation by an exponential. Estimations of the delay for 32-bit and 64-bit precision show a substantial speed up when compared to low radix implementations. The proposed algorithm is also compared with previously proposed very-high radix ones, and significant advantages are identified.