Mário P. Véstias
INESC-ID
Publication
Featured research published by Mário P. Véstias.
field-programmable logic and applications | 2008
Horácio C. Neto; Mário P. Véstias
Decimal arithmetic has become a major necessity in computer arithmetic operations associated with human-centric applications, such as financial and commercial ones, because the results must match exactly those obtained by human calculations. The relevance of decimal arithmetic has become evident with the revision of the IEEE-754 standard to include decimal floating-point support. There are already a variety of IP cores available for implementing binary arithmetic accelerators in FPGAs. Thus far, however, little work has been done on implementing cores that work with decimal arithmetic. In this paper, we introduce a novel approach to the design of a decimal multiplier in FPGA using the embedded arithmetic blocks and a novel method for binary to BCD conversion. The proposed circuits were implemented in a Xilinx Virtex 4sx35ff877-12 FPGA. The results indicate that the proposed binary to BCD converter is more efficient than the traditional shift and add-3 algorithm and that the proposed decimal multiplier is very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.
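For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal Python sketch of the traditional shift and add-3 conversion (often called double dabble). It illustrates only the reference algorithm, not the novel converter proposed in the paper, whose details are not given here. The function name and the `num_bits` parameter are illustrative assumptions.

```python
def binary_to_bcd_shift_add3(value, num_bits):
    """Convert a non-negative integer to a list of BCD digits
    (most significant digit first) with the shift and add-3 method.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if value < 0 or value >= (1 << num_bits):
        raise ValueError("value does not fit in num_bits")

    # Number of decimal digits needed for the largest num_bits-bit value.
    num_digits = len(str((1 << num_bits) - 1))
    digits = [0] * num_digits  # digits[0] is the most significant digit

    # Process the input bits from most significant to least significant.
    for bit_index in range(num_bits - 1, -1, -1):
        # Add 3 to every digit that is 5 or more, so that the doubling
        # performed by the shift produces a correct decimal carry.
        digits = [d + 3 if d >= 5 else d for d in digits]

        # Shift the whole BCD register left by one bit, feeding in the
        # next input bit at the least significant position.
        carry = (value >> bit_index) & 1
        for i in range(num_digits - 1, -1, -1):
            shifted = (digits[i] << 1) | carry
            carry = shifted >> 4      # bit that moves into the next digit
            digits[i] = shifted & 0xF  # keep the 4-bit digit
    return digits
```

In hardware, each iteration corresponds to one layer of add-3 cells followed by a wired shift, so the number of layers grows with the input width; this is the cost the abstract compares against.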
2010 VI Southern Programmable Logic Conference (SPL) | 2010
Mário P. Véstias; Horácio C. Neto
Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic since the results must match exactly those obtained by human calculations. The IEEE-754 2008 standard for floating-point arithmetic has firmly recognized the importance of decimal for computer arithmetic. A number of hardware approaches have already been proposed for decimal arithmetic operations, including addition, subtraction, multiplication and division. However, few efforts have been made to develop decimal IP cores able to take advantage of the binary multipliers available in most reconfigurable computing architectures. In this paper, we analyze the tradeoffs involved in the design of a parallel decimal multiplier, for decimal operands with 8 and 16 digits, using existing coarse-grained embedded binary arithmetic blocks. The proposed circuits were implemented in a Xilinx Virtex 4 FPGA. The results indicate that the proposed parallel multipliers are very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.
symposium on integrated circuits and systems design | 2006
Mário P. Véstias; Horácio C. Neto
Complex Systems-on-Chip (SoC) with multiple interconnected stand-alone designs require high scalability and bandwidth. Network-on-Chip (NoC) is a scalable communication infrastructure able to tackle the communication needs of these SoCs. In this paper, we consider the optimization of a generic NoC to improve the area and performance of NoC-based architectures for dedicated applications. The generic NoC can be tailored to an application by changing the number of routers, by configuring each router to specific traffic requirements, and by choosing the set of links between routers and cores. The optimization algorithm determines the appropriate NoC and router configuration to support a set of applications, considering the optimization of area and performance. The final solution consists of a heterogeneous NoC with improved quality. The approach has been tested under different operating conditions assuming implementations on an FPGA.
digital systems design | 2009
Rui Policarpo Duarte; Horácio C. Neto; Mário P. Véstias
This work presents an architecture to compute matrix inversions in a reconfigurable digital system, benefiting from embedded processing elements present in FPGAs, and using double precision floating point representation. The main module of this system is the processing component for the Gauss-Jordan elimination. This component consists of other smaller arithmetic units, organized in pipeline. These units maintain the accuracy in the results without the need to internally normalize and de-normalize the floating-point data. The implementation of the operations takes advantage of the embedded processing elements available in the Virtex-5 FPGA. This implementation shows performance and resource consumption improvements when compared with “traditional” cascaded implementations of the floating point operators. Benchmarks are done with solutions implemented previously in FPGA and software, such as Matlab and Scilab.
Keywords: Matrix inversion; Pivoting; Gauss-Jordan; Floating-point; FPGA
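As a software reference for the Gauss-Jordan elimination that the abstract above maps to hardware, here is a minimal Python sketch of matrix inversion with partial pivoting in double precision. It shows only the textbook algorithm; the pipelined arithmetic units and the internal number format of the paper are not modelled. The function name and the singularity threshold are illustrative assumptions.

```python
def invert_gauss_jordan(matrix, tol=1e-12):
    """Invert a square matrix with Gauss-Jordan elimination and
    partial pivoting. Hypothetical helper; `tol` is an assumed
    threshold below which a pivot is treated as zero.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")

    # Build the augmented matrix [A | I].
    aug = [
        [float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]

    for col in range(n):
        # Partial pivoting: pick the row with the largest magnitude
        # in this column, at or below the diagonal.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < tol:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]

        # Normalize the pivot row so the pivot becomes 1.
        pivot = aug[col][col]
        aug[col] = [x / pivot for x in aug[col]]

        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]

    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]
```

The inner elimination loop is a stream of multiply-subtract operations on independent rows, which is the kind of regular workload that suits a pipeline of floating-point units.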
field programmable logic and applications | 2014
Mário P. Véstias; Horácio C. Neto
Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGA). General-Purpose Graphics Processing Units (GPGPU) and recent many-core CPUs have also taken advantage of the recent technological innovations in integrated circuit (IC) design and have also dramatically improved their peak performance. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performance, for particular applications, show that FPGAs are falling further behind GPUs and many-core CPUs, which moves them away from high-performance computing with intensive floating-point calculations. FPGAs become competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and for parallel map-reduce problems.
southern conference programmable logic | 2011
Mário P. Véstias; Horácio C. Neto
The IEEE-754 2008 standard for floating-point arithmetic has firmly established the importance of decimal arithmetic. Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic since the results must match exactly those obtained by human calculations. A few hardware approaches have been proposed for decimal arithmetic, including addition, subtraction, multiplication and division. Parallel implementations of these operations are very expensive in terms of occupied resources, and therefore implementations based on iterative algorithms are good alternatives. In this paper, we propose an iterative decimal multiplier for FPGA that uses binary arithmetic. The circuits were implemented in a Xilinx Virtex 4 FPGA. The results indicate that the proposed iterative multipliers are very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.
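To make the idea of a decimal multiplier built on binary arithmetic more concrete, the sketch below shows one plausible iterative scheme in Python: the multiplicand is converted to binary once, one multiplier digit is consumed per iteration with a binary multiply-accumulate, and the result is converted back to BCD. This is an assumption made for illustration only; the abstract above does not describe the actual datapath, and all function names are hypothetical.

```python
def bcd_to_int(digits):
    """Convert BCD digits (most significant first) to a binary integer."""
    value = 0
    for d in digits:
        if not 0 <= d <= 9:
            raise ValueError("invalid BCD digit")
        value = value * 10 + d
    return value


def int_to_bcd(value, num_digits):
    """Convert a binary integer to num_digits BCD digits (MSD first)."""
    digits = [0] * num_digits
    for i in range(num_digits - 1, -1, -1):
        value, digits[i] = divmod(value, 10)
    if value:
        raise OverflowError("value does not fit in num_digits")
    return digits


def iterative_decimal_multiply(a_digits, b_digits):
    """Multiply two BCD operands, one multiplier digit per iteration.

    Assumed scheme: each step performs acc = acc * 10 + a * digit in
    binary, which a small binary multiplier and adder could implement.
    """
    a = bcd_to_int(a_digits)   # multiplicand converted to binary once
    acc = 0
    for d in b_digits:         # most significant multiplier digit first
        bcd_to_int([d])        # validates the digit range
        acc = acc * 10 + a * d
    return int_to_bcd(acc, len(a_digits) + len(b_digits))
```

An iterative design of this kind reuses one small multiplier over several cycles, trading latency for the area savings that the abstract contrasts with parallel implementations.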
field-programmable logic and applications | 2007
Mário P. Véstias; Horácio C. Neto
A fundamental objective in the design of a network-on-chip is to minimize its area and power consumption while keeping the performance requirements at acceptable levels. The trade-offs involved in the process depend on the target technology, ASIC or FPGA. This paper presents a novel design approach to customize the routers in a network-on-chip for reconfigurable systems. More specifically, given a topology and the traffic requirements, the design process automatically finds the architecture of each router, adjusting the size of the buffers and the configuration of the switch matrix, such that the overall area and performance are optimized. The results indicate that the proposed algorithm can provide significantly better solutions than the uniform router design that is typically used.
digital systems design | 2012
Mário P. Véstias; Horácio C. Neto; Helena Sarmento
The Viterbi algorithm is one of the most popular algorithms for decoding convolutional codes. Implementing a high-speed Viterbi decoder is a challenging task due to the recursive iteration of an add-compare-select operation. In this paper, we propose and analyze several optimization techniques to improve the area/performance tradeoffs of high-speed Viterbi decoders on Virtex-6 FPGAs. Radix-2, radix-4 and modified radix-4 add-compare-select units are implemented with these techniques. The implementation reports for a Virtex-6 FPGA indicate that the proposed techniques achieve very efficient designs of Viterbi decoders in terms of performance and area. Radix-2 solutions achieve 360 Mbps, while radix-4 solutions achieve 430 Mbps, better than previous state-of-the-art solutions. Higher data rates can only be achieved with other parallelization techniques, like the sliding block method.
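The add-compare-select (ACS) recursion named in the abstract above can be sketched in Python as a single radix-2 trellis step. This is a generic textbook formulation, not the optimized units of the paper; the function name and the data layout (`predecessors`, `branch_metrics`) are illustrative assumptions.

```python
def acs_radix2(path_metrics, branch_metrics, predecessors):
    """Perform one radix-2 add-compare-select step over all states.

    path_metrics[s]    : accumulated metric of state s before this step
    predecessors[s]    : (p0, p1), the two states that lead into state s
    branch_metrics[s]  : (bm0, bm1), metrics of the branches p0->s, p1->s
    Returns the new path metrics and, per state, which predecessor
    (0 or 1) survived. Lower metrics are treated as better.
    """
    new_metrics = []
    decisions = []
    for state, (p0, p1) in enumerate(predecessors):
        bm0, bm1 = branch_metrics[state]
        # Add: extend both candidate paths by their branch metric.
        cand0 = path_metrics[p0] + bm0
        cand1 = path_metrics[p1] + bm1
        # Compare and select: keep the candidate with the smaller metric.
        if cand0 <= cand1:
            new_metrics.append(cand0)
            decisions.append(0)
        else:
            new_metrics.append(cand1)
            decisions.append(1)
    return new_metrics, decisions
```

Because each call needs the metrics produced by the previous call, the loop cannot simply be pipelined across trellis steps, which is the recursion bottleneck the abstract refers to. A radix-4 unit processes two trellis steps at once by comparing four candidates per state.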
digital systems design | 2009
Victor Silva; Luis B. Oliveira; Jorge R. Fernandes; Mário P. Véstias; Horácio C. Neto
This paper presents the implementation of a coarse-grained Magnetic RAM based Reconfigurable Array. The Reconfigurable Array architecture is organized as a one-dimensional array of programmable ALUs, with the configuration bits stored in magnetic random-access memories. The use of MRAM technology to implement run-time reconfigurable hardware devices is a very promising technological solution because MRAM can provide non-volatility with cell areas and access speeds comparable to those of SRAM, and with lower process complexity than flash memory. This type of coarse-grained array, where each reconfigurable element computes on 4-bit or larger input words, is more suitable to execute data-oriented algorithms and is more able to exploit larger amounts of operation-level parallelism than common fine-grained architectures. By substantially reducing the overhead for configurability, this coarse-grain architecture is also more apt to efficiently exploit run-time reconfiguration and therefore to take advantage of multi-context MRAM-based configuration memories.
Keywords: reconfigurable array; MRAM; programmable fabrics
asia and south pacific design automation conference | 2006
Mário P. Véstias; Horácio C. Neto
The constant increase in gate capacity and performance of configurable hardware chips has made it possible to implement systems-on-chip (SoC) able to tackle the demanding requirements of many embedded systems. In this paper, we propose an approach to the design space exploration of a configurable SoC (CSoC) platform based on a network-on-chip (NoC) architecture for the execution of dataflow-dominated embedded systems. The approach has been validated with the design of a color image compression algorithm in an FPGA.