Horácio C. Neto
Instituto Superior Técnico
Publications
Featured research published by Horácio C. Neto.
IEEE Design & Test of Computers | 2003
João M. P. Cardoso; Horácio C. Neto
This paper provides techniques for compiling software programs into reconfigurable hardware that achieve faster and more efficient implementations than the complex resource-sharing approaches typical of high-level synthesis systems. The Java-based compiler presented here uses intermediate graph representations to expose parallelism at several levels.
IFIP Working Conference on Distributed and Parallel Embedded Systems | 2008
Rui Marcelino; Horácio C. Neto; João M. P. Cardoso
Sorting is an important operation for a number of embedded applications. As sorting large datasets may impose undesired performance degradation, acceleration units coupled to the embedded processor can be an interesting solution for speeding up the computations. This paper presents and evaluates three hardware sorting units, bearing in mind embedded computing systems implemented with FPGAs. The proposed architectures take advantage of specific FPGA hardware resources to increase efficiency. Experimental results show the differences in resources and performance among the three proposed sorting units, and between the sorting units and pure software implementations. We show that a hybrid of an insertion sorting unit and a merge FIFO sorting unit provides a speed-up between 1.6 and 25 compared to a quicksort software implementation.
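As a rough illustration of the insertion-sort behaviour such a unit builds on, the C sketch below models a fixed-depth unit that keeps its buffer sorted as elements stream in. The depth, data type and keep-smallest policy are assumptions for the sketch; the paper's actual hardware (and the merge FIFO unit) is not reproduced here.

```c
/* Minimal software model of a streaming insertion-sort unit.
 * Hypothetical sketch only: the hardware described in the paper is not
 * reproduced; this merely illustrates keeping a sorted window while
 * elements arrive one at a time. */
#include <stdio.h>

#define UNIT_DEPTH 8            /* assumed depth of the sorting unit */

static int buf[UNIT_DEPTH];     /* sorted contents, smallest first   */
static int count = 0;           /* number of valid entries           */

/* Insert one incoming element, shifting larger entries one slot up
 * (roughly what a hardware unit could do in parallel, one comparator
 * per slot). When full, the unit keeps only the smallest elements. */
static void unit_push(int value)
{
    int i;
    if (count < UNIT_DEPTH) {
        i = count++;
    } else if (value < buf[UNIT_DEPTH - 1]) {
        i = UNIT_DEPTH - 1;          /* evict the current largest */
    } else {
        return;                      /* full and not smaller: drop */
    }
    while (i > 0 && buf[i - 1] > value) {
        buf[i] = buf[i - 1];
        i--;
    }
    buf[i] = value;
}

int main(void)
{
    int input[UNIT_DEPTH] = {42, 7, 19, 3, 88, 23, 5, 61};
    for (int i = 0; i < UNIT_DEPTH; i++)
        unit_push(input[i]);
    for (int i = 0; i < count; i++)
        printf("%d ", buf[i]);       /* prints the sorted window */
    printf("\n");
    return 0;
}
```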
Field-Programmable Logic and Applications | 2008
Horácio C. Neto; Mário P. Véstias
Decimal arithmetic has become a major necessity in computer arithmetic operations associated with human-centric applications, such as financial and commercial ones, because the results must match exactly those obtained by human calculations. The relevance of decimal arithmetic has become evident with the revision of the IEEE-754 standard to include decimal floating-point support. A variety of IP cores is already available for implementing binary arithmetic accelerators in FPGAs. Thus far, however, little work has been done on cores for decimal arithmetic. In this paper, we introduce a novel approach to the design of a decimal multiplier in FPGA using the embedded arithmetic blocks, together with a novel method for binary to BCD conversion. The proposed circuits were implemented in a Xilinx Virtex 4sx35ff877-12 FPGA. The results indicate that the proposed binary to BCD converter is more efficient than the traditional shift and add-3 algorithm and that the proposed decimal multiplier is very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.
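For reference, the traditional shift-and-add-3 ("double dabble") conversion that the abstract uses as its baseline can be sketched in C as follows. The 16-bit input width and 5-digit output are illustrative choices; the paper's own, more efficient converter is not shown.

```c
/* Classic shift-and-add-3 ("double dabble") binary-to-BCD conversion,
 * the baseline algorithm mentioned in the abstract. Widths are
 * illustrative: a 16-bit binary input converted to 5 BCD digits. */
#include <stdio.h>
#include <stdint.h>

#define NDIGITS 5

static void bin_to_bcd(uint16_t bin, uint8_t bcd[NDIGITS])
{
    for (int d = 0; d < NDIGITS; d++)
        bcd[d] = 0;

    for (int bit = 15; bit >= 0; bit--) {
        /* Add 3 to every BCD digit that is 5 or more before shifting,
         * so the following shift carries correctly into the next digit. */
        for (int d = 0; d < NDIGITS; d++)
            if (bcd[d] >= 5)
                bcd[d] += 3;

        /* Shift the whole BCD register left by one bit and
         * bring in the next binary bit. */
        for (int d = NDIGITS - 1; d > 0; d--)
            bcd[d] = ((bcd[d] << 1) | (bcd[d - 1] >> 3)) & 0xF;
        bcd[0] = ((bcd[0] << 1) | ((bin >> bit) & 1)) & 0xF;
    }
}

int main(void)
{
    uint8_t bcd[NDIGITS];
    bin_to_bcd(12345, bcd);
    for (int d = NDIGITS - 1; d >= 0; d--)
        printf("%u", bcd[d]);        /* prints 12345 */
    printf("\n");
    return 0;
}
```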
VI Southern Programmable Logic Conference (SPL) | 2010
Mário P. Véstias; Horácio C. Neto
Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic since the results must match exactly those obtained by human calculations. The IEEE-754 2008 standard for floating-point arithmetic has definitively recognized the importance of decimal for computer arithmetic. A number of hardware approaches have already been proposed for decimal arithmetic operations, including addition, subtraction, multiplication and division. However, little effort has been devoted to decimal IP cores that take advantage of the binary multipliers available in most reconfigurable computing architectures. In this paper, we analyze the tradeoffs involved in the design of a parallel decimal multiplier, for decimal operands with 8 and 16 digits, using the existing coarse-grained embedded binary arithmetic blocks. The proposed circuits were implemented in a Xilinx Virtex 4 FPGA. The results indicate that the proposed parallel multipliers are very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.
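A minimal software model of the general idea of reusing binary multipliers for decimal arithmetic (not the paper's architecture) is shown below: each 8-digit operand is split into two 4-digit halves, each below 10^4 and therefore small enough for an embedded multiplier block, and the four binary partial products are recombined with decimal weights. The BCD input/output conversions are omitted here.

```c
/* Software model of mapping an 8-digit decimal multiplication onto
 * four small binary multipliers. This is a simplified illustration
 * only; the paper's parallel architecture is not reproduced, and the
 * BCD-to-binary / binary-to-BCD conversions are left out. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t a = 12345678;                 /* 8-digit decimal operand */
    uint64_t b = 87654321;                 /* 8-digit decimal operand */

    /* Split each operand into a high and a low 4-digit half. */
    uint32_t ah = (uint32_t)(a / 10000), al = (uint32_t)(a % 10000);
    uint32_t bh = (uint32_t)(b / 10000), bl = (uint32_t)(b % 10000);

    /* Four binary partial products, each small enough for one
     * embedded multiplier block (operands fit in 14 bits). */
    uint64_t hh = (uint64_t)ah * bh;
    uint64_t hl = (uint64_t)ah * bl;
    uint64_t lh = (uint64_t)al * bh;
    uint64_t ll = (uint64_t)al * bl;

    /* Recombine by powers of ten (decimal weights of the halves). */
    uint64_t product = hh * 100000000ULL
                     + (hl + lh) * 10000ULL
                     + ll;

    printf("%llu * %llu = %llu\n",
           (unsigned long long)a, (unsigned long long)b,
           (unsigned long long)product);
    return 0;
}
```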
Great Lakes Symposium on VLSI | 1999
Paulo F. Flores; Horácio C. Neto; João Marques-Silva
Test set compaction is a fundamental problem in digital system testing. In recent years, many competitive solutions have been proposed, most of them based on heuristic approaches. This paper studies the application of set covering models to the compaction of test sets, which can be used with any heuristic test set compaction procedure. For this purpose, recent and highly effective set covering algorithms are used. Experimental evidence suggests that the size of computed test sets can often be reduced by using set covering models and algorithms. Moreover, a noteworthy empirical conclusion is that it may be preferable not to use fault simulation when the final objective is test set compaction.
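To make the set covering formulation concrete, the sketch below covers a made-up test-versus-fault matrix with a simple greedy heuristic; the paper itself applies dedicated (exact) set covering algorithms, so this is only an illustration of the model, not of the paper's method.

```c
/* Illustration of the set covering model for test set compaction:
 * rows are candidate test vectors, columns are the faults they detect.
 * A greedy cover is used purely to show the model; the coverage
 * matrix below is made up. */
#include <stdio.h>
#include <stdbool.h>

#define NTESTS  5
#define NFAULTS 8

/* covers[t][f] == true when test t detects fault f (example data). */
static const bool covers[NTESTS][NFAULTS] = {
    {1,1,0,0,1,0,0,0},
    {0,1,1,0,0,1,0,0},
    {0,0,0,1,0,0,1,1},
    {1,0,1,0,1,0,1,0},
    {0,0,0,1,0,1,0,1},
};

int main(void)
{
    bool covered[NFAULTS] = {false};
    bool selected[NTESTS] = {false};
    int remaining = NFAULTS;

    while (remaining > 0) {
        /* Pick the unselected test that detects most uncovered faults. */
        int best = -1, best_gain = 0;
        for (int t = 0; t < NTESTS; t++) {
            if (selected[t]) continue;
            int gain = 0;
            for (int f = 0; f < NFAULTS; f++)
                if (covers[t][f] && !covered[f])
                    gain++;
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best < 0) break;           /* some faults are undetectable */

        selected[best] = true;
        for (int f = 0; f < NFAULTS; f++)
            if (covers[best][f] && !covered[f]) {
                covered[f] = true;
                remaining--;
            }
        printf("selected test %d (new faults covered: %d)\n",
               best, best_gain);
    }
    return 0;
}
```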
IEEE Transactions on Very Large Scale Integration Systems | 1999
João M. P. Cardoso; Horácio C. Neto
This paper presents a novel algorithm for temporal partitioning of graphs representing a behavioral description. The algorithm is based on an extension of traditional static-list scheduling that tailors it to resolve both scheduling and temporal partitioning. The nodes to be mapped into a partition are selected based on a statically computed cost model. The cost for each node integrates communication effects, the critical path length, and the possibility that the critical path hides the delay of parallel nodes. To reduce runtime, the costs are not updated dynamically. The algorithm is compared with other schedulers and with close-to-optimum results obtained using a simulated annealing approach. The presented algorithm has been implemented and the results show that it is robust, effective, and efficient, and that compared to other methods it finds very good results in small amounts of CPU time.
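A much-simplified sketch of list-based temporal partitioning is given below: nodes of a DAG are ordered by a statically computed priority (here only the longest path to a sink) and packed into partitions under an area budget. The DAG, the area figures and the budget are made up, and the paper's full cost model (communication effects, delay hiding) is not reproduced.

```c
/* Minimal sketch of list-based temporal partitioning. Priorities are
 * computed once, statically, and never updated, mirroring the idea of
 * a static cost model. All numbers are illustrative. */
#include <stdio.h>

#define N 6
#define AREA_BUDGET 10

static const int area[N] = {4, 3, 5, 2, 4, 3};       /* per-node area  */
static const int succ[N][N] = {                      /* adjacency:     */
    /* 0 -> 2,3 ; 1 -> 3 ; 2 -> 4 ; 3 -> 4,5 */
    {0,0,1,1,0,0}, {0,0,0,1,0,0}, {0,0,0,0,1,0},
    {0,0,0,0,1,1}, {0,0,0,0,0,0}, {0,0,0,0,0,0},
};

static int prio[N];          /* static priority: longest path to a sink */

static int longest_path(int v)
{
    if (prio[v] >= 0) return prio[v];
    int best = 0;
    for (int w = 0; w < N; w++)
        if (succ[v][w]) {
            int p = 1 + longest_path(w);
            if (p > best) best = p;
        }
    return prio[v] = best;
}

int main(void)
{
    int assigned[N] = {0};
    for (int v = 0; v < N; v++) prio[v] = -1;
    for (int v = 0; v < N; v++) longest_path(v);

    int partition = 0, used = 0, placed = 0;
    while (placed < N) {
        /* Among ready nodes (all predecessors already placed), pick
         * the highest-priority one that still fits the area budget. */
        int best = -1;
        for (int v = 0; v < N; v++) {
            if (assigned[v]) continue;
            int ready = 1;
            for (int u = 0; u < N; u++)
                if (succ[u][v] && !assigned[u]) ready = 0;
            if (ready && used + area[v] <= AREA_BUDGET &&
                (best < 0 || prio[v] > prio[best]))
                best = v;
        }
        if (best < 0) {              /* nothing fits: open a new partition */
            partition++; used = 0; continue;
        }
        assigned[best] = 1; used += area[best]; placed++;
        printf("node %d -> partition %d (priority %d)\n",
               best, partition, prio[best]);
    }
    return 0;
}
```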
Symposium on Integrated Circuits and Systems Design | 2006
Mário P. Véstias; Horácio C. Neto
Complex Systems-on-Chip (SoC) with multiple interconnected stand-alone designs require high scalability and bandwidth. A Network-on-Chip (NoC) is a scalable communication infrastructure able to meet the communication needs of such SoCs. In this paper, we consider the optimization of a generic NoC to improve the area and performance of NoC-based architectures for dedicated applications. The generic NoC can be tailored to an application by changing the number of routers, by configuring each router for its specific traffic requirements, and by choosing the set of links between routers and cores. The optimization algorithm determines the appropriate NoC and router configurations to support a set of applications while optimizing area and performance. The final solution is a heterogeneous NoC with improved quality. The approach has been tested under different operating conditions assuming implementation on an FPGA.
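As a toy illustration of this kind of per-router tailoring, the sketch below selects a buffer depth for each router by minimizing a weighted area/latency cost under an assumed traffic load. Every model and number in it is made up; the paper's actual optimization algorithm and cost functions are not shown.

```c
/* Toy design-space exploration in the spirit of tailoring a generic
 * NoC: for each router, choose a buffer depth that minimizes a
 * weighted area/latency cost for that router's traffic load.
 * All models and numbers are invented for illustration. */
#include <stdio.h>

#define NROUTERS 4

int main(void)
{
    /* Assumed per-router traffic load (fraction of link capacity). */
    const double load[NROUTERS] = {0.2, 0.7, 0.5, 0.9};
    const int depths[] = {2, 4, 8, 16};        /* candidate buffer depths */
    const double w_area = 1.0, w_lat = 4.0;    /* assumed cost weights    */

    for (int r = 0; r < NROUTERS; r++) {
        int best_depth = depths[0];
        double best_cost = 1e30;
        for (int i = 0; i < 4; i++) {
            double area_cost = depths[i];                       /* grows with depth */
            double latency = 1.0 + load[r] * (16.0 / depths[i]); /* toy model        */
            double cost = w_area * area_cost + w_lat * latency;
            if (cost < best_cost) { best_cost = cost; best_depth = depths[i]; }
        }
        printf("router %d: load %.1f -> buffer depth %d\n",
               r, load[r], best_depth);
    }
    return 0;
}
```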
Digital Systems Design | 2009
Rui Policarpo Duarte; Horácio C. Neto; Mário P. Véstias
This work presents an architecture to compute matrix inversions in a reconfigurable digital system, benefiting from the embedded processing elements present in FPGAs and using double-precision floating-point representation. The main module of this system is the processing component for the Gauss-Jordan elimination. This component consists of smaller arithmetic units organized in a pipeline. These units maintain the accuracy of the results without the need to internally normalize and de-normalize the floating-point data. The implementation of the operations takes advantage of the embedded processing elements available in the Virtex-5 FPGA. This implementation shows performance and resource-consumption improvements compared with “traditional” cascaded implementations of the floating-point operators. Benchmarks are made against previous FPGA and software solutions, such as Matlab and Scilab.
Keywords: Matrix inversion; Pivoting; Gauss-Jordan; Floating-point; FPGA
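For reference, a plain double-precision software version of Gauss-Jordan inversion with partial pivoting, the operation the pipeline implements, is sketched below; the pipelined hardware organization itself is not reproduced, and the 3x3 test matrix is arbitrary.

```c
/* Double-precision Gauss-Jordan matrix inversion with partial
 * pivoting, as a plain software reference for the operation the
 * hardware pipeline implements. */
#include <stdio.h>
#include <math.h>

#define N 3

/* Invert A (destroyed in the process) into inv.
 * Returns 0 on success, -1 if A is (near-)singular. */
static int gauss_jordan_inverse(double a[N][N], double inv[N][N])
{
    /* Start with inv as the identity; every row operation applied to A
     * is mirrored on inv. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            inv[i][j] = (i == j) ? 1.0 : 0.0;

    for (int col = 0; col < N; col++) {
        /* Partial pivoting: bring the largest-magnitude entry up. */
        int piv = col;
        for (int r = col + 1; r < N; r++)
            if (fabs(a[r][col]) > fabs(a[piv][col]))
                piv = r;
        if (fabs(a[piv][col]) < 1e-12)
            return -1;
        for (int j = 0; j < N; j++) {
            double t = a[col][j]; a[col][j] = a[piv][j]; a[piv][j] = t;
            t = inv[col][j]; inv[col][j] = inv[piv][j]; inv[piv][j] = t;
        }

        /* Normalize the pivot row, then eliminate the column in every
         * other row (above and below: the "Jordan" part). */
        double p = a[col][col];
        for (int j = 0; j < N; j++) { a[col][j] /= p; inv[col][j] /= p; }
        for (int r = 0; r < N; r++) {
            if (r == col) continue;
            double f = a[r][col];
            for (int j = 0; j < N; j++) {
                a[r][j]   -= f * a[col][j];
                inv[r][j] -= f * inv[col][j];
            }
        }
    }
    return 0;
}

int main(void)
{
    double a[N][N] = {{4, 7, 2}, {3, 6, 1}, {2, 5, 3}};
    double inv[N][N];
    if (gauss_jordan_inverse(a, inv) != 0) {
        printf("matrix is singular\n");
        return 1;
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%8.4f ", inv[i][j]);
        printf("\n");
    }
    return 0;
}
```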
Field-Programmable Logic and Applications | 2006
G.M. de Matos; Horácio C. Neto
This paper presents an analysis of the performance of the Gauss-Jordan matrix inversion algorithm on reconfigurable hardware platforms. The results show that currently available reconfigurable computing technology can already achieve significantly higher floating-point performance than CPUs for large matrix operations. For common reconfigurable systems, where the FPGAs are directly coupled to the on-board memory, the achievable performance scales directly with the number of realizable simultaneous memory accesses. A dedicated reconfigware architecture has been implemented and tested on a standard commercial reconfigurable platform. Even though the platform used had limited data bandwidth between the FPGA and the on-board memory, speed improvements of more than 3× have been observed, in comparison with software solutions running on high-end CPUs, for the inversion of matrices of order up to 1700.
Field-Programmable Logic and Applications | 2014
Mário P. Véstias; Horácio C. Neto
Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGA). General-Purpose Graphics Processing Units (GPGPU) and recent many-core CPUs have also taken advantage of recent technological innovations in integrated circuit (IC) design and have dramatically improved their peak performance. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms on the execution of algorithms from different scientific application domains. Trends in peak performance, power consumption and sustained performance for particular applications show that the gap between FPGAs and GPUs or many-core CPUs is widening, moving FPGAs away from high-performance computing with intensive floating-point calculations. FPGAs remain competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and for parallel map-reduce problems.