Rui Policarpo Duarte | Researchain

Publication

Featured researches published by Rui Policarpo Duarte.

digital systems design | 2009

Double-precision Gauss-Jordan Algorithm with Partial Pivoting on FPGAs

Rui Policarpo Duarte; Horácio C. Neto; Mário P. Véstias

This work presents an architecture for computing matrix inversions in a reconfigurable digital system, benefiting from the embedded processing elements present in FPGAs and using double-precision floating-point representation. The main module of this system is the processing component for Gauss-Jordan elimination. This component consists of smaller arithmetic units organized in a pipeline. These units maintain the accuracy of the results without the need to internally normalize and de-normalize the floating-point data. The implementation of the operations takes advantage of the embedded processing elements available in the Virtex-5 FPGA. This implementation shows performance and resource consumption improvements when compared with “traditional” cascaded implementations of the floating-point operators. The architecture is benchmarked against solutions previously implemented on FPGAs and in software, such as Matlab and Scilab.

Keywords: Matrix inversion; Pivoting; Gauss-Jordan; Floating-point; FPGA
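
As a software reference point only, Gauss-Jordan elimination with partial pivoting can be sketched in a few lines of Python. This is a minimal illustration of the textbook algorithm, not the pipelined FPGA architecture described in the abstract; the function name `gauss_jordan_inverse` is hypothetical, and Python floats stand in for the double-precision datapath.

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination
    with partial pivoting. Illustrative software sketch only."""
    n = len(a)
    # Build the augmented matrix [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Partial pivoting: choose the row with the largest magnitude
        # in the current column, at or below the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot element becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the current column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]
```

Without the row swap, a matrix such as `[[0, 1], [1, 0]]` would fail on the first column even though it is invertible, which is the situation pivoting is meant to handle.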

field programmable logic and applications | 2012

High-level linear projection circuit design optimization framework for FPGAs under over-clocking

Rui Policarpo Duarte; Christos-Savvas Bouganis

Frequently, the high-level algorithm parameter selection and its mapping into hardware are considered to be independent processes, often leading to suboptimal solutions. When DSP applications with real-time constraints are targeted, it is often desirable for the resulting hardware system to be clocked at as high a frequency as possible. Even though the trend in modern devices is to provide a fabric that can support higher frequencies, its variability makes the design tools pessimistic about maximum clock frequency estimates. The proposed framework optimizes the circuit and mitigates its probabilistic behaviour by exposing the impact of fabric variability to the high-level algorithmic specification. FPGAs are well positioned to tackle this problem because they can be reconfigured, allowing an off-line characterization of the specific device before implementing the complete optimized circuit on the same device. Circuits generated by the proposed framework outperform typical implementations by minimizing area and errors while maximizing operating clock frequency. An example of a linear projection circuit, over-clocked by 232%, shows savings of up to 39% in hardware resources for the same target PSNR compared with a traditional implementation.
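
The abstract uses PSNR as its quality target. For readers unfamiliar with the metric, the standard definition can be sketched as follows; the function name `psnr` and the default 8-bit peak value of 255 are assumptions made for illustration, not details taken from the paper.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample
    sequences. `peak` is the maximum possible sample value (assumed 8-bit)."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the test signal.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR means the output of the over-clocked circuit is closer to the error-free reference.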

international parallel and distributed processing symposium | 2014

Over-clocking of Linear Projection Designs through Device Specific Optimisations

Rui Policarpo Duarte; Christos-Savvas Bouganis

Frequently, applications such as image and video processing rely on implementations of the Linear Projection algorithm with high throughput and low latency requirements. This work presents a framework to optimise Linear Projection designs that outperform typical design implementations via a pre-characterisation of over-clocked arithmetic units. It is well known that the delay models used by synthesis tools are generic and tuned for the worst-case performance of a given fabrication process. Hence, they impose a heavy penalty on the maximum performance offered by the fabrication process. The proposed optimisation framework focuses on the optimisation of the generic multipliers, as they are the arithmetic operators with the most critical paths in the data path of a linear projection design, by performing a performance characterisation step on the target device. Experiments demonstrate that the proposed framework is able to generate Linear Projection designs that achieve higher throughput (up to 1.85 times) while producing fewer errors than typical implementation methodologies.
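
To make the role of the multipliers concrete, a linear projection can be viewed as a set of dot products between the input vector and each basis vector. The sketch below assumes that formulation; `linear_projection` and its argument names are hypothetical, and it models only the arithmetic, not the timing behaviour of an over-clocked circuit.

```python
def linear_projection(basis, x):
    """Project input vector `x` onto each row of `basis`.

    Every output coefficient is one multiply-accumulate chain, which is
    why the multipliers dominate the data path of such a design."""
    if any(len(row) != len(x) for row in basis):
        raise ValueError("each basis vector must match the input length")
    return [sum(w * v for w, v in zip(row, x)) for row in basis]
```

For example, projecting a three-element input onto two basis vectors costs six multiplications and yields two coefficients.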

applied reconfigurable computing | 2014

A Unified Framework for Over-Clocking Linear Projections on FPGAs under PVT Variation

Rui Policarpo Duarte; Christos-Savvas Bouganis

Linear Projection is a widely used algorithm often implemented with high throughput requirements. This work presents a novel methodology to optimise Linear Projection designs that outperform typical design methodologies through a prior characterisation of the arithmetic units in the data path of the circuit under various operating conditions. Limited by the ever increasing process variation, the delay models available in synthesis tools are no longer suitable for performance optimisation of designs, as they are generic and only take into account the worst case variation for a given fabrication process. Hence, they heavily penalise the optimisation strategy of a design by leaving a gap in performance. This work presents a novel unified optimisation framework which contemplates a prior characterisation of the embedded multipliers on the target device under PVT variation. The proposed framework creates designs that achieve high throughput while producing fewer errors than typical methodologies. The results of a case study reveal that the proposed methodology outperforms the typical implementation in three real-life design strategies: high performance, low power and temperature variation. The proposed methodology produced Linear Projection designs that were able to perform up to 18 dB better than the reference methodology.

field programmable logic and applications | 2015

Enhancing stochastic computations via process variation

Rui Policarpo Duarte; Mário P. Véstias; Horácio C. Neto

Stochastic computing has emerged as a computational paradigm that offers arithmetic operators that are high-performance, compact and robust to errors, by producing approximate results. This work addresses two of the major limitations of its implementation that affect its accuracy: the correlation between stochastic bitstreams and the unobserved signal transitions. A novel implementation of stochastic arithmetic building blocks is proposed to improve the quality of the results. It relies on Self-Timed Ring-Oscillators to produce different clock signals with different clock frequencies, by taking advantage of the influence of process variation on the timing of the logic elements of the FPGA. This work also presents an automated test platform for stochastic systems, which was used to evaluate the impact of the proposed enhancements. Tests were performed to compare the proposed and typical implementations on reconfigurable devices with 28 nm and 60 nm fabrication processes. Finally, the presented results demonstrate that the proposed architectures, subjected to the impact of process variation, improve the quality of the results.
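
The effect of bitstream correlation mentioned above can be illustrated with a small software simulation. In the usual unipolar encoding, a value in [0, 1] is the probability of a 1 in the stream, and a single AND gate multiplies two values only if the streams are uncorrelated. The sketch below assumes that encoding; the helper names, the seeds and the stream length are illustrative choices, and Python's pseudo-random generator stands in for the hardware sources discussed in the paper.

```python
import random


def bitstream(p, length, rng):
    """Unipolar stochastic bitstream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(x_bits, y_bits):
    """Stochastic multiplication: bitwise AND, then estimate the value
    as the fraction of ones in the output stream."""
    return sum(a & b for a, b in zip(x_bits, y_bits)) / len(x_bits)


def demo(p=0.5, q=0.4, length=100_000, seed=1):
    # Independent streams: separate generators with different seeds.
    independent = sc_multiply(
        bitstream(p, length, random.Random(seed)),
        bitstream(q, length, random.Random(seed + 1)),
    )
    # Fully correlated streams: both derived from the same random sequence.
    correlated = sc_multiply(
        bitstream(p, length, random.Random(seed)),
        bitstream(q, length, random.Random(seed)),
    )
    return independent, correlated
```

With independent streams the estimate approaches the product `p * q`, whereas with fully correlated streams the AND gate computes roughly `min(p, q)` instead, which is why decorrelating the streams matters for accuracy.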

european conference on machine learning | 2015

CardioWheel: ECG Biometrics on the Steering Wheel

André Lourenço; Ana Priscila Alves; Carlos Carreiras; Rui Policarpo Duarte; Ana L. N. Fred

Monitoring physiological signals while driving is a recent trend in the automotive industry. We present CardioWheel, a state-of-the-art machine learning solution for driver biometrics based on electrocardiographic (ECG) signals. The presented system pervasively acquires heart signals from the user's hands through sensors embedded in the steering wheel, to recognize the driver's identity. It combines unsupervised and supervised machine learning algorithms, and is being tested in real-world scenarios, illustrating one of the potential uses of this technology.

reconfigurable computing and fpgas | 2014

Zero-latency datapath error correction framework for over-clocking DSP applications on FPGAs

Rui Policarpo Duarte; Christos-Savvas Bouganis

Errors in the datapath of digital systems usually come with a cost that can be very expensive, either as a consequence of uncertain functionality, or because of the extra resources required to implement mitigation mechanisms and the extra latency needed to recover from errors. In this work we propose and demonstrate a novel framework which allows recovery from timing errors in a DSP application under extreme over-clocking without adding extra latency to the circuit. Demonstration of the proposed framework on a real-life image processing problem shows an improvement, on average, of 20 dB over typical implementations when the operating clock frequency is doubled.

field programmable gate arrays | 2014

Pushing the performance boundary of linear projection designs through device specific optimisations (abstract only)

Rui Policarpo Duarte; Christos-Savvas Bouganis

The continuous scaling of the fabrication process, combined with the ever increasing need for high performance designs, means that the era of treating all devices the same is about to come to an end. The presented work considers device oriented optimisations in order to further boost the performance of a Linear Projection design by focusing on the over-clocking of arithmetic operators. A methodology is proposed for the acceleration of Linear Projection designs on an FPGA that introduces information about the performance of the hardware under over-clocking conditions to the application level. The novelty of this method is a pre-characterisation of the most error-prone arithmetic operators and the utilisation of this information in the high-level optimisation process of the design. This results in a set of circuit designs that achieve higher throughput with minimum error. FPGA devices are suitable for such optimisations due to their reconfigurability, which allows performance characterisation of the underlying fabric prior to the design of the final system. The reported results show that significant gains in the performance of the system can be achieved, i.e. up to a 1.85 times speed-up in throughput compared to existing methodologies, when such device specific optimisation is considered.

applied reconfigurable computing | 2008

Multiplier-Based Double Precision Floating Point Divider According to the IEEE-754 Standard

Vítor Silva; Rui Policarpo Duarte; Mário P. Véstias; Horácio C. Neto

This paper describes the design and implementation of a unit to calculate the significand of a double precision floating point divider according to the IEEE-754 standard. Instead of the usual digit recurrence techniques, such as SRT-2 and SRT-4, it uses an iterative technique based on the Goldschmidt algorithm. As multiplication is the main operation of this algorithm, its implementation is able to take advantage of the efficiency of the embedded multipliers available in FPGAs. The results obtained indicate that multiplier-based iterative algorithms can achieve better performance than the alternative digit recurrence algorithms, at the cost of some area overhead.
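
The multiplication-only structure of Goldschmidt division can be seen in a short software sketch. Numerator and denominator are repeatedly multiplied by the same factor `2 - d`, which drives the denominator towards 1 so that the numerator converges to the quotient. The function name, the pre-scaling by one half and the iteration count are assumptions made for this illustration; the sketch ignores the rounding and exception handling that IEEE-754 compliance requires and does not reflect the hardware design of the paper.

```python
def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d for significands in [1, 2) using only
    multiplications and subtractions (Goldschmidt iteration)."""
    if not (1.0 <= n < 2.0 and 1.0 <= d < 2.0):
        raise ValueError("significands must lie in [1, 2)")
    # Scale both operands by 1/2 so the denominator lies in [0.5, 1);
    # the quotient is unchanged and the iteration converges quadratically.
    n, d = n / 2.0, d / 2.0
    for _ in range(iterations):
        f = 2.0 - d   # correction factor
        n *= f        # numerator converges to the quotient
        d *= f        # denominator converges to 1
    return n
```

Each iteration roughly squares the distance of the denominator from 1, so a handful of iterations is enough to reach double-precision accuracy in this floating-point model.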

ieee international conference on high performance computing data and analytics | 2018

Hyperspectral compressive sensing: a low-power consumption approach

José M. P. Nascimento; Mário P. Véstias; Rui Policarpo Duarte

Hyperspectral imaging instruments allow data collection in hundreds of spectral bands for the same area on the surface of the Earth. The resulting multidimensional data cube typically comprises several GBs per flight. Due to the extremely large volumes of data collected by imaging spectrometers, hyperspectral data compression, dimensionality reduction and Compressive Sensing (CS) techniques have received considerable interest in recent years. These data are usually acquired by a satellite or an airborne instrument and sent to a ground station on Earth for subsequent processing. Usually the bandwidth of the connection between the satellite/airborne platform and the ground station is reduced, which limits the amount of data that can be transmitted. As a result, there is a clear need for (either lossless or lossy) hyperspectral data compression techniques that can be applied on-board the imaging instrument. This paper presents a study of the power and time consumption and accuracy of a parallel implementation of a spectral compressive acquisition method on a Jetson TX2 platform, which is well suited to perform vector operations such as dot products. This implementation exploits the architecture at a low level, using shared memory and coalesced accesses to memory. The experiments were conducted to demonstrate the applicability of these methods to onboard processing in terms of accuracy, time consumption and power consumption. The results show that by using this low power consumption GPU it is possible to obtain real-time performance with a very limited power requirement.
