Frederico Pratas | Researchain

Publication

Featured research published by Frederico Pratas.

International Conference on Parallel Processing | 2009

Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function

Frederico Pratas; Pedro Trancoso; Alexandros Stamatakis; Leonel Sousa

We are currently faced with the situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper we focus on exploiting fine-grain parallelism for a demanding Bioinformatics application - MrBayes - and its Phylogenetic Likelihood Functions (PLF) using different architectures. Our experiments compare side-by-side the scalability and performance achieved using general-purpose multi-core processors, the Cell/BE, and Graphics Processor Units (GPU). The results indicate that all processors scale well for larger computation and data sets. Also, GPU and Cell/BE processors achieve the best improvement for the parallel code section. Nevertheless, data transfers and the execution of the serial portion of the code are the reasons for their poor overall performance. The general-purpose multi-core processors prove to be simpler to program and provide the best balance between an efficient parallel and serial execution, resulting in the largest speedup.
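The trade-off described in this abstract, where a fast parallel section is offset by serial code and data transfers, can be illustrated with Amdahl's law. The sketch below is not taken from the paper: the function name, the `transfer_overhead` parameter, and all numbers are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup, transfer_overhead=0.0):
    """Overall speedup under Amdahl's law.

    parallel_fraction: share of the original runtime that is parallelized (0..1).
    parallel_speedup:  speedup achieved on that parallel share.
    transfer_overhead: extra time (as a share of the original runtime) spent on
                       data transfers; a hypothetical extension for illustration.
    """
    serial_fraction = 1.0 - parallel_fraction
    new_runtime = (serial_fraction
                   + parallel_fraction / parallel_speedup
                   + transfer_overhead)
    return 1.0 / new_runtime


if __name__ == "__main__":
    # Hypothetical numbers: an accelerator with a 50x parallel speedup but a
    # 10% transfer overhead, versus a multi-core CPU with 8x and no transfers.
    print(amdahl_speedup(0.9, 50.0, transfer_overhead=0.1))
    print(amdahl_speedup(0.9, 8.0))
```

With these made-up inputs the device with the slower parallel section ends up with the larger overall speedup, because the serial share and the transfer overhead dominate the accelerator's runtime.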

IEEE Computer Architecture Letters | 2014

Cache-aware Roofline model: Upgrading the loft

Aleksandar Ilic; Frederico Pratas; Leonel Sousa

The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%.
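As a rough illustration of the idea, the classic Roofline bound is the minimum of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The sketch below applies that bound once per memory level to mimic a cache-aware view. The per-level treatment, the function names, and every machine parameter are assumptions made for illustration, not values or definitions from the paper.

```python
def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable performance in GFLOP/s for a given intensity in FLOP/byte."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 25.0}


def cache_aware_roofs(arithmetic_intensity):
    """One roof per memory level, assuming all data is served from that level."""
    return {
        level: roofline(arithmetic_intensity, PEAK_GFLOPS, bandwidth)
        for level, bandwidth in BANDWIDTH_GBS.items()
    }


if __name__ == "__main__":
    for intensity in (0.1, 1.0, 10.0):
        print(intensity, cache_aware_roofs(intensity))
```

A kernel whose measured performance sits well below the roof of the level it is expected to hit is a candidate for optimization. The ridge point of each roof is the peak rate divided by that level's bandwidth.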

Parallel Computing | 2012

Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems

Frederico Pratas; Pedro Trancoso; Leonel Sousa; Alexandros Stamatakis; Guochun Shi; Volodymyr V. Kindratenko

Currently, we are facing a situation where applications exhibit increasing computational demands and where a large variety of parallel processor systems are available. In this paper we focus on exploiting fine-grain parallelism for three applications with distinct characteristics: a Bioinformatics application (MrBayes), a Molecular Dynamics application (NAMD), and a database application (TPC-H). We assess, side-by-side, the performance of the three applications on general-purpose multi-core processors, the Cell Broadband Engine (Cell/BE), and Graphics Processing Units (GPU). Our results indicate that application performance depends on the characteristics of the parallel architectures and on the computational requirements of the core functions of the respective applications. For MrBayes the best overall performance is achieved on general-purpose multi-core processors, for NAMD on the Cell/BE, and for TPC-H on GPUs.

International Conference / Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2009

Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study

Frederico Pratas; Leonel Sousa

To facilitate the design of hardware accelerators, we propose in this paper the adoption of the stream-based computing model and the use of Graphics Processing Units (GPUs) as prototyping platforms. This model exposes the maximum data parallelism available in the applications and decouples computation from memory accesses. The design and implementation procedures, including the programming of GPUs, are illustrated with the widely used MrBayes bioinformatics application. Experimental results show that a straightforward mapping of the stream-based program for the GPU into hardware structures leads to improvements in performance, scalability and cost. Moreover, it is shown that a set of simple optimization techniques can be applied to reduce the cost and the power consumption of hardware solutions.

Computing in Science and Engineering | 2010

Application Acceleration with the Cell Broadband Engine

Guochun Shi; Volodymyr V. Kindratenko; Frederico Pratas; Pedro Trancoso; Michael Gschwind

The Cell Broadband Engine is a heterogeneous chip multiprocessor that combines a PowerPC processor core with eight single-instruction multiple-data accelerator cores and delivers high performance on many computationally intensive codes.

IEEE Transactions on Computers | 2017

Beyond the Roofline: Cache-Aware Power and Energy-Efficiency Modeling for Multi-Cores

Aleksandar Ilic; Frederico Pratas; Leonel Sousa

To foster the energy-efficiency in current and future multi-core processors, the benefits and trade-offs of a large set of optimization solutions must be evaluated. For this purpose, it is often crucial to consider how key micro-architecture aspects, such as accessing different memory levels and functional units, affect the attainable power and energy consumption. To ease this process, we propose a set of insightful cache-aware models to characterize the upper-bounds for power, energy and energy-efficiency of modern multi-cores in three different domains of the processor chip: cores, uncore and package. The practical importance of the proposed models is illustrated when optimizing matrix multiplication and deriving a set of power envelopes and energy-efficiency ranges of the micro-architecture for different operating frequencies. The proposed models are experimentally validated on a computing platform with a quad-core Intel 3770K processor by using hardware counters, on-chip power monitoring facilities and assembly micro-benchmarks.

Computers & Geosciences | 2015

Acceleration of stochastic seismic inversion in OpenCL-based heterogeneous platforms

Tomás Ferreirinha; Ruben Nunes; Leonardo Azevedo; Amílcar Soares; Frederico Pratas; Pedro Tomás; Nuno Roma

Seismic inversion is an established approach to model the geophysical characteristics of oil and gas reservoirs, and is one of the bases of the decision-making process in the oil and gas exploration industry. However, the required accuracy levels can only be attained by processing significant amounts of data, which often leads to long execution times. To overcome this issue and to allow the development of larger and higher-resolution elastic models of the subsurface, a novel parallelization approach is proposed that exploits GPU-based heterogeneous systems through a unified OpenCL programming framework to accelerate a state-of-the-art Stochastic Seismic Amplitude versus Offset (AVO) Inversion algorithm. To increase the parallelization opportunities while ensuring model fidelity, the proposed approach is based on a careful and selective relaxation of some spatial dependencies. Furthermore, to account for the heterogeneity of modern computing systems, which are usually composed of several different accelerating devices, multi-device parallelization strategies are also proposed. When executed on a dual-GPU system, the proposed approach reduces the execution time by up to 30 times without compromising the quality of the obtained models. Highlights: Novel approach to accelerate a Stochastic Seismic AVO Inversion algorithm. Exploitation of GPU-based heterogeneous systems based on a unified OpenCL framework. Multi-device parallelization strategies to tackle system heterogeneity. The adopted parallelization strategy ensures the quality of the inversion results. Performance speedup as high as 30× is obtained with a dual-GPU system.

IEEE Global Conference on Signal and Information Processing | 2013

Open the Gates: Using High-level Synthesis towards programmable LDPC decoders on FPGAs

Frederico Pratas; João Sousa Andrade; Gabriel Falcao; Vitor Silva; Leonel Sousa

State-of-the-art decoders for LDPC codes adopted by several digital communication standards require a significant amount of hardware resources to achieve the desired high throughput performance. With technology scaling below 22 nm and with billions of transistors available per chip/device, the development cost and complexity of such designs represent an increasing challenge for hardware designers tackling these communication algorithms. This paper proposes a new strategy for developing flexible and totally programmable long-length LDPC decoders to target execution on FPGA devices. We exploit Maxeler's Java-based technology to describe the LDPC decoder architecture. We compare the performance of this approach with state-of-the-art parallel computing architectures and show that for the most complex family of binary LDPC codes, real-time throughputs in the order of Mbit/s can be achieved with much lower development effort than imposed by RTL descriptions, and with tremendous power savings compared to powerful GPUs.

Field-Programmable Custom Computing Machines | 2013

Accelerating the Computation of Induced Dipoles for Molecular Mechanics with Dataflow Engines

Frederico Pratas; Diego Oriato; Oliver Pell; Ricardo A. Mata; Leonel Sousa

In Molecular Mechanics simulations, the treatment of electrostatics is the most computationally intensive task. Modern force fields that include explicit polarization effects, such as AMOEBA, are particularly demanding. We propose a static dataflow architecture for accelerating polarizable force fields. Results obtained with Maxeler's MaxCompiler show a speed-up factor of about 14x on a Maxeler 1U MaxNode compared to a 12-core CPU node, while using half of the dataflow engine capacity. Projections for a full chip implementation indicate that speed-ups of up to 29x per node can be reached. Moreover, our implementation on the Maxeler system shows improvements between 2.5x and 4x compared to NVIDIA Fermi-based GPUs. The current work shows the potential of dataflow engines in accelerating this field of applications.

International Conference on Parallel Processing | 2013

Monitoring Performance and Power for Application Characterization with the Cache-Aware Roofline Model

Diogo Antão; Luís Taniça; Aleksandar Ilic; Frederico Pratas; Pedro Tomás; Leonel Sousa

Accurate on-the-fly characterization of application behaviour requires assessing a set of execution-related parameters at runtime, including performance, power and energy consumption. These parameters can be obtained by relying on hardware measurement facilities built into modern multi-core architectures, such as performance and energy counters. However, current operating systems (OSs) do not provide the means to directly obtain these characterization data. Thus, the user needs to rely on complex custom-built libraries with limited capabilities, which might introduce significant execution and measurement overheads. In this work, we propose two monitoring tools for efficient performance, power and energy monitoring on systems with modern multi-core CPUs. They capture the run-time behaviour of a wide range of applications at different system levels: (i) at the user-space level, and (ii) at the kernel level, by using the OS scheduler to directly capture this information. Although the proposed monitoring facilities are useful for many purposes, we focus herein on their use for application characterization with the recently proposed Cache-aware Roofline model.
