Gabriel Falcao | Researchain

Publication

Featured researches published by Gabriel Falcao.

IEEE Transactions on Parallel and Distributed Systems | 2011

Massively LDPC Decoding on Multicore Architectures

Gabriel Falcao; Leonel Sousa; Vitor Silva

Unlike the usual VLSI approaches needed for computationally intensive Low-Density Parity-Check (LDPC) code decoders, this paper presents flexible software-based LDPC decoders. Algorithms and data structures suitable for parallel computing are proposed to perform LDPC decoding on multicore architectures. To evaluate the efficiency of the proposed parallel algorithms, LDPC decoders were developed on recent multicores, such as off-the-shelf general-purpose x86 processors, Graphics Processing Units (GPUs), and the CELL Broadband Engine (CELL/B.E.). Challenging restrictions, such as memory access conflicts, latency, coalescence, or unknown behavior of thread and block schedulers, were unraveled and worked out. Experimental results for different code lengths show throughputs on the order of 1 to 2 Mbps on the general-purpose multicores, and ranging from 40 Mbps on the GPU to nearly 70 Mbps on the CELL/B.E. The results indicate that the CELL/B.E. performs better for short to medium length codes, while the GPU achieves superior throughputs with larger codes. In some cases these throughputs come close to those obtained with VLSI decoders. The results also suggest that throughput will increase as the number of cores rises.

IEEE Transactions on Biomedical Engineering | 2012

Erratum to “A New Solution for Camera Calibration and Real-Time Image Distortion Correction in Medical Endoscopy–Initial Technical Evaluation” [Mar 12 634-644]

Rui Melo; João Pedro Barreto; Gabriel Falcao

Medical endoscopy is used in a wide variety of diagnostic and surgical procedures. These procedures are renowned for the difficulty of orienting the camera and instruments inside the human body cavities. The small size of the lens causes radial distortion of the image, which hinders the navigation process and leads to errors in depth perception and object morphology. This article presents a complete software-based system to calibrate and correct the radial distortion in clinical endoscopy in real time. Our system can be used with any type of medical endoscopic technology, including oblique-viewing endoscopes and HD image acquisition. The initial camera calibration is performed in an unsupervised manner from a single checkerboard pattern image. For oblique-viewing endoscopes, the changes in calibration during operation are handled by a new adaptive camera projection model and an algorithm that infers the rotation of the probe lens using only image information. The workload is distributed across the CPU and GPU through an optimized CPU+GPU hybrid solution. This enables real-time performance, even for HD video inputs. The system is evaluated for different technical aspects, including accuracy of modeling and calibration, overall robustness, and runtime profile. The contributions are highly relevant for applications in computer-aided surgery and image-guided intervention such as improved visualization by image warping, 3-D modeling, and visual SLAM.
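
As a rough illustration of what correcting radial distortion involves, the sketch below maps one distorted image point to its undistorted position. It assumes a one-parameter division model, which the abstract does not specify; the function name `undistort_point`, the parameter `xi`, and the distortion center `(cx, cy)` are hypothetical and chosen only for this example.

```python
def undistort_point(xd, yd, xi, cx=0.0, cy=0.0):
    """Map a distorted point (xd, yd) to its undistorted position.

    Assumed model (not taken from the abstract): one-parameter division
    model, x_u = x_d / (1 + xi * r_d^2), with r_d measured from the
    distortion center (cx, cy). Negative xi corresponds to barrel distortion.
    """
    dx, dy = xd - cx, yd - cy
    r2 = dx * dx + dy * dy          # squared distance from the center
    scale = 1.0 / (1.0 + xi * r2)   # radial rescaling factor
    return cx + dx * scale, cy + dy * scale


if __name__ == "__main__":
    # With xi < 0 a peripheral point is pushed outward; the center is fixed.
    print(undistort_point(0.5, 0.0, xi=-0.5))
    print(undistort_point(0.0, 0.0, xi=-0.5))
```

Each point is processed independently of the others, which is one reason this kind of correction suits a GPU.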

International Conference on Supercomputing | 2009

How GPUs can outperform ASICs for fast LDPC decoding

Gabriel Falcao; Vitor Silva; Leonel Sousa

Due to their huge computational requirements, powerful Low-Density Parity-Check (LDPC) error-correcting codes, discovered in the early 1960s, have only recently been adopted by emerging communication standards. LDPC decoders are supported by VLSI technology, which delivers good parallel computational power with excellent throughputs, but at significant cost. In this work, we propose an alternative flexible LDPC decoder that exploits data parallelism for simultaneous multicodeword decoding, supported by multithreading on CUDA-based graphics processing units (GPUs). The ratio of arithmetic operations per memory access is low for the efficient min-sum LDPC decoding algorithm proposed, which causes a bottleneck due to memory latency and data collisions. We propose runtime data realignment to allow coalesced parallel memory accesses to be performed by distinct threads inside the same warp. The memory access patterns of LDPC codes are random, which prevents read and write operations of the decoding process from both being coalesced at the same time. To overcome this problem we have developed a data mapping transformation that allows new addresses to be accessed contiguously for one of these two memory access types. Our implementation shows throughputs above 100 Mbps and BER curves that compare well with ASIC solutions.
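
To make the min-sum step concrete, here is a minimal sketch of the textbook check-node update: each outgoing message takes the product of the signs and the minimum of the magnitudes of all the other incoming messages. It is a plain sequential illustration, not the paper's CUDA implementation, and the function name `check_node_update` is hypothetical.

```python
def check_node_update(incoming):
    """Min-sum update for one check node.

    incoming: list of LLR messages arriving from the connected bit nodes.
    Returns one outgoing message per edge, computed from all the *other*
    incoming messages (sign product times minimum magnitude).
    """
    if len(incoming) < 2:
        raise ValueError("a check node needs at least two edges")
    mags = [abs(m) for m in incoming]
    # The smallest and second-smallest magnitudes cover every
    # "all inputs but one" case without a nested loop.
    idx_min = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min)
    # An odd number of negative inputs makes the overall sign product negative.
    total_sign = -1.0 if sum(m < 0 for m in incoming) % 2 else 1.0
    out = []
    for i, m in enumerate(incoming):
        own_sign = -1.0 if m < 0 else 1.0
        mag = min2 if i == idx_min else min1
        # Multiplying by own_sign removes this edge's sign from the product.
        out.append(total_sign * own_sign * mag)
    return out


if __name__ == "__main__":
    print(check_node_update([2.0, -1.0, 3.0]))  # [-1.0, 2.0, -1.0]
```

The update needs only comparisons and sign flips per message read, which is consistent with the low arithmetic-to-memory-access ratio the abstract describes.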

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Massive parallel LDPC decoding on GPU

Gabriel Falcao; Leonel Sousa; Vitor Silva

Low-Density Parity-Check (LDPC) codes are powerful error correcting codes (ECC). They have recently been adopted by several data communication standards such as DVB-S2 and WiMax. LDPC codes are represented by bipartite graphs, also called Tanner graphs, and their decoding demands very intensive computation. For that reason, dedicated VLSI architectures have been investigated and developed over the last few years. This paper proposes a new approach for LDPC decoding on graphics processing units (GPUs). Efficient data structures and a new algorithm are proposed to represent the Tanner graph and to perform LDPC decoding according to the stream-based computing model. GPUs were programmed to efficiently implement the proposed algorithms by applying data-parallel intensive computing. Experimental results show that GPUs perform LDPC decoding nearly three orders of magnitude faster than modern CPUs. Moreover, they lead to the conclusion that GPUs with their tremendous processing power can be considered as a consistent alternative to state-of-the-art hardware LDPC decoders.
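
A Tanner graph can be stored as two edge lists derived from the parity-check matrix. The sketch below is a generic illustration, not the stream-based data structure proposed in the paper; the helper names `tanner_graph` and `syndrome` are hypothetical, and the small (7,4) Hamming matrix is only a toy stand-in for a real, much sparser LDPC matrix.

```python
def tanner_graph(H):
    """Build the two edge lists of the Tanner graph of H.

    H is a parity-check matrix given as a list of 0/1 rows. Each row is a
    check node, each column is a bit node, and each 1 is an edge.
    """
    n_checks, n_bits = len(H), len(H[0])
    check_to_bits = [[j for j in range(n_bits) if H[i][j]]
                     for i in range(n_checks)]
    bit_to_checks = [[i for i in range(n_checks) if H[i][j]]
                     for j in range(n_bits)]
    return check_to_bits, bit_to_checks


def syndrome(check_to_bits, word):
    """Parity of each check; all zeros means the word is a valid codeword."""
    return [sum(word[j] for j in bits) % 2 for bits in check_to_bits]


if __name__ == "__main__":
    # Toy matrix for illustration only (not a real LDPC code).
    H = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]
    c2b, b2c = tanner_graph(H)
    print(syndrome(c2b, [1, 0, 0, 0, 1, 1, 0]))  # valid codeword: [0, 0, 0]
    print(syndrome(c2b, [0, 0, 0, 0, 1, 1, 0]))  # bit 0 flipped: [1, 1, 0]
```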

Global Communications Conference | 2007

Flexible Parallel Architecture for DVB-S2 LDPC Decoders

Marco Gomes; Gabriel Falcao; Vitor Silva; Vitor Ferreira; Alexandre Sengo; Miguel Falcão

State-of-the-art decoders for Low-Density Parity-Check (LDPC) codes adopted by the DVB-S2 standard exploit the periodicity (M = 360) of the special LDPC-IRA codes selected. This paper addresses the generalization of a well-known M-kernel parallel hardware structure and proposes an efficient partitioning by any factor of M, without memory addressing overhead and keeping the efficient message mapping scheme unchanged. The method provides a simple and efficient way to reduce the decoder complexity. Synthesizing the proposed decoder architecture for N = {45, 90, 180} parallel processing units using an FPGA family from Xilinx shows a minimum throughput above the minimal 90 Mbps.

IEEE Signal Processing Magazine | 2012

Portable LDPC Decoding on Multicores Using OpenCL [Applications Corner]

Gabriel Falcao; Vitor Silva; Leonel Sousa; João Sousa Andrade

This article addresses, in a tutorial style, the benefits of using Open Computing Language [1] (OpenCL) as a quick way to allow programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we will show how multiplatform kernels can be developed straightforwardly using OpenCL to perform computationally intensive low-density parity-check (LDPC) decoding, targeting them to run on a large set of widely available multicore architectures, such as x86 general-purpose multicore central processing units (CPUs) and graphics processing units (GPUs). Moreover, devices with different architectures can be orchestrated to cooperatively execute these signal processing applications programmed in OpenCL. Experimental evaluation of the parallel kernels programmed with the OpenCL framework shows that high performance can be achieved for distinct parallel computing architectures with low programming effort.

Field-Programmable Custom Computing Machines | 2012

Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model: The LDPC Decoder Case

Gabriel Falcao; Muhsen Owaida; David Novo; Madhura Purnaprajna; Nikolaos Bellas; Christos D. Antonopoulos; Georgios Karakonstantis; Andreas Burg; Paolo Ienne

Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bit width and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3× faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.

International Conference on Communications | 2013

Near-LSPA performance at MSA complexity

João Sousa Andrade; Gabriel Falcao; Vitor Silva; João Pedro Barreto; Nuno Gonçalves; Valentin Savin

The tradeoff between error-correcting performance and numerical complexity of LDPC decoding algorithms is a well-known problem. In this paper we show the previously unseen error-floor performance of the Self-Corrected Min-Sum algorithm for long length DVB-S2 codes. We developed a massively parallel simulation using GPUs which allowed a comprehensive BER characterization in both the waterfall and the error-floor regions. We show that the self-correction technique improves BER performance by 0.5 dB and 0.2 dB in the waterfall and error-floor regions, respectively, when compared to the Min-Sum algorithm. Furthermore, it comes within 0.2 dB of the Logarithmic Sum-Product BER performance, and at high SNR it also outperforms the Normalized Min-Sum, a low-complexity decoding algorithm that yields good BER performance.
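
The self-correction idea can be sketched in a few lines. This assumes the commonly described rule for Self-Corrected Min-Sum, in which a variable-to-check message whose sign flips between two consecutive iterations is treated as unreliable and erased (set to zero); the abstract itself does not spell the rule out, and the function name `self_correct` is hypothetical.

```python
def self_correct(new_msgs, prev_msgs):
    """Erase variable-to-check messages whose sign flipped since last iteration.

    Assumed rule (not stated in the abstract): a message is set to 0 when its
    sign differs from the sign it had in the previous iteration. A previous
    value of 0 (already erased, or first iteration) never triggers an erasure.
    """
    out = []
    for new, prev in zip(new_msgs, prev_msgs):
        sign_flipped = new * prev < 0   # strictly opposite, nonzero signs
        out.append(0.0 if sign_flipped else new)
    return out


if __name__ == "__main__":
    prev = [1.0, 0.7, 0.0]
    new = [1.5, -0.5, 2.0]
    print(self_correct(new, prev))  # [1.5, 0.0, 2.0]
```

The extra work per message is one sign comparison, which fits the abstract's framing of near Sum-Product performance at Min-Sum complexity.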

Neural Processing Letters | 2016

Stacked Autoencoders Using Low-Power Accelerated Architectures for Object Recognition in Autonomous Systems

Joao Maria; Joao Amaro; Gabriel Falcao; Luís A. Alexandre

This paper investigates low-energy consumption and low-power hardware models and processor architectures for performing the real-time recognition of objects in power-constrained autonomous systems and robots. Most recent developments show that convolutional deep neural networks are currently the state-of-the-art in terms of classification accuracy. In this article we propose the use of a different type of deep neural network, stacked autoencoders, and show that within a limited number of layers and nodes, for accommodating the use of low-power accelerators such as mobile GPUs and FPGAs, we are still able to achieve both classification levels not far from the state-of-the-art and a high number of processed frames per second. We present experiments using the color CIFAR-10 dataset. This enables the adaptation of the architecture to a live feed camera. Another novelty, also proposed for the first time in this work, is that the training phase can be performed on these low-power devices as well, instead of the usual approach that uses a desktop CPU or a GPU to perform this task and only runs the trained network later on the FPGA. This allows incorporating new functionalities such as a robot performing runtime learning.

International Conference on Microelectronics | 2007

High throughput encoder architecture for DVB-S2 LDPC-IRA codes

Marco Gomes; Gabriel Falcao; Alexandre Sengo; Vitor Ferreira; Vitor Silva; Miguel Falcão

Due to their excellent bit-error-rate performance, low-density parity-check (LDPC) codes have been adopted by the recent digital video satellite broadcast standard (DVB-S2). In order to simplify the encoding procedure, irregular repeat and accumulate (IRA) LDPC codes have been chosen. This paper proposes an efficient, low-delay and high-throughput encoder architecture shared by all DVB-S2 LDPC-IRA codes. The architecture exploits the periodic structure of the adopted codes by performing on-the-fly partial-parallel computation of the parity check bits. The architecture implementation on an XC2VP30 Virtex2P Xilinx FPGA (@131.7 MHz) shows a minimum throughput of 5.93 Gb/s in worst case conditions. Synthesis results are also presented.
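
IRA codes simplify encoding because the parity bits come from a running XOR (an accumulator) over partial sums of message bits. The sketch below shows that generic accumulate step in software, assuming a parity part of H with a dual-diagonal structure; it is not the paper's partial-parallel hardware architecture, and the function name `ira_encode` and the tiny connection table are hypothetical and unrelated to the real DVB-S2 tables.

```python
def ira_encode(message, check_rows):
    """Systematic encoding of a toy IRA code.

    check_rows[i] lists the message-bit indices that take part in parity
    check i. Parity bit i is the XOR of parity bit i-1 and the XOR of those
    message bits, which is the accumulate step of an IRA code.
    """
    parity = []
    acc = 0
    for row in check_rows:
        partial = 0
        for j in row:
            partial ^= message[j]   # XOR of the message bits in this check
        acc ^= partial              # accumulator: p_i = p_(i-1) XOR partial
        parity.append(acc)
    return list(message) + parity


if __name__ == "__main__":
    # Toy connections for illustration only.
    rows = [[0, 1], [1, 2], [0, 2]]
    print(ira_encode([1, 0, 1], rows))  # [1, 0, 1, 1, 0, 0]
```

In this form each parity bit depends on the previous one, so the sequential chain is short and cheap, while the partial sums for different checks can be computed independently.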
