Bertrand Le Gal | Researchain

Publication

Featured researches published by Bertrand Le Gal.

IEEE Transactions on Very Large Scale Integration Systems | 2008

Dynamic Memory Access Management for High-Performance DSP Applications Using High-Level Synthesis

Bertrand Le Gal; Emmanuel Casseau; Sylvain Huet

Multimedia applications such as video and image processing are often characterized by a huge number of data accesses. In many digital signal processing applications, array access patterns are regular and periodic. In these cases, optimized architectures using pipelined memory access controllers can be generated. In this paper, we focus on implementing memory interfacing modules that can be automatically generated by a high-level synthesis tool and that can efficiently handle predictable address patterns as well as random ones (i.e., dynamic address computations). The benefits of moving dynamic address computations from the datapath to dedicated computation units in the memory controller are also analyzed, as well as operator bitwidth optimization and data locality, to save power and reduce latency.

IEEE Transactions on Signal Processing | 2015

Multi-Gb/s software decoding of Polar Codes

Bertrand Le Gal; Camille Leroux; Christophe Jego

This paper presents an optimized software implementation of a Successive Cancellation (SC) decoder for polar codes. Despite the strong data dependencies in SC decoding, a highly parallel software polar decoder is devised for x86 processor targets. A high level of performance is achieved by exploiting the parallelism inherent in today's processor architectures (SIMD, multicore, etc.). Some optimizations that were originally devised for hardware implementations (memory reduction techniques and algorithmic simplifications) were also applied to enhance the throughput of the software implementation. Finally, some low-level optimizations such as explicit assembly description or data packing are used to improve the throughput even more. The resulting decoder description is implemented on different x86 processor targets. An analysis of the decoder in terms of latency and throughput is proposed. The influence of several parameters on the throughput and the latency is investigated: the selected target, the code rate, the code length, the SIMD mode (SSE/AVX), the multithreading mode, etc. The energy per decoded bit is also estimated. The proposed software decoder compares favorably with state-of-the-art software polar decoders. Extensive experiments demonstrate that the proposed software polar decoder exceeds 1 Gb/s for code lengths N ≤ 2^17 on a single core and reaches multi-Gb/s throughputs when using four cores in parallel in AVX mode.
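To make the data dependencies of SC decoding concrete, here is a minimal recursive sketch in Python. It is not the paper's SIMD implementation: it assumes the common min-sum form of the check-node update, a natural (non-bit-reversed) encoding order, and frozen bits fixed to 0. All function names (`f_llr`, `g_llr`, `polar_encode`, `sc_decode`) are hypothetical.

```python
import math


def f_llr(a, b):
    """Check-node update (min-sum approximation, an assumption of this sketch)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g_llr(a, b, u):
    """Variable-node update once the partial-sum bit u is known."""
    return b + (1 - 2 * u) * a


def polar_encode(u):
    """Recursive polar transform: x = [enc(u_left) XOR enc(u_right), enc(u_right)]."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    left = polar_encode(u[:half])
    right = polar_encode(u[half:])
    return [l ^ r for l, r in zip(left, right)] + right


def sc_decode(llr, frozen):
    """Return (decoded bits, re-encoded partial sums) for one sub-code.

    llr: channel log-likelihood ratios (positive means bit 0).
    frozen: list of booleans, True where the bit is frozen to 0.
    """
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    a, b = llr[:half], llr[half:]
    # The left half must be fully decoded before the right half can start:
    # this is the sequential dependency that makes SC hard to parallelize.
    u_left, x_left = sc_decode([f_llr(p, q) for p, q in zip(a, b)], frozen[:half])
    u_right, x_right = sc_decode(
        [g_llr(p, q, s) for p, q, s in zip(a, b, x_left)], frozen[half:]
    )
    x = [l ^ r for l, r in zip(x_left, x_right)] + x_right
    return u_left + u_right, x
```

Each `f_llr`/`g_llr` list comprehension operates on independent elements, which is the kind of inner loop a SIMD implementation would vectorize; the recursion itself stays sequential.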

IEEE Embedded Systems Letters | 2014

A High Throughput Efficient Approach for Decoding LDPC Codes onto GPU Devices

Bertrand Le Gal; Christophe Jego; Jérémie Crenne

The low-density parity-check (LDPC) decoding process is known to be compute intensive. This kind of digital communication application was recently implemented on graphics processing unit (GPU) devices for LDPC code performance estimation and/or for real-time measurements. Overall, previous studies of LDPC decoding on GPUs were based on the flooding-based decoding algorithm, which provides massive computation parallelism. More efficient layered schedules were proposed in the literature because a decoder iteration can be split into sublayer iterations. These schedules seem to fit GPU devices badly due to restricted computation parallelism and complex memory access patterns. However, the layered schedules speed up decoding convergence by a factor of two. In this letter, we show that: 1) a layered schedule can be efficiently implemented on a GPU device; and 2) this approach, implemented on a low-cost GPU device, provides higher throughputs with identical correction performance (BER) compared to previously published results.

IEEE Transactions on Parallel and Distributed Systems | 2016

High-Throughput Multi-Core LDPC Decoders Based on x86 Processor

Bertrand Le Gal; Christophe Jego

Low-Density Parity-Check (LDPC) codes are an efficient way to correct transmission errors in digital communication systems. Although initially targeted strictly at ASICs due to their computational complexity, LDPC decoders have recently been ported to multicore and many-core systems. Most works focused on taking advantage of GPU devices. In this paper, we propose an alternative solution based on a layered OMS/NMS LDPC decoding algorithm that can be efficiently implemented on a multi-core device using Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD) programming models. Several experiments were performed on an x86 processor target. Throughputs up to 170 Mbps were achieved on a single core of an Intel Core i7 processor when executing 20 layered decoding iterations. Throughputs reach up to 560 Mbps on four Intel Core i7 cores. Experimental results show that the proposed implementations achieve BER correction performance similar to that of previous works. Moreover, much higher throughputs have been achieved than in all previous GPU and CPU works. The gains range from 1.4x to 8x compared with recent GPU works.
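The abstract does not spell out the decoding kernels. Assuming OMS and NMS denote the usual offset and normalized min-sum variants, the check-node update at the heart of each layer can be sketched as follows. The function name and the `offset`/`scale` parameters are hypothetical, and the scalar Python below only illustrates the arithmetic, not the SIMD/SPMD mapping described in the paper.

```python
def check_node_update(msgs, offset=0.0, scale=1.0):
    """Min-sum check-node update for one parity check.

    msgs: incoming variable-to-check LLRs on the edges of the check node.
    offset: subtracted from each magnitude (offset min-sum when > 0).
    scale: multiplies each magnitude (normalized min-sum when < 1).
    Returns one outgoing LLR per edge, computed from all *other* edges.
    """
    # Product of all signs; the sign for edge i is recovered by
    # multiplying again by its own sign (sign * sign == 1).
    total_sign = 1
    for m in msgs:
        if m < 0:
            total_sign = -total_sign

    # Track the two smallest magnitudes and where the smallest one sits.
    min1 = min2 = float("inf")
    min1_idx = -1
    for i, m in enumerate(msgs):
        mag = abs(m)
        if mag < min1:
            min2, min1, min1_idx = min1, mag, i
        elif mag < min2:
            min2 = mag

    out = []
    for i, m in enumerate(msgs):
        mag = min2 if i == min1_idx else min1
        mag = scale * max(mag - offset, 0.0)
        sign = total_sign * (-1 if m < 0 else 1)
        out.append(sign * mag)
    return out
```

Only two minima and one sign product are needed per check node, which keeps the kernel branch-light and is one reason min-sum variants map well onto packed SIMD min and sign operations.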

Design Automation for Embedded Systems | 2012

Automatic low-cost IP watermarking technique based on output mark insertions

Bertrand Le Gal; Lilian Bossuet

Today, although intellectual properties (IP) and their reuse are common, their use is causing design security issues: illegal copying, counterfeiting, and reverse engineering. IP watermarking is an efficient way to detect an unauthorized IP copy or a counterfeit. In this context, many interesting solutions have been proposed. However, few combine the watermarking process with synthesis. This article presents a new solution, i.e., automatic low-cost IP watermarking included in the high-level synthesis process. The proposed method differs from those cited in the literature as the marking is not material, but is based on mathematical relationships between numeric values as inputs and outputs at specified times. Implementation results on a Xilinx Virtex-5 FPGA show that the proposed solution requires a lower area and timing overhead than existing solutions.

international conference on systems | 2009

High-level synthesis for the design of FPGA-based signal processing systems

Emmanuel Casseau; Bertrand Le Gal

High-level synthesis (HLS) currently seems to be an interesting process to reduce the design time substantially. HLS tools map algorithms to architectures. While such tools were developed targeting ASIC technologies, HLS currently draws wide interest from FPGA designers. However, with most HLS techniques, traditional resource sharing models are very inaccurate for FPGAs: for example, multiplexers can be very expensive with such technologies. Resource usage optimizations and dedicated resource binding have to be applied. In this paper, an HLS process that accounts for data width and combines scheduling and binding to carefully take interconnect cost into account is presented. Experimental results show that our approach achieves significant reductions in area (34%) and dynamic power (28%) compared to a traditional synthesis.

IEEE Communications Letters | 2016

Fast Converging ADMM-Penalized Algorithm for LDPC Decoding

Imen Debbabi; Bertrand Le Gal; Nadia Khouja; Fethi Tlili; Christophe Jego

The alternating direction method of multipliers (ADMM) approach has been recently considered for LDPC decoding. It has been proved to enhance the error rate performance compared with conventional message passing (MP) techniques in both the waterfall and error floor regions at the cost of a higher computation complexity. In this letter, a formulation of the ADMM decoding algorithm with modified computation scheduling is proposed. It increases the error correction performance of the decoding algorithm and reduces the average computation complexity of the decoding process thanks to a faster convergence. Simulation results show that this modified scheduling speeds up the decoding procedure with respect to the initial ADMM formulation while enhancing the error correction performance. This decoding speed-up is further improved when the proposed scheduling is combined with a recent complexity reduction method detailed in Wei et al., IEEE Commun. Lett., 2015.

symposium on cloud computing | 2006

Bit-Width Aware High-Level Synthesis for Digital Signal Processing Systems

Bertrand Le Gal; Caaliph Andriamisaina; Emmanuel Casseau

In this paper we propose a methodology that takes into account bit-width to optimize area and power consumption of hardware architectures provided by high-level synthesis tools. The methodology is based on a bit-width analysis using information that comes from the designer. This bit-width information is propagated through a graph which models the application. The resulting annotated graph enables datapath structure optimizations for high-level synthesis without dramatically increasing its processing time (complexity: O(n)). The methodology results in an area reduction from 17% to 43% on a Sum of Absolute Differences (SAD) computation used in block matching algorithms. The proposed approach can also be applied in a more general design context for sizing the data of an application knowing the input data formats.
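As a rough illustration of a single-pass bit-width propagation, the sketch below walks a dataflow graph in topological order and sizes each node from its operands. The rules used here (unsigned worst-case growth for `add`, `mul`, and `absdiff`) and all names are assumptions made for this example; the abstract does not detail the paper's actual analysis.

```python
def propagate_bitwidths(nodes, input_widths):
    """Annotate a dataflow graph with unsigned worst-case bit-widths.

    nodes: list of (name, op, operands) already in topological order.
    input_widths: dict mapping primary-input names to designer-given widths.
    Each node is visited once, so the pass is linear in the graph size.
    """
    widths = dict(input_widths)
    for name, op, operands in nodes:
        w = [widths[o] for o in operands]
        if op == "add":        # sum of two values needs one carry bit
            widths[name] = max(w) + 1
        elif op == "mul":      # product width is the sum of operand widths
            widths[name] = sum(w)
        elif op == "absdiff":  # |a - b| never exceeds the larger operand
            widths[name] = max(w)
        else:
            raise ValueError(f"unsupported operation: {op}")
    return widths


# Hypothetical 4-pixel SAD on 8-bit inputs: sum of |a_i - b_i|.
SAD4_NODES = [
    ("d0", "absdiff", ("a0", "b0")),
    ("d1", "absdiff", ("a1", "b1")),
    ("d2", "absdiff", ("a2", "b2")),
    ("d3", "absdiff", ("a3", "b3")),
    ("s0", "add", ("d0", "d1")),
    ("s1", "add", ("d2", "d3")),
    ("sad", "add", ("s0", "s1")),
]
SAD4_INPUTS = {f"{p}{i}": 8 for p in ("a", "b") for i in range(4)}
```

Calling `propagate_bitwidths(SAD4_NODES, SAD4_INPUTS)` sizes the final adder at 10 bits rather than a default word width, which is the kind of information a synthesis tool could use to select narrower operators.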

Integration | 2012

Design of multi-mode application-specific cores based on high-level synthesis

Emmanuel Casseau; Bertrand Le Gal

In a mobile society, more and more devices need to continuously adapt to changing environments. Such mode switches can be smoothly done in software using a general-purpose processor or a digital signal processor. However, only hardware cores can cope with both throughput and power consumption constraints. Reconfigurable hardware platforms provided by FPGA devices offer partial reconfiguration at runtime. However, their reconfiguration times are too long and they cannot satisfy mobile device power consumption requirements. In this article we propose a methodology to map selected groups of DSP tasks to multi-mode cores using conventional hardware technologies.

Computerized Medical Imaging and Graphics | 2012

FPGA based system for automatic cDNA microarray image processing

Bogdan Belean; Monica Borda; Bertrand Le Gal; Romulus Terebes

Automation is an open subject in DNA microarray image processing, aiming at reliable gene expression estimation. The paper presents a novel shock-filter-based approach for automatic microarray grid alignment. The proposed method offers significantly reduced computational complexity compared to state-of-the-art approaches, while achieving similar accuracy. Based on this approach, we also propose an FPGA-based system for microarray image analysis that eliminates the shortcomings of existing software platforms: user intervention, increased computational time, and cost. Our system includes application-specific architectures which involve algorithm parallelization, aiming at fast and automated cDNA microarray image processing. The proposed automated image processing chain is implemented both on a general-purpose processor and using the developed hardware architectures as co-processors in an FPGA-based system. The comparative results included in the last section show that a significant gain in computational time is obtained using hardware-based implementations.
