Is this you? Create Your Porfile

Bogdan Pasca

École normale supérieure de Lyon

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Bogdan Pasca is active.

Explore More

Publication

Featured researches published by Bogdan Pasca.

field-programmable logic and applications | 2009

Generating high-performance custom floating-point pipelines

Florent de Dinechin; Cristian Klein; Bogdan Pasca

Custom operators, working at custom precisions, are a key ingredient to fully exploit the FPGA flexibility advantage for high-performance computing. Unfortunately, such operators are costly to design, and application designers tend to rely on less efficient off-the-shelf operators. To address this issue, an open-source architecture generator framework is introduced. Its salient features are an easy learning curve from VHDL, the ability to embed arbitrary synthesizable VHDL code, portability to mainstream FPGA targets from Xilinx and Altera, automatic management of complex pipelines with support for frequency-directed pipeline, and automatic test-bench generation. This generator is presented around the simple example of a collision detector, which it significantly improves in accuracy, DSP count, logic usage, frequency and latency with respect to an implementation using standard floating-point operators.

ACM Sigarch Computer Architecture News | 2010

Multipliers for floating-point double precision and beyond on FPGAs

Sebastian Banescu; Florent de Dinechin; Bogdan Pasca; Radu Tudoran

The implementation of high-precision floating-point applications on reconfigurable hardware requires large multipliers. Full multipliers are the core of floating-point multipliers. Truncated multipliers, trading resources for a well-controlled accuracy degradation, are useful building blocks in situations where a full multiplier is not needed. This work studies the automated generation of such multipliers using the embedded multipliers and adders present in the DSP blocks of current FPGAs. The optimization of such multipliers is expressed as a tiling problem, where a tile represents a hardware multiplier, and super-tiles represent combinations of several hardware multipliers and adders, making efficient use of the DSP internal resources. This tiling technique is shown to adapt to full or truncated multipliers. It addresses arbitrary precisions including single, double but also the quadruple precision introduced by the IEEE-754-2008 standard and currently unsupported by processor hardware. An open-source implementation is provided in the FloPoCo project.

field-programmable logic and applications | 2009

Large multipliers with fewer DSP blocks

Florent de Dinechin; Bogdan Pasca

Recent computing-oriented FPGAs feature DSP blocks including small embedded multipliers. A large integer multiplier, for instance for a double-precision floating-point multiplier, consumes many of these DSP blocks. This article studies three non-standard implementation techniques of large multipliers: the Karatsuba-Ofman algorithm, non-standard multiplier tiling, and specialized squarers. They allow for large multipliers working at the peak frequency of the DSP blocks while reducing the DSP block usage. Their overhead in term of logic resources, if any, is much lower than that of emulating embedded multipliers. Their latency overhead, if any, is very small. Complete algorithmic descriptions are provided, carefully mapped on recent Xilinx and Altera devices, and validated by synthesis results.

field-programmable technology | 2010

Floating-point exponential functions for DSP-enabled FPGAs

Florent de Dinechin; Bogdan Pasca

This article presents a generator of floating-point exponential operators targeting recent FPGAs with embedded memories and DSP blocks. A single-precision operator consumes just one DSP block, 18Kbits of dual-port memory, and 392 slices on Virtex-4. For larger precisions, a generic approach based on polynomial approximation is used and proves more resource-efficient than the literature. For instance a double-precision operator consumes 5 BlockRAM and 12 DSP48 blocks on Virtex-5, or 10 M9k and 22 18×18 multipliers on Stratix III. This approach is flexible and is demonstrated to scale up to quadruple-precision, while enabling frequencies close to the FPGAs nominal frequency. All the proposed architectures are last-bit accurate for all the floating-point range. They are available in the open-source FloPoCo framework.

field-programmable logic and applications | 2010

Pipelined FPGA Adders

Florent de Dinechin; Hong Diep Nguyen; Bogdan Pasca

Integer addition is a universal building block, and applications such as quad-precision floating-point or elliptic curve cryptography now demand precisions well beyond 64 bits. This study explores the trade-offs between size, latency and frequency for pipelined large-precision adders on FPGA. It compares three pipelined adder architectures: the classical pipelined ripple-carry adder, a variation that reduces register count, and an FPGA-specific implementation of the carry-select adder capable of providing lower latency additions at a comparable price. For each of these architectures, resource estimation models are defined, and used in an adder generator that selects the best architecture considering the target FPGA, the target operating frequency, and the addition bit width.

field programmable gate arrays | 2015

Floating-Point DSP Block Architecture for FPGAs

Martin Langhammer; Bogdan Pasca

This work describes the architecture of a new FPGA DSP block supporting both fixed and floating point arithmetic. Each DSP block can be configured to provide one single precision IEEE-754 floating multiplier and one IEEE-754 floating point adder, or when configured in fixed point mode, the block is completely backwards compatible with current FPGA DSP blocks. The DSP block operating frequency is similar in both modes, in the region of 500MHz, offering up to 2 GMACs fixed point and 1 GFLOPs performance per block. In floating point mode, support for multi-block vector modes are provided, where multiple blocks can be seamlessly assembled into any size real or complex dot products. By efficient reuse of the fixed point arithmetic modules, as well as the fixed point routing, the floating point features have only minimal power and area impact. We show how these blocks are implemented in a modern Arria 10 FPGA family, offering over 1 TFLOPs using only embedded structures, and how scaling to multiple TFLOPs densities is possible for planned devices.

field programmable logic and applications | 2012

Correctly rounded floating-point division for DSP-enabled FPGAs

Bogdan Pasca

Floating-point division is a very costly operation in FPGA designs. High-frequency implementations of the classic digit-recurrence algorithms for division have long latencies (of the order of the number fraction bits) and consume large amounts of logic. Additionally, these implementations require important routing resources, making timing closure difficult in complete designs. In this paper we present two multiplier-based architectures for division which make efficient use of the DSP resources in recent Altera FPGAs. By balancing resource usage between logic, memory and DSP blocks, the presented architectures maintain high frequencies is full designs. Additionally, compared to classical algorithms, the proposed architectures have significantly lower latencies. The architectures target faithfully rounded results, similar to most elementary functions implementations for FPGAs but can also be transformed into correctly rounded architectures with a small overhead. The presented architectures are built using the Altera DSP Builder Advanced framework and will be part of the default blockset.

field-programmable logic and applications | 2010

Multiplicative Square Root Algorithms for FPGAs

Florent de Dinechin; Mioara Joldes; Bogdan Pasca; Guillaume Revy

Most current square root implementations for FPGAs use a digit recurrence algorithm which is well suited to their LUT structure. However, recent computing-oriented FPGAs include embedded multipliers and RAM blocks which can also be used to implement quadratic convergence algorithms, very high radix digit recurrences, or polynomial approximation algorithms. The cost of these solutions is evaluated and compared, and a complete implementation of a polynomial approach is presented within the open-source FloPoCo framework. This polynomial approach allows a shorter latency and higher frequency than the digit recurrence approach, and improves over previous multiplicative approaches. However, the cost of IEEE-compliant correct rounding is shown to be very high.

field-programmable logic and applications | 2011

FPGA-Specific Arithmetic Optimizations of Short-Latency Adders

Hong Diep Nguyen; Bogdan Pasca; Thomas B. Preußer

Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions as, for example, required for the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains. These provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with operand width, wide additions call for acceleration, which is quite reasonably achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select and carry-increment adders targeting the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are explored. The proposed architectures can be successfully used in the context of latency-critical systems or as attractive alternatives to deeply pipelined RCA schemes.

symposium on computer arithmetic | 2015

Design and Implementation of an Embedded FPGA Floating Point DSP Block

Martin Langhammer; Bogdan Pasca

This paper describes the architecture and implementation, from both the standpoint of target applications as well as circuit design, of an FPGA DSP Block that can efficiently support both fixed and single precision (SP) floating-point (FP) arithmetic. Most contemporary FPGAs embed DSP blocks that provide simple multiply-add-based fixed-point arithmetic cores. Current FP arithmetic FPGA solutions make use of these hardened DSP resources, together with embedded memory blocks and soft logic resources, however, larger systems cannot be efficiently implemented due to the routing and soft logic limitations on the devices, resulting in significant area, performance, and power consumption penalties compared to ASIC implementations. In this paper we analyse earlier proposed embedded FP implementations, and show why they are not suitable for a production FPGA. We contrast these against our solution -- a unified DSP Block -- where (a) the SP FP multiplier is overlaid on the fixed point constructs, (b) the SP FP Adder/Subtracter is integrated as a separate unit, and (c) the multiplier and adder can be combined in a way that is both arithmetically useful, but also efficient in terms of FPGA routing density and congestion. In addition, a novel way of seamlessly combining any number of DSP Blocks in a low latency structure will be introduced. We will show that this new approach allows a low cost, low power, and high density FP platform on current production 20nm FPGAs. We also describe a future enhancement of the DSP block that can support subnormal numbers.

Explore More