
Publication


Featured research published by Martin Langhammer.


Field Programmable Gate Arrays | 2015

Floating-Point DSP Block Architecture for FPGAs

Martin Langhammer; Bogdan Pasca

This work describes the architecture of a new FPGA DSP block supporting both fixed- and floating-point arithmetic. Each DSP block can be configured to provide one single-precision IEEE-754 floating-point multiplier and one IEEE-754 floating-point adder; when configured in fixed-point mode, the block is completely backwards compatible with current FPGA DSP blocks. The DSP block operating frequency is similar in both modes, in the region of 500 MHz, offering up to 2 GMACs of fixed-point and 1 GFLOPs of floating-point performance per block. In floating-point mode, support for multi-block vector modes is provided, where multiple blocks can be seamlessly assembled into real or complex dot products of any size. By efficiently reusing the fixed-point arithmetic modules, as well as the fixed-point routing, the floating-point features have only minimal power and area impact. We show how these blocks are implemented in the modern Arria 10 FPGA family, offering over 1 TFLOPs using only embedded structures, and how scaling to multiple-TFLOPs densities is possible for planned devices.
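The multi-block vector mode described above can be sketched in software. This is a minimal illustration of the chaining idea, not the actual hardware interface; `dsp_block` and `dot_product` are hypothetical names chosen for clarity:

```python
def dsp_block(a, b, chain_in=0.0):
    """One hypothetical DSP block: a multiply feeding an adder,
    with a chain input from the previous block in the cascade."""
    return a * b + chain_in

def dot_product(xs, ys):
    """Assemble a dot product of any length by cascading blocks
    through the chain path, one block per element pair."""
    acc = 0.0
    for a, b in zip(xs, ys):
        acc = dsp_block(a, b, acc)
    return acc
```

In the silicon, the chain path is a dedicated low-latency route between adjacent blocks, which is why the assembly is "seamless" rather than consuming general routing.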


Field Programmable Gate Arrays | 2009

Cholesky decomposition using fused datapath synthesis

Suleyman Sirri Demirsoy; Martin Langhammer

In this paper we present an implementation of a Cholesky decomposition core with IEEE 754 single-precision arithmetic. The datapaths are generated using fused datapath synthesis, created with an experimental floating-point compiler tool capable of fitting hundreds of floating-point operators into a single device. We present a scalable architecture for both real and complex matrices, and report results for up to 128x128 real matrices. The concepts of fused datapath synthesis for FPGA floating-point designs are reviewed, and the application to the Cholesky algorithm is detailed. Experimental results show that the accuracy of this method is superior to that expected from a traditional IEEE 754 core-based design flow.
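For reference, the textbook Cholesky recurrence that the core implements can be sketched as follows. This is the plain algorithm, not the paper's fused-datapath hardware; there, the inner dot products map onto compiler-generated floating-point operators:

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L * L^T, for a symmetric
    positive-definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of partial rows i and j: the part the
            # paper accelerates with a fused vector datapath.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L
```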


Symposium on Computer Arithmetic | 2015

Design and Implementation of an Embedded FPGA Floating Point DSP Block

Martin Langhammer; Bogdan Pasca

This paper describes the architecture and implementation, from the standpoint of both target applications and circuit design, of an FPGA DSP Block that can efficiently support both fixed-point and single-precision (SP) floating-point (FP) arithmetic. Most contemporary FPGAs embed DSP blocks that provide simple multiply-add-based fixed-point arithmetic cores. Current FP arithmetic FPGA solutions make use of these hardened DSP resources, together with embedded memory blocks and soft logic resources; however, larger systems cannot be efficiently implemented due to the routing and soft logic limitations of the devices, resulting in significant area, performance, and power consumption penalties compared to ASIC implementations. In this paper we analyse earlier proposed embedded FP implementations, and show why they are not suitable for a production FPGA. We contrast these against our solution -- a unified DSP Block -- where (a) the SP FP multiplier is overlaid on the fixed-point constructs, (b) the SP FP adder/subtractor is integrated as a separate unit, and (c) the multiplier and adder can be combined in a way that is both arithmetically useful and efficient in terms of FPGA routing density and congestion. In addition, a novel way of seamlessly combining any number of DSP Blocks in a low-latency structure is introduced. We show that this new approach allows a low-cost, low-power, and high-density FP platform on current production 20nm FPGAs. We also describe a future enhancement of the DSP block that can support subnormal numbers.


Asilomar Conference on Signals, Systems and Computers | 2008

High performance matrix multiply using fused datapath operators

Martin Langhammer

The numerous resources on current FPGA devices make them attractive for the implementation of high-performance computing and scientific algorithm acceleration. Floating-point arithmetic is required for many of these applications, and is more easily supported by the newer generation of embedded blocks such as larger memories and multipliers. This paper describes a high-performance, scalable, dense matrix multiply implementation based on a vector operator function generated with an experimental floating-point datapath compiler. Either single-precision or double-precision datapaths can be compiled, providing in excess of 50 GFLOPs double precision or 100 GFLOPs single precision in Altera Stratix III devices. In particular, the contribution of this work is to demonstrate that using fused datapath synthesis for linear algebra applications allows the entire theoretical floating-point capability of the FPGA to be used, and to introduce a benchmark for the correct embedded multiplier to soft logic ratio for these devices.
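The structure of a matrix multiply built from a vector (dot-product) operator can be sketched as follows. This is an illustrative decomposition only; `dot` stands in for the compiler-generated fused vector datapath, and the names are hypothetical:

```python
def dot(xs, ys):
    """Stand-in for the compiler-generated vector dot-product operator."""
    return sum(a * b for a, b in zip(xs, ys))

def matmul(A, B):
    """Dense matrix multiply expressed entirely as dot products:
    each output element is row(A) . column(B), mirroring how rows
    and columns would be streamed into the vector operator."""
    cols_B = list(zip(*B))  # transpose so columns are easy to iterate
    return [[dot(row, col) for col in cols_B] for row in A]
```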


Custom Integrated Circuits Conference | 2015

Arria™ 10 device architecture

Jeffrey Tyhach; Michael D. Hutton; Sean R. Atsatt; Arifur Rahman; Brad Vest; David Lewis; Martin Langhammer; Sergey Shumarayev; Tim Tri Hoang; Allen Chan; Dong-myung Choi; Dan Oh; Hae-Chang Lee; Jack Chui; Ket Chiew Sia; Edwin Yew Fatt Kok; Wei-Yee Koay; Boon-Jin Ang

This paper presents the architecture of Arria 10, a high-density FPGA family built on the TSMC 20SOC process. The device includes an embedded dual-core 1.5 GHz ARM A9 subsystem with peripherals, more than 1M logic elements (LEs) and 1.7M user flip-flops, and 64Mb of embedded memory organized into configurable memory blocks. The Arria 10 family is also the first mainstream FPGA family to include hardened single-precision IEEE 754 floating point, with an aggregate throughput of 1.3 TFLOPs. Device I/O consists of 28G programmable transceivers with an enhanced PMA architecture, hardened PCIe sub-blocks, and hardened DDR external memory controllers. New methods for digitally-assisted analog calibration are used to address process variation. The fabric is optimized for an aggressive die-size reduction and power improvement over 28nm FPGAs, and includes features such as time-borrowing FFs for micro-retiming, tri-stated long-lines for improved routability, programmable back-bias at LAB-cluster granularity, and power-management features such as Smart-VID for balancing leakage and performance across the process distribution.


Symposium on Computer Arithmetic | 2017

QRD for Parallel Arithmetic Structures

Martin Langhammer

We present a new organization of the QR decomposition (QRD), optimized for implementation on parallel arithmetic structures such as those found in current FPGAs. Data dependencies are hidden in the pipeline depths of the datapaths, allowing implementations to approach 100% sustained-to-peak throughput. The algorithm presented here is based on the Modified Gram-Schmidt (MGS) method, and is designed for floating-point (FP) arithmetic, with a combination of separate dot-product and multiply-add datapaths. In this short paper, we concentrate on the description of the algorithm and architecture, rather than the implementation, of the QRD.
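The underlying Modified Gram-Schmidt recurrence can be sketched as follows. This is the classic loop order, not the paper's pipeline-friendly reorganization; the paper's contribution is rescheduling these same dot products and multiply-adds so independent columns keep the deep datapaths full:

```python
import math

def qr_mgs(A):
    """Modified Gram-Schmidt QR: A = Q * R with orthonormal columns
    of Q and upper-triangular R, for an m x n list-of-lists A."""
    m, n = len(A), len(A[0])
    V = [row[:] for row in A]            # working copy, updated in place
    R = [[0.0] * n for _ in range(n)]
    Q = [[0.0] * n for _ in range(m)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(V[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] = V[i][j] / R[j][j]
        # Orthogonalize the remaining columns against q_j
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * V[i][k] for i in range(m))
            for i in range(m):
                V[i][k] -= R[j][k] * Q[i][j]
    return Q, R
```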


Field-Programmable Logic and Applications | 2013

Efficient floating-point polynomial evaluation on FPGAs

Martin Langhammer; Bogdan Pasca

Many applications require the evaluation of polynomials having floating-point coefficients - one example is rational polynomial approximation, often used to implement special functions. The most resource-efficient polynomial evaluation scheme (Horner's) is costly to implement on FPGAs due to the high cost associated with floating-point arithmetic. Floating-point adders are particularly costly due to their alignment stages, which require large barrel shifters. In this work we present a novel FPGA-specific technique for evaluating polynomials using the Horner scheme. Our technique removes the majority of the alignment shifters present in floating-point adders by building a fused evaluation operator. It pushes the possible alignment values of the monomials into tables containing multiple shifted coefficient instances, which are selected using the exponent of the input argument. Compared to operator assembly, this work reduces circuit latency by 30-50% and logic consumption by 40-60%. Our work can be easily extended to other polynomial evaluation methods.
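For context, the Horner recurrence itself is one multiply-add per coefficient. A minimal sketch (the paper's contribution is removing the alignment shifters inside each floating-point add, not the recurrence):

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n with Horner's scheme.
    coeffs is ordered low degree first; one multiply-add per step."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```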


Field-Programmable Custom Computing Machines | 2013

Elementary Function Implementation with Optimized Sub Range Polynomial Evaluation

Martin Langhammer; Bogdan Pasca

Efficient elementary function implementations require primitives optimized for modern FPGAs. Fixed-point function generators are one such primitive. When built around piecewise polynomial approximations, they make use of memory blocks and embedded multipliers, mapping well to contemporary FPGAs. Another primitive, which can exploit the power-series expansions of some elementary functions, is floating-point polynomial evaluation. The high costs traditionally associated with floating-point arithmetic have made this primitive unattractive for elementary function implementation on FPGAs. In this work we present a novel and efficient way of implementing floating-point polynomial evaluators on a restricted input range. We show, on the atan(x) function in double precision, that this very different technique reduces memory block count by up to 50% while only slightly increasing DSP count, compared to the best implementation built around fixed-point polynomial approximation primitives.
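The restricted-range power-series idea can be illustrated for atan(x). This is a generic series sketch, not the paper's evaluator; the term count and range here are arbitrary choices for the example:

```python
import math

def atan_series(x, terms=12):
    """Power series atan(x) = x - x^3/3 + x^5/5 - ..., valid only for
    small |x| - exactly the restricted-range regime where a
    floating-point polynomial evaluator applies. Written as
    atan(x) = x * P(-x^2) and evaluated with Horner."""
    x2 = x * x
    acc = 0.0
    for k in reversed(range(terms)):
        acc = acc * (-x2) + 1.0 / (2 * k + 1)
    return x * acc
```

On a small range the series converges fast, so few coefficients (and hence little memory) are needed; outside that range a different reduction would be required.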


Field Programmable Gate Arrays | 2013

Faithful single-precision floating-point tangent for FPGAs

Martin Langhammer; Bogdan Pasca

This paper presents an FPGA-specific implementation of the floating-point tangent function. The implementation accepts input values in the interval [-π/2, π/2], targets the IEEE-754 single-precision format, and has an accuracy of 1 ulp. The proposed work is based on a combination of mathematical identities and properties of the tangent function in floating point. The architecture was designed with the Stratix-IV DSP and memory blocks in mind, but should map well to any contemporary FPGA featuring embedded multiplier and memory blocks. It outperforms generic polynomial approximation targeting the same resource spectrum, and provides better resource trade-offs than classical CORDIC-based implementations. The presented work is widely available as part of the Altera DSP Builder Advanced Blockset.
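One common identity-based scheme for tangent, in the same spirit as the abstract's "mathematical identities," combines a table lookup with a short series via the addition formula tan(a+b) = (tan a + tan b) / (1 - tan a * tan b). This sketch is a generic illustration on a toy range, not the paper's actual architecture; the table size and series length are arbitrary:

```python
import math

TABLE_BITS = 8
SCALE = 1 << TABLE_BITS
# Table of tan at the high-order breakpoints (here covering [0, 1)).
TAN_TABLE = [math.tan(i / SCALE) for i in range(SCALE)]

def tan_approx(x):
    """Approximate tan(x) for x in [0, 1): look up tan of the high
    bits, use a short odd series for the small residual, then
    recombine with the tangent addition identity."""
    idx = int(x * SCALE)
    lo = x - idx / SCALE                      # residual, |lo| < 2^-8
    t_hi = TAN_TABLE[idx]
    t_lo = lo + lo**3 / 3 + 2 * lo**5 / 15    # small-argument tan series
    return (t_hi + t_lo) / (1.0 - t_hi * t_lo)
```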


Symposium on Computer Arithmetic | 2011

Teraflop FPGA Design

Martin Langhammer

User requirements for signal processing have increased in line with, or faster than, the increase in FPGA resources and capability. Many current signal processing algorithms require floating point, especially for military applications such as radar. The increasing system complexity of these designs also necessitates increased designer productivity, and floating point allows an easier implementation of the system model than the fixed-point arithmetic for which FPGA devices have traditionally been architected. This article reviews devices and methods for achieving consistently high-performance system implementations in floating point. Single-device designs at over 200 GFLOPs at the 40nm node, and approaching 1 Teraflop at 28nm, are described.
