Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David Defour is active.

Publication


Featured research published by David Defour.


International Conference on Computational Science | 2009

Power Consumption of GPUs from a Software Perspective

Sylvain Collange; David Defour; Arnaud Tisserand

GPUs are now considered serious challengers for high-performance computing solutions, with power consumptions of up to 300 W. This may lead to power-supply and thermal-dissipation problems in computing centers. In this article we use measurements to investigate how and where modern GPUs consume energy during various computations in a CUDA environment.


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2010

Barra: A Parallel Functional Simulator for GPGPU

Sylvain Collange; Marc Daumas; David Defour; David Parello

We present Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU). It is based on the UNISIM framework and it simulates the native instruction set of the Tesla architecture at the functional level. The inputs are CUDA executables produced by NVIDIA tools. No alterations are needed to perform simulations. As it uses parallelism, Barra generates detailed statistics on executions in about the time needed by CUDA to operate in emulation mode. We use it to understand and explore the micro-architecture design spaces of GPUs.


International Conference on Parallel Processing | 2009

Dynamic detection of uniform and affine vectors in GPGPU computations

Sylvain Collange; David Defour; Yao Zhang

We present a hardware mechanism that dynamically detects uniform and affine vectors in SPMD architectures such as Graphics Processing Units, in order to minimize pressure on the register file and reduce power consumption with minimal architectural modifications. A preliminary experimental analysis conducted with the Barra simulator shows that this optimization can benefit up to 34% of register-file reads and 22% of the computations in common GPGPU applications.
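The classification the abstract relies on is simple to state in software, even though the paper implements it in hardware. The sketch below is a hypothetical illustration, not the paper's circuit: a vector is uniform when every lane holds the same value, and affine when lane i holds base + i * stride (thread indices are the canonical example).

```python
def classify_vector(v):
    """Classify a SIMD register's lanes as 'uniform', 'affine', or 'generic'.

    A uniform vector holds the same value in every lane; an affine vector
    holds v[i] = v[0] + i * stride. Both can be stored compactly (one or
    two scalars) instead of one word per lane, which is the source of the
    register-file and power savings the abstract describes.
    """
    if all(x == v[0] for x in v):
        return "uniform"
    stride = v[1] - v[0]
    if all(v[i] == v[0] + i * stride for i in range(len(v))):
        return "affine"
    return "generic"

# Lane indices (threadIdx.x-style values) form an affine vector:
print(classify_vector([0, 1, 2, 3]))   # affine
print(classify_vector([7, 7, 7, 7]))   # uniform
print(classify_vector([3, 1, 4, 1]))   # generic
```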


IEEE Transactions on Computers | 2005

A new range-reduction algorithm

Nicolas Brisebarre; David Defour; Peter Kornerup; Jean-Michel Muller; Nathalie Revol

Range reduction is a key point for getting accurate elementary function routines. We introduce a new algorithm that is fast for input arguments belonging to the most common domains, yet accurate over the full double precision range.
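To make the problem concrete, here is the classic Cody-Waite style reduction for exp, one of the standard techniques this line of work improves upon (this is background, not the paper's new algorithm). The split constants are the well-known fdlibm values for ln(2).

```python
import math

def cody_waite_reduce(x):
    """Reduce x to r = x - k*ln(2) with |r| <= ln(2)/2, so exp(x) = 2**k * exp(r).

    ln(2) is split into a high part with trailing zero bits plus a low
    correction term, so the products k*ln2_hi and k*ln2_lo introduce very
    little rounding error. Accuracy degrades for huge |x|, which is the
    regime that motivates better range-reduction algorithms.
    """
    ln2_hi = 6.93147180369123816490e-01   # high bits of ln(2) (fdlibm)
    ln2_lo = 1.90821492927058770002e-10   # ln(2) - ln2_hi
    k = round(x / math.log(2))
    r = (x - k * ln2_hi) - k * ln2_lo
    return k, r

k, r = cody_waite_reduce(10.0)
print(k, r)                      # exp(10) == 2**k * exp(r)
print(2.0**k * math.exp(r))      # close to math.exp(10.0)
```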


Asilomar Conference on Signals, Systems and Computers | 2002

A new scheme for table-based evaluation of functions

David Defour; F. de Dinechin; Jean-Michel Muller

This paper presents a new scheme for the hardware evaluation of functions in fixed-point format, for precisions up to 30 bits. This scheme yields an architecture made of four look-up tables, a multi-operand adder, and two small multipliers. This new method is evaluated and compared with other published methods.
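For context, a minimal sketch of the baseline such schemes compete with: a single lookup table on the high bits of a fixed-point argument plus linear interpolation on the low bits. This is not the paper's four-table architecture, just the classic approach it improves upon; the table sizes and bit widths here are arbitrary illustration values.

```python
import math

PREC = 16          # fixed-point fractional bits of the argument
TABLE_BITS = 8     # table indexed by the top 8 bits of x in [0, 1)

def build_table(f):
    """Tabulate f at 2**TABLE_BITS + 1 points, scaled to PREC-bit fixed point."""
    step = 1.0 / (1 << TABLE_BITS)
    return [round(f(i * step) * (1 << PREC)) for i in range((1 << TABLE_BITS) + 1)]

def eval_fixed(table, x_fixed):
    """Evaluate f at x = x_fixed / 2**PREC by table lookup + linear interpolation."""
    hi = x_fixed >> (PREC - TABLE_BITS)               # top bits -> table index
    lo = x_fixed & ((1 << (PREC - TABLE_BITS)) - 1)   # low bits -> interpolation weight
    y0, y1 = table[hi], table[hi + 1]
    return y0 + ((y1 - y0) * lo >> (PREC - TABLE_BITS))

tab = build_table(math.sin)
x = 0.5
approx = eval_fixed(tab, round(x * (1 << PREC))) / (1 << PREC)
print(approx, math.sin(x))
```

The interpolation multiplier is what multipartite and multi-table methods shrink or eliminate, trading one large table for several small ones feeding a multi-operand adder.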


Parallel Computing | 2015

Numerical reproducibility for the parallel reduction on multi- and many-core architectures

Sylvain Collange; David Defour; Stef Graillat; Roman Iakymchuk

Highlights:
- A parallel algorithm to compute correctly rounded floating-point sums.
- Highly optimized implementations for modern CPUs, GPUs, and Xeon Phi.
- As fast as memory bandwidth allows for large sums with moderate dynamic range.
- Scales well with the problem size and the resources used on a cluster of compute nodes.

On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and therefore non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansion; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude, by leveraging arithmetic units that are left underused by standard reduction algorithms.
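The effect the abstract targets is easy to demonstrate, and the exact-accumulator idea can be sketched in a few lines. This is not the paper's superaccumulator; it stands in an exact rational accumulator for the same role, exploiting the fact that every IEEE-754 double is a dyadic rational: accumulate exactly, then round once.

```python
from fractions import Fraction

def reproducible_sum(values):
    """Order-independent, correctly rounded float sum (illustration only).

    Accumulating in exact rational arithmetic plays the role of the
    paper's fixed-point superaccumulator: the exact sum is formed with no
    intermediate rounding, then rounded to double once at the end.
    """
    acc = Fraction(0)
    for v in values:
        acc += Fraction(v)        # exact: floats are dyadic rationals
    return float(acc)             # single final rounding

data = [1e16, 1.0, -1e16, 1.0]
print(sum(data))                           # naive sum loses both 1.0s: 1.0
print(reproducible_sum(data))              # exact sum, rounded once: 2.0
print(reproducible_sum(data[::-1]))        # same result in any order: 2.0
```

Naive left-to-right summation absorbs the 1.0s into 1e16 and the answer depends on operand order; the exact accumulator is associative by construction, which is the reproducibility property the paper achieves at hardware speed.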


Computer Physics Communications | 2008

Line-by-line spectroscopic simulations on graphics processing units

Sylvain Collange; Marc Daumas; David Defour

We report here on software that performs line-by-line spectroscopic simulations on gases. Elaborate models (such as narrow-band and correlated-K) are accurate and efficient for bands where various components are not simultaneously and significantly active. Line-by-line is probably the most accurate model in the infrared for blends of gases that contain high proportions of H2O and CO2, as was the case for our prototype simulation. Our implementation on graphics processing units sustains a speedup close to 330 on computation-intensive tasks and 12 on memory-intensive tasks compared to implementations on one core of high-end processors. This speedup is due to data parallelism, efficient memory access for specific patterns, and some dedicated hardware operators only available in graphics processing units. It is obtained while leaving most processor resources available, and it would scale linearly with the number of graphics processing units in parallel machines. Line-by-line simulation coupled with simulation of fluid dynamics was long believed to be economically intractable, but our work shows that it could be done with some affordable additional resources compared to what is necessary to perform simulations of fluid dynamics alone.


Numerical Algorithms | 2004

Proposal for a Standardization of Mathematical Function Implementation in Floating-Point Arithmetic

David Defour; Guillaume Hanrot; Vincent Lefèvre; Jean-Michel Muller; Nathalie Revol; Paul Zimmermann

Some aspects of what a standard for the implementation of the mathematical functions could be are presented. Firstly, the need for such a standard is motivated. Then the proposed standard is given. The question of roundings constitutes an important part of this paper: three levels are proposed, ranging from a level relatively easy to attain (with fixed maximal relative error) up to the best quality one, with correct rounding on the whole range of every function. We do not claim that we always suggest the right choices, or that we have thought about all relevant issues. The mere goal of this paper is to raise questions and to launch the discussion towards a standard.


Parallel Computing Technologies | 2003

Software Carry-Save: A Case Study for Instruction-Level Parallelism

David Defour; Florent de Dinechin

This paper is a practical study of the performance impact of avoiding data dependencies at the algorithm level when targeting recent deeply pipelined, superscalar processors. We are interested in multiple-precision libraries offering the equivalent of quad-double precision. We show that a combination of today’s processors, today’s compilers, and algorithms written in C using a data representation which exposes parallelism is able to outperform the reference GMP library, which is partially written in assembler. We observe that the gain is related to a better use of the processor’s instruction parallelism.
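The carry-save idea behind the data representation can be sketched as follows. This is a toy model, not the paper's C library: digits are stored in words wide enough to absorb carries, so per-digit additions are independent (no carry chain, hence exposed instruction-level parallelism), and carries are propagated only occasionally.

```python
RADIX_BITS = 32
RADIX = 1 << RADIX_BITS

def csv_add(a, b):
    """Digit-wise add with no carry propagation.

    Each digit nominally holds RADIX_BITS bits but lives in a wider word,
    so many additions can be absorbed before any digit overflows. The
    per-digit adds are independent of one another, which is what lets a
    superscalar core execute them in parallel.
    """
    return [x + y for x, y in zip(a, b)]

def normalize(digits):
    """Propagate the deferred carries (done rarely, not on every add)."""
    out, carry = [], 0
    for d in digits:
        carry, low = divmod(d + carry, RADIX)
        out.append(low)
    return out, carry

def to_int(digits):
    """Little-endian digit vector -> integer, for checking results."""
    return sum(d << (i * RADIX_BITS) for i, d in enumerate(digits))

a = [RADIX - 1, 5]            # 5 * 2**32 + (2**32 - 1), little-endian
b = [RADIX - 1, 7]
s = csv_add(a, b)             # digits may exceed RADIX here; that's the point
norm, carry = normalize(s)
print(to_int(norm) + (carry << (2 * RADIX_BITS)) == to_int(a) + to_int(b))
```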


Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations | 2001

Correctly rounded exponential function in double-precision arithmetic

David Defour; Florent de Dinechin; Jean-Michel Muller

We present an algorithm for implementing correctly rounded exponentials in double-precision floating-point arithmetic. This algorithm is based on floating-point operations in the widespread IEEE-754 standard, and is therefore more efficient than those using multiprecision arithmetic, while being fully portable. It requires a table of reasonable size and IEEE-754 double-precision multiplications and additions. In a preliminary implementation, the overhead due to correct rounding is a six-fold slowdown compared to the standard library function.
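The "evaluate with extra precision, round once" idea (Ziv's strategy, which this line of work builds on) can be illustrated with Python's decimal module, whose exp is documented as correctly rounded at the working precision. This is a reference-style sketch, not the paper's table-based double-precision algorithm; 40 digits is far more than enough except for arguments pathologically close to a double rounding boundary.

```python
from decimal import Decimal, getcontext
import math

def exp_rn(x, digits=40):
    """Correctly rounded double exp(x) via high-precision evaluation.

    Decimal(x) converts the double exactly; Decimal.exp() is correctly
    rounded at the current (global) context precision; float() then
    performs the single final rounding to double. For an argument near a
    rounding boundary one would retry with more digits, as in Ziv's
    multi-level strategy.
    """
    getcontext().prec = digits
    return float(Decimal(x).exp())

print(exp_rn(1.0), math.exp(1.0))   # math.exp is usually, but not
                                    # guaranteed to be, correctly rounded
```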

Collaboration


Dive into David Defour's collaborations.

Top Co-Authors

Marc Daumas, École normale supérieure de Lyon

Florent de Dinechin, École normale supérieure de Lyon

Jean-Michel Muller, Centre national de la recherche scientifique

Federico Milano, University College Dublin