Bernard Goossens | Researchain

Publication

Featured research published by Bernard Goossens.

international conference on conceptual structures | 2013

Limits of Instruction-Level Parallelism Capture

Bernard Goossens; David Parello

We analyse the capacity of different running models to benefit from instruction-level parallelism (ILP). First, we show where the obstacles to capturing distant ILP lie. We show that (i) fetching in parallel, (ii) renaming memory references and (iii) removing parasitic true dependencies arising from stack management are the keys to capturing distant ILP. Second, we measure the potential of a new running model, named speculative forking, in which a run is dynamically multi-threaded by forking at every function and loop entry frontier, and threads communicate to link renamed consumers to their producers. We show that a run can be automatically parallelized by speculative forking and extended renaming. Most of the distant ILP, which increases with the data size, can be captured for properly compiled programs based on parallel algorithms.

parallel computing | 2010

PerPI: a tool to measure instruction level parallelism

Bernard Goossens; Philippe Langlois; David Parello; Eric Petit

We introduce and describe PerPI, a software tool that analyzes the instruction-level parallelism (ILP) of a program. ILP measures the best potential of a program to run in parallel on an ideal machine, that is, a machine with infinite resources. PerPI is a programmer-oriented tool whose function is to improve the understanding of how the algorithm and the (micro-)architecture will interact. PerPI fills the gap between the manual analysis of an abstract algorithm and implementation-dependent profiling tools. The current version provides reproducible measures of the average number of instructions per cycle executed on an ideal machine, histograms of these instructions, and associated data-flow graphs for any x86 binary file. We illustrate how these measures explain the actual performance of core numerical subroutines when measured run times cannot be correlated with the classical flop count analysis.
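
As an illustration of the quantity described in this abstract, the following is a minimal sketch of how an average instructions-per-cycle figure on an ideal machine could be computed from an instruction trace. It is not PerPI's implementation: the trace format, the function name `ideal_ilp`, the unit latency per instruction and the perfect-renaming assumption (only true read-after-write dependencies count) are all assumptions made for this example.

```python
def ideal_ilp(trace):
    """Return (ilp, cycles) for a trace run on a hypothetical ideal machine.

    trace: list of (dest, sources) pairs in program order, where dest is a
    register name (or None) and sources is a list of register names.
    Assumptions: infinite resources, unit latency, perfect renaming, so only
    true (read-after-write) dependencies delay an instruction.
    """
    if not trace:
        return 0.0, 0
    ready = {}   # register -> cycle in which its latest value is produced
    cycles = 0   # length of the critical path seen so far
    for dest, sources in trace:
        # An instruction runs one cycle after its latest producer;
        # registers never written in the trace are available at cycle 0.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        cycles = max(cycles, cycle)
    return len(trace) / cycles, cycles


loads = [("r1", []), ("r2", []), ("r3", []), ("r4", [])]

# Sum of four values as a balanced tree: (r1 + r2) + (r3 + r4)
tree = loads + [("r5", ["r1", "r2"]), ("r6", ["r3", "r4"]), ("r7", ["r5", "r6"])]

# Same sum as a sequential chain: ((r1 + r2) + r3) + r4
chain = loads + [("r5", ["r1", "r2"]), ("r5", ["r5", "r3"]), ("r5", ["r5", "r4"])]

tree_ilp, tree_cycles = ideal_ilp(tree)
chain_ilp, chain_cycles = ideal_ilp(chain)
print(tree_ilp, tree_cycles)
print(chain_ilp, chain_cycles)
```

Both traces contain seven instructions, but the tree-shaped sum has a shorter critical path than the chain and therefore a higher ILP under these assumptions, which is the kind of algorithm-level difference a flop count does not show.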

Technique Et Science Informatiques | 2006

Ordonnancement distribué d'instructions (Distributed instruction scheduling)

Bernard Goossens; David Defour

This article presents an algorithm to perform a distributed computation of the instructions, suited to high-degree superscalar microarchitectures. The method relies on partitioning both the register file and the reservation stations in order to decrease the number of register file access ports and the number of station comparators. Matching the results with the dependent sources is no longer global but point to point, thanks to an identification of the instructions and their components. By limiting the access resources of each renaming register to four ports, the method makes it possible, despite an increase in the number of registers, to keep the access time below the cycle time.

Future Generation Computer Systems | 2005

The instruction register file micro-architecture

Bernard Goossens; David Defour

In this paper, we address the issue of feeding future superscalar processor cores with enough instructions. Hardware techniques targeting an increase in the instruction fetch bandwidth have been proposed, such as the trace cache microarchitecture. We present a microarchitecture solution based on a register file holding basic blocks of instructions. This solution places the instruction memory hierarchy out of the cycle-determining path. We call our approach the instruction register file (IRF). We evaluate our approach with a SimpleScalar-based simulator run on the Mediabench benchmark suite and compare it to the trace cache performance on the same benchmarks. We show that, on this benchmark suite, an IRF-based processor fetching up to three basic blocks per cycle outperforms a trace-cache-based processor fetching 16-instruction-long traces by 25% on average.

ComPAS: Conférence en Parallélisme, Architecture et Système | 2014