Vincent Loechner
University of Strasbourg
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Vincent Loechner.
application specific systems architectures and processors | 1996
Philippe Clauss; Vincent Loechner
In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and discrete geometry, linear programming, combinatorics and geometry of numbers.In this paper, we present automatic methods for computing the parametric vertices and the Ehrhart polynomial, i.e., a parametric expression of the number of integer points, of a polytope defined by a set of parametric linear constraints.These methods have many applications in analysis and transformations of nested loop programs. The paper is illustrated with exact symbolic array dataflow analysis, estimation of execution time, and with the computation of the maximum available parallelism of given loop nests.
Algorithmica | 2007
Sven Verdoolaege; Rachid Seghir; Kristof Beyls; Vincent Loechner; Maurice Bruynooghe
AbstractMany compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials, each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent the quasi-polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasi-polynomials analytically. It extends an existing method, based on Barvinoks decomposition, for counting the number of integer points in a non-parametric polytope. Our technique always produces a solution and computes polynomially-sized enumerators in polynomial time (for fixed dimensions).
International Journal of Parallel Programming | 1997
Vincent Loechner; Doran Wilde
Algorithms specified for parametrically sized problems are more general purpose and more reusable than algorithms for fixed sized problems. For this reason, there is a need for representing and symbolically analyzing linearly parameterized algorithms. An important class of parallel algorithms can be described as systems of parameterized affine recurrence equations (PARE). In this representation, linearly parameterized polyhedra are used to describe the domains of variables. This paper describes an algorithm which computes the set of parameterized vertices of a polyhedron, given its representation as a system of parameterized inequalities. This provides an important tool for the symbolic analysis of the parameterized domains used to define variables and computation domains in PAREs. A library of operations on parameterized polyhedra based on the Polyhedral Library has been written in C and is freely distributed.
The Journal of Supercomputing | 2002
Vincent Loechner; Benoît Meister; Philippe Clauss
A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.
compilers, architecture, and synthesis for embedded systems | 2004
Sven Verdoolaege; Rachid Seghir; Kristof Beyls; Vincent Loechner; Maurice Bruynooghe
Many optimization techniques, including several targeted specifically at embedded systems, depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that this parametric count can be represented by a set of Ehrhart polynomials. Previously, interpolation was used to obtain these polynomials, but this technique has several disadvantages. Its worst-case computation time for a single Ehrhart polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such an Ehrhart polynomial (measured in bits needed to represent the polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution.Our main contribution is a novel method for calculating Ehrhart polynomials analytically. It extends an existing method, based on Barvinoks decomposition, for counting the number of integer points in a non-parametric polytope. Our technique always produces a solution and computes polynomially-sized Ehrhart polynomials in polynomial time (for fixed dimensions).
International Journal of Parallel Programming | 2014
Alexandra Jimborean; Philippe Clauss; Jean-François Dollinger; Vincent Loechner; Juan Manuel Martinez Caamaño
We propose a framework based on an original generation and use of algorithmic skeletons, and dedicated to speculative parallelization of scientific nested loop kernels, able to apply at run-time polyhedral transformations to the target code in order to exhibit parallelism and data locality. Parallel code generation is achieved almost at no cost by using binary algorithmic skeletons that are generated at compile-time, and that embed the original code and operations devoted to instantiate a polyhedral parallelizing transformation and to verify the speculations on dependences. The skeletons are patched at run-time to generate the executable code. The run-time process includes a transformation selection guided by online profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are used to compute on-the-fly dependence distance vectors, and are also interpolated to build a predictor of the forthcoming accesses. Interpolating functions and distance vectors are then employed for dependence analysis to select a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. In order to ensure that the rollback time overhead stays low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallel version which instantiates a skeleton, a sequential original version, or an instrumented version. Moreover, such slicing of the execution provides the opportunity of transforming differently the code to adapt to the observed execution phases, by patching differently one of the pre-built skeletons. The framework has been implemented with extensions of the LLVM compiler and an x86-64 runtime system. Significant speed-ups are shown on a set of benchmarks that could not have been handled efficiently by a compiler.
Proceedings of the First IFIP TC10 International Workshop on Software Engineering for Parallel and Distributed Systems | 1996
Vincent Loechner; Catherine Mongenet
This paper presents the mathematical notions for the parallelization of DO-Loops used in the tool OPERA currently under development in our team. It aims at giving the user an environment to parallelize problems described by systems of parameterized affine recurrence equations which formalize single-assignment loop nests. The parallelization technique used in OPERA is based on a classical linear spacextime transformation. Its objectives are to visualize the affine dependences of a problem and to propose a set of different parallel solutions depending on various architectural constraints.
compiler construction | 2012
Alexandra Jimborean; Luis Mastrangelo; Vincent Loechner; Philippe Clauss
VMAD (Virtual Machine for Advanced Dynamic analysis) is a platform for advanced profiling and analysis of programs, consisting in a static component and a runtime system. The runtime system is organized as a set of decoupled modules, dedicated to specific instrumenting or optimizing operations, dynamically loaded when required. The program binary files handled by VMAD are previously processed at compile time to include all necessary data, instrumentation instructions and callbacks to the runtime system. For this purpose, the LLVM compiler has been extended to automatically generate multiple versions of the code, each of them tailored for the targeted instrumentation or optimization strategies. The compiler chooses the most suitable intermediate representation for each version, depending on the information to be acquired and on the optimizations to be applied. The control flow graph is adapted to include the new versions and to transfer the control to and from the runtime system, which is in charge of the execution flow orchestration. The strength of our system resides in its extensibility, as one can add support for various new profiling or optimization strategies, independently of the existing modules. VMADs potential is illustrated by presenting several analysis and optimization applications dedicated to loop nests: instrumentation by sampling, dynamic dependence analysis, adaptive version selection.
european conference on parallel processing | 1997
Vincent Loechner; Catherine Mongenet
This paper deals with communication optimization which is a crucial issue in automatic parallelization. From a system of parameterized affine recurrence equations, we propose a heuristic which determines an efficient space-time transformation. It reduces first the distant communications and then the local communications.
acm sigplan symposium on principles and practice of parallel programming | 2012
Alexandra Jimborean; Philippe Clauss; Benoı̂t Pradelle; Luis Mastrangelo; Vincent Loechner
In this paper, we present a Thread-Level Speculation (TLS) framework whose main feature is to be able to speculatively parallelize a sequential loop nest in various ways, by re-scheduling its iterations. The transformation to be applied is selected at runtime with the goal of minimizing the number of rollbacks and maximizing performance. We perform code transformations by applying the polyhedral model that we adapted for speculative and runtime code parallelization. For this purpose, we design a parallel code pattern which is patched by our runtime system according to the profiling information collected on some execution samples. Adaptability is ensured by considering chunks of code of various sizes, that are launched successively, each of which being parallelized in a different manner, or run sequentially, depending on the currently observed behavior for accessing memory. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.