Alban Douillet | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Alban Douillet is active.

Explore More

Publication

Featured researches published by Alban Douillet.

ACM Transactions on Architecture and Code Optimization | 2007

Single-dimension software pipelining for multidimensional loops

Hongbo Rong; Zhizhong Tang; R. Govindarajan; Alban Douillet; Guang R. Gao

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. We propose a three-step approach, called single-dimension software pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level. The first step identifies the most profitable loop level for software pipelining in terms of initiation rate or data reuse potential. The second step simplifies the multidimensional data-dependence graph (DDG) into a 1-dimensional DDG and constructs a 1-dimensional schedule for the selected loop level. The third step derives a simple mapping function which specifies the schedule time for the operations of the multidimensional loop, based on the 1-dimensional schedule. We prove that the SSP method is correct and at least as efficient as other modulo scheduling methods. We establish the feasibility and correctness of our approach by implementing it on the IA-64 architecture. Experimental results on a small number of loops show significant performance improvements over existing modulo scheduling methods that software pipeline a loop nest from the innermost loop.

compiler construction | 2001

Speculative Prefetching of Induction Pointers

Artour Stoutchinin; José Nelson Amaral; Guang R. Gao; James C. Dehnert; Suneel Jain; Alban Douillet

We present an automatic approach for prefetching data for linked list data structures. The main idea is based on the observation that linked list elements are frequently allocated at constant distance from one another in the heap. When linked lists are traversed, a regular pattern of memory accesses with constant stride emerges. This regularity in the memory footprint of linked lists enables the development of a prefetching framework where the address of the element accessed in one of the future iterations of the loop is dynamically predicted based on its previous regular behavior. We automatically identify pointer-chasing recurrences in loops that access linked lists. This identification uses a surprisingly simple method that looks for induction pointers -- pointers that are updated in each loop iteration by a load with a constant offset. We integrate induction pointer prefetching with loop scheduling. A key intuition incorporated in our framework is to insert prefetches only if there are processor resources and memory bandwidth available. In order to estimate available memory bandwidth we calculate the number of potential cache misses in one loop iteration. Our estimation algorithm is based on an application of graph coloring on a memory access interference graph derived from the control flow graph. We implemented the prefetching framework in an industry-strength production compiler, and performed experiments on ten benchmark programs with linked lists. We observed performance improvements between 15% and 35% in three of them.

symposium on code generation and optimization | 2004

Single-dimension software pipelining for multi-dimensional loops

Hongbo Rong; Zhizhong Tang; R. Govindarajan; Alban Douillet; Guang R. Gao

programming language design and implementation | 2005

Register allocation for software pipelined multi-dimensional loops

Hongbo Rong; Alban Douillet; Guang R. Gao

Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop iterations to explore instruction-level parallelism from the entire n-dimensional iteration space. This paper investigates register allocation for software pipelined multi-dimensional loops.For single loop software pipelining, the lifetime instances of a loop variant in successive iterations of the loop form a repetitive pattern. An effective register allocation method is to represent the pattern as a vector of lifetimes (or a vector lifetime using Raus terminology) and map it to rotating registers. Unfortunately, the software pipelined schedule of a multi-dimensional loop is considerably more complex, and so are the vector lifetimes in it.In this paper, we develop a way to normalize and represent vector lifetimes in multi-dimensional loop software pipelining, which capture their complexity, while exposing their regularity that enables us to develop a simple, yet powerful solution. Our algorithm is based on the development of a metric, called distance, that quantitatively determines the degree of potential overlapping (conflicts) between two vector lifetimes. We show how to calculate and use the distance, conservatively or aggressively, to guide the register allocation of the vector lifetimes under a bin-packing algorithm framework. The classical register allocation for software pipelined single loops is subsumed by our method as a special case.The method has been implemented in the ORC compiler and produced code for the Itanium architecture. We report the effectiveness of our method on 134 loop nests with 348 loop levels. Several strategies for register allocation are compared and analyzed.

international conference on parallel architectures and compilation techniques | 2007

Software-Pipelining on Multi-Core Architectures

Alban Douillet; Guang R. Gao

It is becoming increasingly evident that multi-core chip architecture are emerging as a solution to efficiently amortizing the ever-growing number of transistors on a chip. However the success of such multi-core chips depends on the advances in system software technology, such as compiler and run-time system, in order for the application programs to exploit thread level parallelism out of originally single-threaded applications and to fully utilize the hardware on-chip concurrency. In this paper, we propose a method which, from a parallel and non-parallel imperfect loop nest written in a standard sequential language such as C or Fortran, automatically generates a multi-threaded software-pipelined schedule for multi-core architectures. The generated schedule already contains all the necessary synchronization instructions and is guaranteed free of deadlocks and buffer overflow. The feasibility of the proposed method within a modern compiler infrastructure has been verified through a pilot implementation in the Open64 compiler and tested on the IBM Cyclops multi-core architecture. Experimental results show that the performance exhibits good scalability even with 100 cores. Our light-weight synchronization mechanism minimizes the dependencies stalls and synchronization overheads without the use of dedicated hardware support.

symposium on code generation and optimization | 2004

Code generation for single-dimension software pipelining of multi-dimensional loops

Hongbo Rong; Alban Douillet; R. Govindarajan; Guang R. Gao

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to the outer loops. We proposed a scheduling method, called single-dimension software pipelining (SSP), to software pipeline a multidimensional loop nest at an arbitrary loop level. We describe our solution to SSP code generation. In contrast to traditional software pipelining, SSP handles two distinct repetitive patterns, and thus requires new code generation algorithms. Further, these two distinct repetitive patterns complicate register assignment and require two levels of register renaming. As rotating registers support renaming at only one level, our solution is based on a combination of dynamic register renaming (using rotating registers) and static register renaming (using code replication). Finally, code size increase, an even more important issue for SSP than for traditional software-pipelining, is also addressed. Optimizations are proposed to reduce code size without significant performance degradation. We first present a code generation scheme and subsequently implement it for the IA-64 architecture, making effective use of rotating registers and predicated execution. We present some initial experimental results, which demonstrate not only the feasibility and correctness of our code generation scheme, but also its code quality.

languages and compilers for parallel computing | 2005

Register pressure in software-pipelined loop nests: fast computation and impact on architecture design

Alban Douillet; Guang R. Gao

Recently the Single-dimension Software Pipelining (SSP) technique was proposed to software pipeline loop nests at an arbitrary loop level [18,19,20]. However, SSP schedules require a high number of rotating registers, and may become infeasible if register needs exceed the number of available registers. It is therefore desirable to design a method to compute the register pressure quickly (without actually performing the register allocation) as an early measure of the feasibility of an SSP schedule. Such a method can also be instrumental to provide a valuable feedback to processor architects in their register files design decision, as far as the needs of loop nests are concerned. This paper presents a method that computes quickly the minimum number of rotating registers required by an SSP schedule. The results have demonstrated that the method is always accurate and is 3 to 4 orders of magnitude faster on average than the register allocator. Also, experiments suggest that 64 floating-point rotating registers are in general enough to accommodate the needs of the loop nests used in scientific computations.

ACM Transactions on Programming Languages and Systems | 2008

Register allocation for software pipelined multidimensional loops

Hongbo Rong; Alban Douillet; Guang R. Gao

This article investigates register allocation for software pipelined multidimensional loops where the execution of successive iterations from an n-dimensional loop is overlapped. For single loop software pipelining, the lifetimes of a loop variable in successive iterations of the loop form a repetitive pattern. An effective register allocation method is to represent the pattern as a vector of lifetimes (or a vector lifetime using Raus terminology [Rau 1992]) and map it to rotating registers. Unfortunately, the software pipelined schedule of a multidimensional loop is considerably more complex and so are the vector lifetimes in it. In this article, we develop a way to normalize and represent the vector lifetimes, which captures their complexity, while exposing their regularity that enables a simple solution. The problem is formulated as bin-packing of the multidimensional vector lifetimes on the surface of a space-time cylinder. A metric, called distance, is calculated either conservatively or aggressively to guide the bin-packing process, so that there is no overlapping between any two vector lifetimes, and the register requirement is minimized. This approach subsumes the classical register allocation for software pipelined single loops as a special case. The method has been implemented in the ORC compiler and produced code for the IA-64 architecture. Experimental results show the effectiveness. Several strategies for register allocation are compared and analyzed.

international conference on parallel processing | 2006

Multi-dimensional kernel generation for loop nest software pipelining

Alban Douillet; Hongbo Rong; Guang R. Gao

Single-dimension Software Pipelining (SSP) has been proposed as an effective software pipelining technique for multi-dimensional loops [16]. This paper introduces for the first time the scheduling methods that actually produce the kernel code. Because of the multi-dimensional nature of the problem, the scheduling problem is more complex and challenging than with traditional modulo scheduling. The scheduler must handle multiple subkernels and initiation rates under specific scheduling constraints, while producing a solution that minimizes the execution time of the final schedule. In this paper three approaches are proposed: the level-by-level method, which schedules operations in loop level order, starting from the innermost, and does not let other operations interfere with the already scheduled levels, the flat method, which schedules operations from different loop levels with the same priority, and the hybrid method, which uses the level-by-level mechanism for the innermost level and the flat solution for the other levels. The methods subsume Huffs modulo scheduling [8] for single loops as a special case. We also break a scheduling constraint introduced in earlier publications and allow for a more compact kernel. The proposed approaches were implemented in the Open64/ORC compiler, and evaluated on loop nests from the Livermore, SPEC200 and NAS benchmarks.

Archive | 2004