Louis-Noël Pouchet | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Louis-Noël Pouchet is active.

Explore More

Publication

Featured researches published by Louis-Noël Pouchet.

ACM Transactions on Architecture and Code Optimization | 2016

Static and Dynamic Frequency Scaling on Multicore CPUs

Wenlei Bao; Changwan Hong; Sudheer Chunduri; Sriram Krishnamoorthy; Louis-Noël Pouchet; Fabrice Rastello; P. Sadayappan

Dynamic Voltage and Frequency Scaling (DVFS) typically adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical DVFS approaches include using default strategies such as running at the lowest or the highest frequency or reacting to the CPU’s runtime load to reduce or increase frequency based on the CPU usage. In this article, we argue that a compile-time approach to CPU frequency selection is achievable for affine program regions and can significantly outperform runtime-based approaches. We first propose a lightweight runtime approach that can exploit the properties of the power profile specific to a processor, outperforming classical Linux governors such as powersave or on-demand for computational kernels. We then demonstrate that, for affine kernels in the application, a purely compile-time approach to CPU frequency and core count selection is achievable, providing significant additional benefits over the runtime approach. Our framework relies on a one-time profiling of the target CPU, along with a compile-time categorization of loop-based code segments in the application. These are combined to determine at compile-time the frequency and the number of cores to use to execute each affine region to optimize energy or energy-delay product. Extensive evaluation on 60 benchmarks and 5 multi-core CPUs show that our approach systematically outperforms the powersave Linux governor while also improving overall performance.

design automation conference | 2017

Accurate High-level Modeling and Automated Hardware/Software Co-design for Effective SoC Design Space Exploration

Wei Zuo; Louis-Noël Pouchet; Andrey Ayupov; Taemin Kim; Chung-Wei Lin; Shinichi Shiraishi; Deming Chen

A desirable feature of a development tool for SoC design is that, given the important applications in the domain to be targeted by the SoC, a powerful hardware-software partitioning engine is available to determine which function(s) shall be mapped to hardware. However, to provide high-quality partitioning, this engine must be able to consider a rich design space of possible alternate hardware and software implementations for each program region candidate for hardware acceleration, in turn making the task of finding the optimal mapping very difficult given the number of design points to consider and the need for accurate modeling of latency, power and area. In this work we propose a novel framework to enable hardware acceleration of performance-critical parts of an application, by addressing the problem of hardware/software partitioning under power and area constraints to minimize the overall program latency. Our flow is based on the LLVM compiler, and focuses on building a scalable compile-time partitioning algorithm while considering large sets of alternative hardware and software implementations for a particular region. To this end we develop a hybrid approach based on mixing semi-random selection of hardware design points and an Integer Linear Programming formulation of the mapping decision, along with iterative refinements of the solution. Experimental results demonstrate the capability of our approach to consider complex designs and yet output near-optimal partitioning decision. Our package is named RIP (Randomized ILP-based Partitioning), and is open source to benefit the research community.

international conference on supercomputing | 2017

Simplification and runtime resolution of data dependence constraints for loop transformations

Diogo Sampaio; Louis-Noël Pouchet; Fabrice Rastello

Loop transformations such as tiling, parallelization or vectorization are essential tools in the quest for high-performance program execution. Precise data dependence analysis is required to determine whether the compiler can apply a transformation or not. In particular, current static analyses typically fail to provide precise enough dependence information when the code contains indirect memory accesses or polynomial subscript functions to index arrays. This leads to considering superfluous may-dependences between instructions that prevent many loop transformations to be applied. In this work we present a new hybrid (static/dynamic) framework that allows to overcome several limitations of purely static dependence analyses: For a given loop transformation, we statically generate a test to be evaluated at runtime. This test allows to determine whether the transformation is valid, and if so triggers the execution of the transformed code, falling back to the original code otherwise. Such test, originally constructed as a loop-based code with O(n2) iterations (n being the number of iterations of the original loop-nest), are reduced to a loop-free test of O(1) complexity thanks to a new quantifier elimination scheme that we introduce in this paper. The precision and low overhead of our method is demonstrated over 25 kernels.

programming language design and implementation | 2018

GPU code optimization using abstract kernel emulation and sensitivity analysis

Changwan Hong; Aravind Sukumaran-Rajam; Jinsung Kim; Prashant Singh Rawat; Sriram Krishnamoorthy; Louis-Noël Pouchet; Fabrice Rastello; P. Sadayappan

In this paper, we develop an approach to GPU kernel optimization by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernels execution. The utility of the bottleneck analysis is demonstrated in two contexts: 1) Coupling the new bottleneck-driven optimization strategy with the OpenTuner auto-tuner: experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate effectiveness. 2) Manual code optimization: two case studies illustrate the use of the bottleneck analysis to iteratively improve the performance of code from state-of-the-art domain-specific code generators.

acm sigplan symposium on principles and practice of parallel programming | 2018

Register optimizations for stencils on GPUs

Prashant Singh Rawat; Fabrice Rastello; Aravind Sukumaran-Rajam; Louis-Noël Pouchet; Atanas Rountev; P. Sadayappan

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

acm sigplan symposium on principles and practice of parallel programming | 2018

Performance modeling for GPUs using abstract kernel emulation

Changwan Hong; Aravind Sukumaran-Rajam; Jinsung Kim; Prashant Singh Rawat; Sriram Krishnamoorthy; Louis-Noël Pouchet; Fabrice Rastello; P. Sadayappan

Performance modeling of GPU kernels is a significant challenge. In this paper, we develop a novel approach to performance modeling for GPUs through abstract kernel emulation along with latency/gap modeling of resources. Experimental results on all benchmarks from the Rodinia suite demonstrate good accuracy in predicting execution time on multiple GPU platforms.

symposium on principles of programming languages | 2017

Analytical modeling of cache behavior for affine programs

Wenlei Bao; Sriram Krishnamoorthy; Louis-Noël Pouchet; P. Sadayappan

Optimizing compilers implement program transformation strategies aimed at reducing data movement to or from main memory by exploiting the data-cache hierarchy. However, instead of attempting to minimize the number of cache misses, very approximate cost models are used, due to the lack of precise compile-time models for misses for hierarchical caches. The current state of practice for cache miss analysis is based on accurate simulation. However, simulation requires time proportional to the dataset/problem size, as well as the number of distinct cache configurations of interest to be evaluated. This paper takes a fundamentally different approach, by focusing on polyhedral programs with static control flow. Instead of relying on costly simulation, a closed-form solution for modeling of misses in a set associative cache hierarchy is developed. This solution can enable program transformation choice at compile time to optimize cache misses. A tool implementing the approach has been developed and used for validation of the framework.

Archive | 2013