
Publications


Featured research published by Thomas Henretty.


Compiler Construction | 2011

Data layout transformation for stencil computations on short-vector SIMD architectures

Thomas Henretty; Kevin Stock; Louis-Noël Pouchet; Franz Franchetti; J. Ramanujam; P. Sadayappan

Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.


International Conference on Supercomputing | 2013

A stencil compiler for short-vector SIMD architectures

Thomas Henretty; Richard Veras; Franz Franchetti; Louis-Noël Pouchet; J. Ramanujam; P. Sadayappan

Stencil computations are an integral component of applications in a number of scientific computing domains. Short-vector SIMD instruction sets are ubiquitous on modern processors and can be used to significantly increase the performance of stencil computations. Traditional approaches to optimizing stencils on these platforms have focused on either short-vector SIMD or data locality optimizations. In this paper, we propose a domain-specific language and compiler for stencil computations that allows specification of stencils in a concise manner and automates both locality and short-vector SIMD optimizations, along with effective utilization of multi-core parallelism. Loop transformations to enhance data locality and enable load-balanced parallelism are combined with a data layout transformation to effectively increase the performance of stencil computations. Performance increases are demonstrated for a number of stencils on several modern SIMD architectures.


International Conference on Parallel Architectures and Compilation Techniques | 2009

Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

Qingda Lu; Christophe Alias; Uday Bondhugula; Thomas Henretty; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan; Yongjian Chen; Haibo Lin; Tin-Fook Ngai

With increasing numbers of cores, future CMPs (Chip Multi-Processors) are likely to have a tiled architecture with a portion of shared L2 cache on each tile and a bank-interleaved distribution of the address space. Although such an organization is effective for avoiding access hot-spots, it can cause a significant number of non-local L2 accesses for many commonly occurring regular data access patterns. In this paper, we develop a compile-time framework for data locality optimization via data layout transformation. Using a polyhedral model, the program's localizability is determined by analysis of its index set and array reference functions, followed by non-canonical data layout transformation to reduce non-local accesses for localizable computations. Simulation-based results on a 16-core 2D tiled CMP demonstrate the effectiveness of the approach. The developed program transformation technique is also useful in several other data layout transformation contexts.


Symposium on Code Generation and Optimization | 2010

Parameterized tiling revisited

Muthu Manikandan Baskaran; Albert Hartono; Sanket Tavarageri; Thomas Henretty; J. Ramanujam; P. Sadayappan

Tiling, a key transformation for optimizing programs, has been widely studied in the literature. Parameterized tiled code is important for auto-tuning systems since they often execute a large number of runs with dynamically varied tile sizes. Previous work on tiled code generation has addressed parameterized tiling for the sequential context, and the parallel case with fixed compile-time constants for tile sizes. In this paper, we revisit the problem of generating tiled code using parametric tile sizes. We develop a systematic approach to formulate tiling transformations through manipulation of linear inequalities and develop a novel approach to overcoming the fundamental obstacle faced by previous approaches regarding generation of parallel parameterized tiled code. To the best of our knowledge, the approach proposed in this paper is the first compile-time solution to the problem of parallel parameterized code generation for affine imperfectly nested loops. Experimental results demonstrate the effectiveness of the implemented system.


Journal of Physical Chemistry A | 2009

Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry

Albert Hartono; Qingda Lu; Thomas Henretty; Sriram Krishnamoorthy; Huaijian Zhang; Gerald Baumgartner; David E. Bernholdt; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization with common subexpression identification and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations. The second part of the paper highlights the importance of data layout transformation in the optimization of tensor contraction computations on modern processors. A number of considerations, such as minimization of cache misses and utilization of multimedia vector instructions, are discussed. A library for efficient index permutation of multidimensional tensors is described, and experimental performance data is provided that demonstrates its effectiveness.


International Parallel and Distributed Processing Symposium | 2011

Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel

Kevin Stock; Thomas Henretty; Iyyappa Murugandi; P. Sadayappan; Robert J. Harrison

In this paper, we describe a model-driven compile-time code generator that transforms a class of tensor contraction expressions into highly optimized short-vector SIMD code. We use as a case study a multi-resolution tensor kernel from the MADNESS quantum chemistry application. Performance of a C-based implementation is low, and because the dimensions of the tensors are small, performance using vendor-optimized BLAS libraries is also suboptimal. We develop a model-driven code generator that determines the optimal loop permutation and placement of vector load/store, transpose, and splat operations in the generated code, enabling portable performance on short-vector SIMD architectures. Experimental results on an SSE-based platform demonstrate the efficiency of the vector-code synthesizer.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

SDSLc: a multi-target domain-specific compiler for stencil computations

Prashant Singh Rawat; Martin Kong; Thomas Henretty; Justin Holewinski; Kevin Stock; Louis-Noël Pouchet; J. Ramanujam; Atanas Rountev; P. Sadayappan

Stencil computations are at the core of applications in a number of scientific computing domains. We describe a domain-specific language for regular stencil computations that allows specification of the computations in a concise manner. We describe a multi-target compiler for this DSL, which generates optimized code for GPUs, FPGAs, and multi-core processors with short-vector SIMD instruction sets, considering both low-order and high-order stencil computations. The hardware differences between these three types of architecture prompt different optimization strategies for the compiler. We evaluate the domain-specific compiler using a number of benchmarks on CPU, GPU, and FPGA platforms.


Software Visualization | 2015

Polyhedral user mapping and assistant visualizer tool for the R-Stream auto-parallelizing compiler

Eric Papenhausen; Bing Wang; Harper Langston; Muthu Manikandan Baskaran; Thomas Henretty; Taku Izubuchi; Ann Johnson; Chulwoo Jung; Meifeng Lin; Benoît Meister; Klaus Mueller; Richard Lethin

Existing high-level, source-to-source compilers can accept input programs in a high-level language (e.g., C) and perform complex automatic parallelization and other mappings using various optimizations. These optimizations often require trade-offs and can benefit from the user's involvement in the process. However, because of the inherent complexity, the barrier to entry for new users of these high-level optimizing compilers can often be high. We propose visualization as an effective gateway for non-expert users to gain insight into the effects of parameter choices and so aid them in the selection of levels best suited to their specific optimization goals.


International Conference on Parallel Processing | 2016

Scalable Hierarchical Polyhedral Compilation

Benoît Pradelle; Benoît Meister; Muthu Manikandan Baskaran; Athanasios Konstantinidis; Thomas Henretty; Richard Lethin

Computers across the board, from embedded to future exascale computers, are consistently designed with deeper memory hierarchies. While this opens up exciting opportunities for improving software performance and energy efficiency, it also makes it increasingly difficult to efficiently exploit the hardware. Advanced compilation techniques are a possible solution to this difficult problem and, among them, the polyhedral compilation technology provides a pathway for performing advanced automatic parallelization and code transformations. However, the polyhedral model is also known for its poor scalability with respect to the number of dimensions in the polyhedra that are used for representing programs. Although current compilers can cope with this limitation when targeting shallow hierarchies, polyhedral optimizations often become intractable as soon as deeper hardware hierarchies are considered. We address this problem by introducing two new operators for polyhedral compilers: focalisation and defocalisation. When applied in the compilation flow, the new operators reduce the dimensionality of polyhedra, which drastically simplifies the mathematical problems solved during the compilation. We prove that the presented operators preserve the original program semantics, allowing them to be safely used in compilers. We implemented the operators in a production compiler, which drastically improved its performance and scalability when targeting deep hierarchies.


IEEE High Performance Extreme Computing Conference | 2015

Automatic cluster parallelization and minimizing communication via selective data replication

Sanket Tavarageri; Benoît Meister; Muthu Manikandan Baskaran; Benoît Pradelle; Thomas Henretty; Athanasios Konstantinidis; Ann Johnson; Richard Lethin

Technology scaling has initiated two distinct trends that are likely to continue into the future: first, increased parallelism in hardware, and second, the increasing performance and energy cost of communication relative to computation. Both of these trends call for the development of compiler and runtime systems that automatically parallelize programs and reduce communication in parallel computations to achieve the desired high performance in an energy-efficient fashion. In this paper, we propose the design of an integrated compiler and runtime system that auto-parallelizes loop nests to clusters, and a novel communication avoidance method that reduces data movement between processors. Communication minimization is achieved via data replication: data is replicated so that a larger share of the whole data set may be mapped to a processor and hence non-local memory accesses are reduced. Experiments on a number of benchmarks show the effectiveness of the approach.

Collaboration


Dive into Thomas Henretty's collaboration.

Top Co-Authors

J. Ramanujam
Louisiana State University

Janice O. McMahon
University of Southern California