
Publications


Featured research published by Albert Hartono.


Programming Language Design and Implementation (PLDI) | 2008

A practical automatic polyhedral parallelizer and locality optimizer

Uday Bondhugula; Albert Hartono; J. Ramanujam; P. Sadayappan

We present the design and implementation of an automatic polyhedral source-to-source transformation framework that can optimize regular programs (sequences of possibly imperfectly nested loops) for parallelism and locality simultaneously. Through this work, we show the practicality of analytical model-driven automatic transformation in the polyhedral model, far beyond what is possible with current production compilers. Unlike previous work, our approach is end-to-end fully automatic, driven by an integer linear optimization framework that takes an explicit view of finding good ways of tiling for parallelism and locality using affine transformations. The framework has been implemented in a tool that automatically generates OpenMP parallel code from C program sections. Experimental results from the tool show very high speedups for local and parallel execution on multicores over state-of-the-art compiler frameworks from the research community as well as the best native production compilers. The system also enables the easy use of powerful empirical/iterative optimization for general arbitrarily nested loop sequences.
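
To make concrete the kind of transformation such a framework performs, here is a minimal hand-written sketch in C, not actual tool output: a plain matrix-multiply loop nest and a tiled, OpenMP-parallel version of it. The problem size, tile size, and function names are arbitrary; a real system derives the tiling and parallelism from its cost model and dependence analysis.

```c
#include <omp.h>

#define N 1024
#define T 32   /* tile size; chosen here by hand, by a model in a real tool */

double A[N][N], B[N][N], C[N][N];

/* Original code: a regular, perfectly nested loop the framework can analyze. */
void matmul(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Illustrative transformed version: the two outer tile loops carry no
 * dependences (each (it, jt) pair owns a distinct tile of C), so they are
 * parallelized; tiling all three loops improves cache reuse of A, B, and C. */
void matmul_tiled(void)
{
    #pragma omp parallel for collapse(2)
    for (int it = 0; it < N; it += T)
        for (int jt = 0; jt < N; jt += T)
            for (int kt = 0; kt < N; kt += T)
                for (int i = it; i < it + T; i++)
                    for (int j = jt; j < jt + T; j++)
                        for (int k = kt; k < kt + T; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```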


International Parallel and Distributed Processing Symposium (IPDPS) | 2009

Annotation-based empirical performance tuning using Orio

Albert Hartono; Boyana Norris; P. Sadayappan

For many scientific applications, significant time is spent in tuning codes for a particular high-performance architecture. Tuning approaches range from the relatively nonintrusive (e.g., by using compiler options) to extensive code modifications that attempt to exploit specific architecture features. Intrusive techniques often result in code changes that are not easily reversible, and can negatively impact readability, maintainability, and performance on different architectures. We introduce an extensible annotation-based empirical tuning system called Orio that is aimed at improving both performance and productivity. It allows software developers to insert annotations in the form of structured comments into their source code to trigger a number of low-level performance optimizations on a specified code fragment. To maximize the performance tuning opportunities, the annotation processing infrastructure is designed to support both architecture-independent and architecture-specific code optimizations. Given the annotated code as input, Orio generates many tuned versions of the same operation and empirically evaluates the alternatives to select the best performing version for production use. We have also enabled the use of the Pluto automatic parallelization tool in conjunction with Orio to generate efficient OpenMP-based parallel code. We describe our experimental results involving a number of computational kernels, including dense array and sparse matrix operations.
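
A minimal sketch of the annotation style described above, assuming a simple AXPY kernel; the structured-comment syntax here is a paraphrase for illustration, not verbatim Orio input. Because the annotations live inside ordinary C comments, the file still compiles unmodified when the tuner is not used.

```c
/* AXPY kernel marked for empirical tuning. Hypothetically, UF below is a
 * tuning parameter searched over 1..32: the tuner generates one code
 * variant per value, times each empirically, and keeps the fastest. */
void axpy(int n, double a, const double *x, double *y)
{
    /*@ begin Loop ( transform Unroll(ufactor=UF) ) @*/
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
    /*@ end @*/
}
```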


International Conference on Cluster Computing | 2006

Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters

Lei Chai; Albert Hartono; Dhabaleswar K. Panda

As processor and memory architectures advance, clusters are increasingly built from larger SMP systems, which makes MPI intra-node communication a critical issue in high-performance computing. This paper presents a new design for MPI intra-node communication that aims to achieve both high performance and good scalability in a cluster environment. The design distinguishes small and large messages and handles them differently to minimize the data transfer overhead for small messages and the memory space consumed by large messages. Moreover, the design utilizes the cache efficiently and requires no locking mechanisms to achieve optimal performance even at large system sizes. This paper also explores various optimization strategies to reduce polling overhead and maintain data locality. We have evaluated our design on NUMA (non-uniform memory access) and dual-core NUMA systems. The experimental results on the NUMA system show that the new design can improve MPI intra-node latency by up to 35% and bandwidth by up to 50% compared to MVAPICH. While running the bandwidth benchmark, the measured L2 cache miss rate is reduced by half. The new design also improves the performance of MPI collective calls by up to 25%. The results on the dual-core NUMA system show that the new design can achieve 0.48 μs CMP latency.
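
A minimal C sketch of the size-based split the design describes; the threshold, buffer layout, and function names are invented for illustration and are not MVAPICH internals. Synchronization with the receiver is elided.

```c
#include <stddef.h>
#include <string.h>

#define EAGER_THRESHOLD 8192   /* hypothetical small/large cutoff, in bytes */
#define CHUNK 4096             /* hypothetical chunk size for large messages */

/* Small message: copy the payload eagerly into a per-pair shared-memory
 * slot; one cheap copy, and the slot stays cache-resident under reuse. */
static void send_small(void *shm_slot, const void *buf, size_t len)
{
    memcpy(shm_slot, buf, len);
}

/* Large message: stream through a small fixed-size region chunk by chunk
 * (rendezvous style), so shared-memory consumption stays bounded no matter
 * how large the message is. The ready/drained handshake is elided here. */
static void send_large(void *shm_region, const void *buf, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? (len - off) : CHUNK;
        memcpy(shm_region, (const char *)buf + off, n);
        /* ...mark chunk ready; wait, via lock-free flags, for the receiver... */
    }
}

void intranode_send(void *shm, const void *buf, size_t len)
{
    if (len <= EAGER_THRESHOLD)
        send_small(shm, buf, len);
    else
        send_large(shm, buf, len);
}
```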


International Conference on Supercomputing (ICS) | 2009

Parametric multi-level tiling of imperfectly nested loops

Albert Hartono; Muthu Manikandan Baskaran; Cédric Bastoul; Albert Cohen; Sriram Krishnamoorthy; Boyana Norris; J. Ramanujam; P. Sadayappan

Tiling is a crucial loop transformation for generating high performance code on modern architectures. Efficient generation of multi-level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) facilitate runtime feedback and dynamic optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all assignment statements are contained inside the innermost loop of a loop nest. Previous solutions to tiling for imperfect loop nests have only handled fixed tile sizes. In this paper, we present an approach to parametric multi-level tiling of imperfectly nested loops. The tiling technique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling. Experimental results using a number of computational benchmarks demonstrate the effectiveness of the developed tiling approach.
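
To make the terminology concrete, here is a minimal C sketch of parametric tiling for a single loop and a single tiling level (the paper's contribution is the much harder multi-level, imperfectly nested case); the function and its names are invented. The tile size T arrives at runtime, and full tiles are split from the boundary remainder so the full-tile body iterates over exact rectangular tiles.

```c
/* Parametric tiling sketch: T is a runtime parameter, so an auto-tuner can
 * re-run the same binary with different tile sizes. Splitting off the
 * partial boundary tile leaves the full-tile loop with exact bounds,
 * making it amenable to further optimization such as register tiling. */
void scale(int n, int T, double a, double *x)
{
    int full = (n / T) * T;               /* extent covered by full tiles */
    for (int it = 0; it < full; it += T)  /* full tiles only */
        for (int i = it; i < it + T; i++)
            x[i] *= a;
    for (int i = full; i < n; i++)        /* partial boundary tile */
        x[i] *= a;
}
```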


Symposium on Code Generation and Optimization (CGO) | 2010

Parameterized tiling revisited

Muthu Manikandan Baskaran; Albert Hartono; Sanket Tavarageri; Thomas Henretty; J. Ramanujam; P. Sadayappan

Tiling, a key transformation for optimizing programs, has been widely studied in the literature. Parameterized tiled code is important for auto-tuning systems since they often execute a large number of runs with dynamically varied tile sizes. Previous work on tiled code generation has addressed parameterized tiling only in the sequential context, and the parallel case only with fixed compile-time constants for tile sizes. In this paper, we revisit the problem of generating tiled code using parametric tile sizes. We develop a systematic approach to formulating tiling transformations through the manipulation of linear inequalities, and a novel technique that overcomes the fundamental obstacle previous approaches faced in generating parallel parameterized tiled code. To the best of our knowledge, the approach proposed in this paper is the first compile-time solution to the problem of parallel parameterized code generation for affine imperfectly nested loops. Experimental results demonstrate the effectiveness of the implemented system.
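
The inequality manipulation the paper refers to can be sketched in one dimension (an illustration, not the paper's actual formulation): for a loop index i ranging over 0..N-1 and a parametric tile size T, introducing a tile iterator i_t adds the constraints tying each point to its tile, and the tile-loop bounds follow by projecting out i.

```latex
\[
  0 \le i \le N - 1, \qquad
  T\,i_t \;\le\; i \;\le\; T\,i_t + T - 1
  \quad\Longrightarrow\quad
  0 \;\le\; i_t \;\le\; \left\lfloor \frac{N-1}{T} \right\rfloor .
\]
```

Because T is a symbolic parameter, the product T * i_t makes these constraints nonlinear; this nonlinearity is exactly why standard polyhedral code generators cannot handle parametric tile sizes directly, and it is the obstacle the paper's approach works around.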


International Parallel and Distributed Processing Symposium (IPDPS) | 2010

DynTile: Parametric tiled loop generation for parallel execution on multicore processors

Albert Hartono; Muthu Manikandan Baskaran; J. Ramanujam; P. Sadayappan

Loop tiling is an important compiler transformation used for enhancing data locality and exploiting coarse-grained parallelism. Tiled codes in which tile sizes are runtime parameters (parametrically tiled codes) are important for empirical tuning systems like ATLAS. Some recent work has addressed the problem of generating sequential parametric tiled code. In this paper we describe DynTile, a system for transforming untiled sequential input C code containing affine imperfectly nested loops into parametrically tiled code for parallel execution on multicore processors. The effectiveness of the system is demonstrated using a number of benchmarks on an eight-core system.


International Conference on Computational Science (ICCS) | 2005

Automated operation minimization of tensor contraction expressions in electronic structure calculations

Albert Hartono; Alexander Sibiryakov; Marcel Nooijen; Gerald Baumgartner; David E. Bernholdt; So Hirata; Chi-Chung Lam; Russell M. Pitzer; J. Ramanujam; P. Sadayappan

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the Coupled Cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions, but the optimization problem is NP-hard. Operation minimization is an important optimization step for the Tensor Contraction Engine, a tool being developed for the automatic transformation of high-level tensor contraction expressions into efficient programs. In this paper, we develop an effective heuristic approach to the operation minimization problem, and demonstrate its effectiveness on tensor contraction expressions for coupled cluster equations.
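
The impact of associativity on operation count is visible in the smallest standard example (a textbook case, not one from the paper): contracting a column vector u, a row vector vᵀ, and a column vector w, each of length n.

```latex
\[
  (u\,v^{\mathsf{T}})\,w :\;
    \underbrace{n^2}_{\text{outer product}}
    + \underbrace{2n^2}_{\text{matrix-vector}}
    = O(n^2)
  \qquad \text{vs.} \qquad
  u\,(v^{\mathsf{T}} w) :\;
    \underbrace{2n}_{\text{dot product}}
    + \underbrace{n}_{\text{scaling}}
    = O(n).
\]
```

For chains of many tensors with many summation indices, the number of such evaluation orders grows combinatorially, which is why the minimization problem is NP-hard and a heuristic search is needed.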


Journal of Physical Chemistry A | 2009

Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry

Albert Hartono; Qingda Lu; Thomas Henretty; Sriram Krishnamoorthy; Huaijian Zhang; Gerald Baumgartner; David E. Bernholdt; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization with common subexpression identification and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations. The second part of the paper highlights the importance of data layout transformation in the optimization of tensor contraction computations on modern processors. A number of considerations, such as minimization of cache misses and utilization of multimedia vector instructions, are discussed. A library for efficient index permutation of multidimensional tensors is described, and experimental performance data is provided that demonstrates its effectiveness.
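
As a minimal illustration of the cache considerations behind such an index-permutation library, here is a blocked 2D transpose in C; the function name and block size are invented, and the real library generalizes this to higher-dimensional permutations and uses vector instructions.

```c
#define BS 32  /* block edge, chosen so a BS x BS double block fits in cache */

/* Blocked transpose: walking both arrays block by block keeps each source
 * and destination block cache-resident, instead of striding through one of
 * the two arrays with a cache miss on nearly every element. */
void transpose_blocked(int n, const double *in, double *out)
{
    for (int ib = 0; ib < n; ib += BS)
        for (int jb = 0; jb < n; jb += BS) {
            int imax = (ib + BS < n) ? ib + BS : n;
            int jmax = (jb + BS < n) ? jb + BS : n;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    out[j * n + i] = in[i * n + j];
        }
}
```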


International Parallel and Distributed Processing Symposium (IPDPS) | 2008

Towards effective automatic parallelization for multicore systems

Uday Bondhugula; Muthu Manikandan Baskaran; Albert Hartono; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan

The ubiquity of multicore processors in commodity computing systems has raised a significant programming challenge for their effective use. An attractive but challenging approach is automatic parallelization of sequential codes. Although virtually all production C compilers have automatic shared-memory parallelization capability, it is rarely used in practice by application developers because of limited effectiveness. In this paper we describe our recent efforts towards developing an effective automatic parallelization system that uses a polyhedral model for data dependences and program transformations.


International Conference on Computational Science (ICCS) | 2009

Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes

Boyana Norris; Albert Hartono; Elizabeth R. Jessup; Jeremy G. Siek

The development of optimized codes is time-consuming and requires extensive architecture, compiler, and language expertise; computational scientists are therefore often forced to choose between investing considerable time in tuning code and accepting lower performance. In this paper, we describe the first steps toward a fully automated system for the optimization of the matrix algebra kernels that are a foundational part of many scientific applications. To generate highly optimized code from a high-level MATLAB prototype, we define a three-step approach. To begin, we have developed a compiler that converts a MATLAB script into simple C code. We then use the polyhedral optimization system Pluto to optimize that code for coarse-grained parallelism and locality simultaneously. Finally, we annotate the resulting code with performance-tuning directives and use the empirical performance-tuning system Orio to generate many tuned versions of the same operation using different optimization techniques, such as loop unrolling and memory alignment. Orio performs an automated empirical search to select the best among the multiple optimized code variants. We discuss performance results on two architectures.
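
To make the pipeline concrete, here is a hypothetical example of its first step, not the paper's actual compiler output: a MATLAB prototype such as y = A*x + B*x lowered to simple C. Steps two and three would then let Pluto fuse the two loop nests into a single pass over A and B and let Orio tune the fused code empirically.

```c
/* Naive lowering of the MATLAB prototype y = A*x + B*x (A, B stored
 * row-major). Written deliberately simply: later pipeline stages, not the
 * translator, are responsible for fusion, tiling, and unrolling. */
void composed_kernel(int n, const double *A, const double *B,
                     const double *x, double *y)
{
    for (int i = 0; i < n; i++) {        /* y = A*x */
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
    for (int i = 0; i < n; i++)          /* y += B*x */
        for (int j = 0; j < n; j++)
            y[i] += B[i * n + j] * x[j];
}
```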

Collaboration


Dive into Albert Hartono's collaborations.

Top Co-Authors

J. Ramanujam
Louisiana State University

Sriram Krishnamoorthy
Oak Ridge National Laboratory

David E. Bernholdt
Oak Ridge National Laboratory