Publications


Featured research published by Putt Sakdhnagool.


Languages and Compilers for Parallel Computing | 2014

Evaluating Performance Portability of OpenACC

Amit Sabne; Putt Sakdhnagool; Seyong Lee; Jeffrey S. Vetter

Accelerator-based heterogeneous computing is gaining momentum in the High Performance Computing arena. However, the increased complexity of accelerator architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle the problem. While the abstraction endowed by OpenACC offers productivity, it raises questions about its portability. This paper evaluates the performance portability obtained by OpenACC on twelve OpenACC programs on NVIDIA CUDA, AMD GCN, and Intel MIC architectures. We study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.
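
For illustration, the sketch below shows the kind of directive-annotated C loop such a study targets; the loop and all names here are invented, not drawn from the paper's benchmark suite. The directives express parallelism and data movement at a high level, and how well each backend maps them is exactly what performance portability measures.

```c
#include <stdio.h>

/* Hypothetical OpenACC example (vector addition): the same annotated source
   is retargeted by an OpenACC compiler to NVIDIA, AMD, or Intel devices. */
enum { N = 1 << 20 };
static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* The copyin/copyout clauses make host-device data movement explicit;
       how efficiently each backend handles this loop and these transfers
       is the performance-portability question under study. */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```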


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

Tsung Tai Yeh; Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann; Timothy G. Rogers

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., if they contain fewer than 500 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
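
As a loose illustration of the workload shape and the runtime's role, the hypothetical sketch below hands many narrow tasks to a Pagoda-style API. All function names are invented, and the stub runs tasks inline so the sketch compiles; the real system instead enqueues tasks to a resident GPU MasterKernel that assigns them to free warps.

```c
#include <stdio.h>

/* Hypothetical client-side view of a Pagoda-style runtime; all names are
   invented for illustration and may not match the actual Pagoda API. */
typedef void (*task_fn)(void *arg);

static void pagoda_spawn(task_fn f, void *arg, int nthreads) {
    (void)nthreads;   /* real runtime: reserve enough free warps for nthreads */
    f(arg);           /* placeholder: run inline instead of on the GPU */
}

static void narrow_task(void *arg) {
    int id = *(int *)arg;
    printf("narrow task %d (e.g., 256 threads) finished\n", id);
}

int main(void) {
    int ids[8];
    /* Many small, independent tasks: the workload shape that underutilizes
       a GPU when each one is launched as its own kernel. */
    for (int i = 0; i < 8; i++) {
        ids[i] = i;
        pagoda_spawn(narrow_task, &ids[i], 256);
    }
    return 0;
}
```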


High Performance Distributed Computing | 2015

HeteroDoop: A MapReduce Programming System for Accelerator Clusters

Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann

The deluge of data has inspired big-data processing frameworks that span large clusters. Frameworks for MapReduce, a state-of-the-art programming model, have primarily made use of the CPUs in distributed systems, leaving out computationally powerful accelerators such as GPUs. This paper presents HeteroDoop, a MapReduce framework that employs both CPUs and GPUs in a cluster. HeteroDoop offers the following novel features: (i) a small set of directives can be placed on an existing sequential, CPU-only program to express MapReduce semantics; (ii) an optimizing compiler translates the directive-augmented program into GPU code; (iii) a runtime system assists the compiler in handling MapReduce semantics on the GPU; and (iv) a tail scheduling scheme minimizes job execution time in light of the disparate processing capabilities of CPUs and GPUs. This paper addresses several challenges that must be overcome to support these features. HeteroDoop is built on top of the state-of-the-art, CPU-only Hadoop MapReduce framework, inheriting its functionality. Evaluation results of HeteroDoop on recent hardware indicate that using even a single GPU per node can improve performance by up to 2.78x, with a geometric mean of 1.6x across our benchmarks, compared to CPU-only Hadoop running on a cluster with 20-core CPUs.
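
To convey the directive-based style the abstract describes, here is a hypothetical annotated sequential mapper; the directive spelling is invented for illustration and may not match HeteroDoop's actual syntax.

```c
#include <stdio.h>
#include <string.h>

/* Invented directive spelling: MapReduce semantics are declared on an
   existing sequential mapper, and the compiler/runtime take it from there.
   Ordinary compilers simply ignore the unknown pragma. */
#pragma heterodoop map
static void map_line(const char *line) {
    char buf[256];
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        printf("%s\t1\n", w);          /* emit the (word, 1) pair */
}

int main(void) {
    map_line("the quick brown fox jumps over the lazy dog");
    return 0;
}
```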


International Conference on Supercomputing | 2013

Scaling large-data computations on multi-GPU accelerators

Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann

Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes, and the overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting (COSP) that caters to arbitrary device memory sizes and automatically manages to run out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance. The mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that is able to translate an input OpenMP program to multiple GPUs attached to the same host CPU. Experimentation on a set of nine benchmarks shows that, on average, the pipelining scheme improves performance by 1.49x while limiting the runtime tuning overhead to 3% of the execution time.
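
A minimal sketch of the pipelining idea, assuming a CUDA platform: the input is split into chunks, and two streams ping-pong so that the copy of one chunk overlaps computation on the previous one. The process_chunk stub is hypothetical; a real version would launch a kernel in the given stream, and the host buffer would need to be pinned (cudaHostAlloc) for copies to truly overlap.

```c
#include <cuda_runtime.h>

/* Hypothetical stand-in for a kernel launch in stream s. */
static void process_chunk(float *d_buf, int n, cudaStream_t s) {
    (void)d_buf; (void)n; (void)s;   /* a real version launches a kernel in s */
}

void pipelined(const float *h_in, int total, int chunk) {
    cudaStream_t s[2];
    float *d_buf[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc((void **)&d_buf[i], chunk * sizeof(float));
    }
    for (int off = 0, i = 0; off < total; off += chunk, i++) {
        int n = (total - off < chunk) ? total - off : chunk;
        int b = i & 1;                 /* ping-pong between the two buffers */
        /* Copy chunk i in stream b while the other stream computes chunk i-1. */
        cudaMemcpyAsync(d_buf[b], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process_chunk(d_buf[b], n, s[b]);  /* ordered after the copy in s[b] */
    }
    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
        cudaFree(d_buf[i]);
    }
}
```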


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

Massively parallel 3D image reconstruction

Xiao Wang; Amit Sabne; Putt Sakdhnagool; Sherman J. Kisner; Charles A. Bouman; Samuel P. Midkiff

Computed Tomographic (CT) image reconstruction is an important technique used in a wide range of applications. Among reconstruction methods, Model-Based Iterative Reconstruction (MBIR) is known to produce much higher-quality CT images; however, MBIR's high computational requirements greatly restrict its application. Currently, MBIR speed is primarily limited by irregular data access patterns, the difficulty of effective parallelization, and slow algorithmic convergence. This paper presents a new algorithm for MBIR, the Non-Uniform Parallel Super-Voxel (NU-PSV) algorithm, that regularizes the data access pattern, enables massive parallelism, and ensures fast convergence. We compare the NU-PSV algorithm with two state-of-the-art implementations on a 69632-core distributed system. Results indicate that the NU-PSV algorithm achieves an average speedup of 1665x over the fastest state-of-the-art implementation.
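
For background (this notation is the standard MBIR formulation, not taken from the paper itself), the reconstruction is typically posed as a regularized weighted least-squares, or MAP, estimate:

```latex
\hat{x} \;=\; \underset{x \ge 0}{\operatorname{arg\,min}}
\left\{ \tfrac{1}{2}\,(y - Ax)^{\top} \Lambda \,(y - Ax) + R(x) \right\}
```

Here y is the measured sinogram, A the forward projection matrix, \Lambda a diagonal noise-weighting matrix, and R(x) a regularizing prior. The irregular access pattern stems from the sparse structure of A, which is what access-regularizing reorderings such as NU-PSV target.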


IEEE Micro | 2015

Understanding Portability of a High-Level Programming Model on Contemporary Heterogeneous Architectures

Amit Sabne; Putt Sakdhnagool; Seyong Lee; Jeffrey S. Vetter

Accelerator-based heterogeneous computing is gaining momentum in the high-performance computing arena. However, the increased complexity of heterogeneous architectures demands more generic, high-level programming models. OpenACC is one such attempt to tackle this problem. Although the abstraction provided by OpenACC offers productivity, it raises questions concerning both functional and performance portability. In this article, the authors propose HeteroIR, a high-level, architecture-independent intermediate representation, to map high-level programming models, such as OpenACC, to heterogeneous architectures. They present a compiler approach that translates OpenACC programs into HeteroIR and accelerator kernels to obtain OpenACC functional portability. They then evaluate the performance portability obtained by OpenACC with their approach on 12 OpenACC programs on Nvidia CUDA, AMD GCN, and Intel Xeon Phi architectures. They study the effects of various compiler optimizations and OpenACC program settings on these architectures to provide insights into the achieved performance portability.
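
The sketch below illustrates only the role of such an IR layer, not its actual design: the compiler emits calls against one generic interface, and a per-backend runtime maps each call to CUDA, OpenCL, or Xeon Phi primitives. Every name here is invented; a trivial host backend is included so the sketch compiles and runs.

```c
#include <stdlib.h>
#include <string.h>

/* Invented illustration of an architecture-independent runtime interface;
   none of these names come from the HeteroIR paper. */
typedef struct {
    void *(*alloc)(size_t bytes);
    void  (*copy_in)(void *dst, const void *src, size_t bytes);
    void  (*launch)(const char *kernel, void *args, size_t gsize, size_t lsize);
} hir_backend;

/* Trivial "host" backend so the sketch is self-contained and runnable. */
static void *host_alloc(size_t b) { return malloc(b); }
static void  host_copy_in(void *d, const void *s, size_t b) { memcpy(d, s, b); }
static void  host_launch(const char *k, void *a, size_t g, size_t l) {
    (void)k; (void)a; (void)g; (void)l; /* a real backend enqueues a kernel */
}

int main(void) {
    hir_backend be = { host_alloc, host_copy_in, host_launch };
    float h[4] = {1, 2, 3, 4};
    void *d = be.alloc(sizeof h);
    be.copy_in(d, h, sizeof h);
    be.launch("vecadd", d, 4, 1);   /* same call sequence for every target */
    free(d);
    return 0;
}
```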


Languages and Compilers for Parallel Computing | 2015

HYDRA: Extending Shared Address Programming for Accelerator Clusters

Putt Sakdhnagool; Amit Sabne; Rudolf Eigenmann

This work extends shared-address programming to accelerator clusters through a simple form of shared-address programming, named HYDRA, in which the programmer specifies only the parallel regions in the program. We present a fully automatic translation system that generates an MPI + accelerator program from a HYDRA program. Our mechanism ensures scalability of the generated program by optimizing data placement and transfers to and from the limited, discrete memories of accelerator devices. We also present a compiler design built on a high-level IR to support multiple accelerator architectures. Evaluation results demonstrate the scalability of the translated programs on five well-known benchmarks. On average, HYDRA gains a 24.54x speedup over single-accelerator performance when running on a 64-node Intel Xeon Phi cluster and a 27.56x speedup when running on a 64-node NVIDIA GPU cluster.
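
A sketch of the programming style the abstract describes, with an invented directive spelling: only the parallel region is marked, and no MPI ranks, device buffers, or transfers appear in the source; generating those is the translator's job.

```c
#include <stdio.h>

/* Hypothetical HYDRA-style source: the directive below is invented for
   illustration and is ignored by ordinary compilers. */
#define N 1024
static float a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = (float)i;

    #pragma hydra parallel for  /* only the parallel region is marked */
    for (int i = 0; i < N; i++)
        a[i] = 2.0f * b[i];

    printf("a[10] = %f\n", a[10]);
    return 0;
}
```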


International Workshop on OpenMP | 2012

Effects of compiler optimizations in OpenMP to CUDA translation

Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann

One thrust of OpenMP standard development focuses on support for accelerators. An important question is whether OpenMP extensions are needed and how much performance difference they would make. The same question is relevant for related efforts in support of accelerators, such as OpenACC. The present paper pursues this question. We analyze the effects of individual optimization techniques in a previously developed system, called OpenMPC, that translates OpenMP programs into GPU code. We also propose a new tuning strategy, called Modified IE (MIE), which overcomes some inefficiencies of the original OpenMPC tuning scheme. Furthermore, MIE addresses the challenge of tuning in the presence of runtime variations owing to the memory transfers between the CPU and GPU. MIE, on average, performs 11% better than the previous tuning system while restricting the tuning system's time complexity to a polynomial function.
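
For concreteness, the block below shows the kind of input such a translator consumes: an ordinary OpenMP worksharing loop (the loop itself is illustrative, not a benchmark from the paper). The translator maps the region to a GPU kernel and inserts the CPU-GPU transfers whose runtime variation MIE must tune around.

```c
#include <stdio.h>

/* A plain OpenMP worksharing loop: the parallel region becomes a GPU
   kernel after translation, with data transfers inserted around it. */
#define N 4096
static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (float)i;

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = x[i] * x[i];

    printf("y[3] = %f\n", y[3]);
    return 0;
}
```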


Languages and Compilers for Parallel Computing | 2016

Formalizing Structured Control Flow Graphs

Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann

Structured programs are believed to be easier to understand and more compiler-friendly [5, 10, 45]. However, compilers do not process source programs directly; they instead work on control flow graphs (CFGs) of the programs. Unfortunately, there is little formalization of structured CFGs. This paper shows how the lack of formalization has led to varying interpretations of structured CFGs. The paper then presents a new formalization of structured CFGs that eliminates the ambiguity. Structured CFGs gain importance because they ease compiler optimizations and decompilation, and they help reduce the performance degradation caused by thread divergence on SIMD units. The paper elaborates on these benefits. It also shows that compilers, both front-ends and back-ends, may generate unstructured CFGs from structured program sources, which necessitates mechanisms to obtain structured CFGs from unstructured ones.
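
An illustrative contrast, not taken from the paper: the first function's goto creates a second entry into the loop, making the CFG unstructured, while the second computes the same sum (for n >= 1) using only single-entry, single-exit regions.

```c
/* sum_unstructured's goto jumps past the loop header, giving the loop two
   entry points and hence an unstructured CFG (it also assumes n >= 1).
   sum_structured computes the same result with structured control flow. */
int sum_unstructured(const int *a, int n) {
    int i = 0, s = 0;
    goto body;               /* second entry into the loop body */
    for (; i < n; i++) {
body:
        s += a[i];
    }
    return s;
}

int sum_structured(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)   /* one entry, one exit */
        s += a[i];
    return s;
}
```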


International Conference on Parallel Architectures and Compilation Techniques | 2016

POSTER: Pagoda: A Runtime System to Maximize GPU Utilization in Data Parallel Tasks with Limited Parallelism

Tsung Tai Yeh; Amit Sabne; Putt Sakdhnagool; Rudolf Eigenmann; Timothy G. Rogers

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., if they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization. This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.

Collaboration


Dive into Putt Sakdhnagool's collaborations.

Top Co-Authors

Jeffrey S. Vetter (Oak Ridge National Laboratory)
Seyong Lee (Oak Ridge National Laboratory)
Timothy G. Rogers (University of British Columbia)