Sunita Chandrasekaran

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sunita Chandrasekaran is active.

Explore More

Publication

Featured researches published by Sunita Chandrasekaran.

languages and compilers for parallel computing | 2013

Compiling a High-level Directive-Based Programming Model for GPGPUs

Xiaonan Tian; Rengan Xu; Yonghong Yan; Zhifeng Yun; Sunita Chandrasekaran; Barbara M. Chapman

OpenACC is an emerging directive-based programming model for programming accelerators that typically enable non-expert programmers to achieve portable and productive performance of their applications. In this paper, we present the research and development challenges, and our solutions to create an open-source OpenACC compiler in a main stream compiler framework (OpenUH of a branch of Open64). We discuss in details our loop mapping techniques, i.e. how to distribute loop iterations over the GPGPU’s threading architectures, as well as their impacts on performance. The runtime support of this programming model are also presented. The compiler was evaluated with several commonly used benchmarks, and delivered similar performance to those obtained using a commercial compiler. We hope this implementation to serve as compiler infrastructure for researchers to explore advanced compiler techniques, to extend OpenACC to other programming languages, or to build performance tools used with OpenACC programs.

ieee international conference on high performance computing data and analytics | 2008

Capturing performance knowledge for automated analysis

Kevin A. Huck; Oscar R. Hernandez; Van Bui; Sunita Chandrasekaran; Barbara M. Chapman; Allen D. Malony; Lois Curfman McInnes; Boyana Norris

Automating the process of parallel performance experimentation, analysis, and problem diagnosis can enhance environments for performance-directed application development, compilation, and execution. This is especially true when parametric studies, modeling, and optimization strategies require large amounts of data to be collected and processed for knowledge synthesis and reuse. This paper describes the integration of the PerfExplorer performance data mining framework with the OpenUH compiler infrastructure. OpenUH provides auto-instrumentation of source code for performance experimentation and PerfExplorer provides automated and reusable analysis of the performance data through a scripting interface. More importantly, PerfExplorer inference rules have been developed to recognize and diagnose performance characteristics important for optimization strategies and modeling. Three case studies are presented which show our success with automation in OpenMP and MPI code tuning, parametric characterization, Pand power modeling. The paper discusses how the integration supports performance knowledge engineering across applications and feedback-based compiler optimization in general.

ieee international conference on high performance computing data and analytics | 2014

SPEC ACCEL : a Standard Application Suite for Measuring Hardware Accelerator Performance

Guido Juckeland; William C. Brantley; Sunita Chandrasekaran; Barbara M. Chapman; Shuai Che; Mathew E. Colgrove; Huiyu Feng; Alexander Grund; Robert Henschel; Wen-mei W. Hwu; Huian Li; Matthias S. Müller; Wolfgang E. Nagel; Maxim Perminov; Pavel Shelepugin; Kevin Skadron; John A. Stratton; Alexey Titov; Ke Wang; G. Matthijs van Waveren; Brian Whitney; Sandra Wienke; Rengan Xu; Kalyan Kumaran

Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrate the viability and relevance of the new metrics in comparing accelerator performance. This paper discusses the benchmark suites and selected published results in great detail.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

Exploring Programming Multi-GPUs Using OpenMP and OpenACC-Based Hybrid Model

Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman

Heterogeneous computing come with tremendous potential and is a leading candidate for scientific applications that are becoming more and more complex. Accelerators such as GPUs whose computing momentum is growing faster than ever offer application performance when compute intensive portions of an application are offloaded to them. It is quite evident that future computing architectures are moving towards hybrid systems consisting of multi-GPUs and multi-core CPUs. A variety of high-level languages and software tools can simplify programming these systems. Directive-based programming models are being embraced since they not only ease programming complex systems but also abstract low-level details from the programmer. We already know that OpenMP has been making programming CPUs easy and portable. Similarly, a directive-based programming model for accelerators is OpenACC that is gaining popularity since the directives play an important role in developing portable software for GPUs. A combination of OpenMP and OpenACC, a hybrid model, is a plausible solution to port scientific applications to heterogeneous architectures especially when there is more than one GPU on a single node to port an application to. However OpenACC meant for accelerators is yet to provide support for multi-GPUs. But using OpenMP we could conveniently exploit features such as for and section to distribute compute intensive kernels to more than one GPU. We demonstrate the effectiveness of this hybrid approach with some case studies in this paper.

international workshop on openmp | 2012

An OpenMP 3.1 validation testsuite

Cheng Wang; Sunita Chandrasekaran; Barbara M. Chapman

Parallel programming models are evolving so rapidly that it needs to be ensured that OpenMP can be used easily to program multicore devices. There is also effort involved in getting OpenMP to be accepted as a de facto standard in the embedded system community. However, in order to ensure correctness of OpenMPs implementation, there is a requirement of an up-to-date validation suite. In this paper, we present a portable and robust validation testsuite execution environment to validate the OpenMP implementation in several compilers. We cover all the directives and clauses of OpenMP until the latest release, OpenMP Version 3.1. Our primary focus is to determine and evaluate the correctness of the OpenMP implementation in our research compiler, OpenUH and few others such as Intel, Sun/Oracle and GNU. We also aim to find the ambiguities in the OpenMP specification and help refine the same with the validation suite. Furthermore, we also include deeper tests such as cross tests and orphan tests in the testsuite.

programming models and applications for multicores and manycores | 2013

libEOMP: a portable OpenMP runtime library based on MCA APIs for embedded systems

Cheng Wang; Sunita Chandrasekaran; Barbara M. Chapman; Jim Holt

In recent years rapid revolution of Multiprocessor System-on-Chip (MPSoC) poses new challenges for programming such architectures in an efficient manner. In order to explore potential hardware concurrency, software developers are still expected to handle many of the low-level details of programming including utilizing DMA, ensuring cache co-herency, and inserting synchronization primitives explicitly. Software portability is yet another issue: the state-of-the-art is that hardware vendors supply vendor-specific software development toolchains which makes it harder for applications to be ported to many different possible architectures without re-structuring the code, while at the same time ensuring efficiency. In this paper, we extend the usage of a high-level programming model, OpenMP, to multicore embedded systems. To address the architectural challenges, we propose a lightweight unified OpenMP runtime library, libEOMP, by leveraging the MCA (Multicore Association) APIs as the target of our OpenMP translation. MCA APIs support device-level communication and resource management for multicore embedded systems. We have implemented and evaluated libEOMP on an embedded platform supplied by Freescale Semiconductor. We observed that libEOMP not only performed as well as optimized vendor-specific OpenMP runtime libraries but also achieved better portability, programmability and productivity.

ieee international conference on high performance computing data and analytics | 2012

Energy Analysis of Parallel Scientific Kernels on Multiple GPUs

Sayan Ghosh; Sunita Chandrasekaran; Barbara M. Chapman

A dramatic improvement in energy efficiency is mandatory for sustainable supercomputing and has been identified as a major challenge. Affordable energy solution continues to be of great concern in the development of the next generation of supercomputers. Low power processors, dynamic control of processor frequency and heterogeneous systems are being proposed to mitigate energy costs. However, the entire software stack must be re-examined with respect to its ability to improve efficiency in terms of energy as well as performance. In order to address this need, a better understanding of the energy behavior of applications is essential. In this paper we explore the energy efficiency of some common kernels used in high performance computing on a multi-GPU platform, and compare our results with multicore CPUs. We implement these kernels using optimized libraries like FFTW, CUBLAS and MKL. Our experiments demonstrate a relationship between energy consumption and computation-communication factors of certain application kernels. In general, we observe that the correlation of energy consumption to GPU global memory accesses is 0.73 and power consumption to operations per unit time is 0.84, signifying a strong positive relationship between them. We believe that our results will assist the HPC community in understanding the power/energy behavior of scientific kernels on multi-GPU platforms.

languages and compilers for parallel computing | 2014

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model

Rengan Xu; Xiaonan Tian; Sunita Chandrasekaran; Yonghong Yan; Barbara M. Chapman

The broad adoption of accelerators boosts the interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly but it is proprietary and only applicable to GPUs, and the difficulty in writing efficient CUDA code has kindled the necessity to create higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC offer programmers an option to rapidly create prototype applications by adding annotations to guide compiler optimizations. In this paper we study the effectiveness of a high-level directive based programming model, OpenACC, for parallelizing NAS Parallel Benchmarks (NPB) on GPGPUs. We present the application of techniques such as array privatization, memory coalescing, cache optimization and examine their impact on the performance of the benchmarks. The right choice or combination of techniques/hints are crucial for compilers to generate highly efficient codes tuned to a particular type of accelerator. Poorly selected choice or combination of techniques can lead to degraded performance. We also propose a new clause, ‘scan’, that handles scan operations for arbitrary input array size. We hope that the practices discussed in this paper will provide useful guidance to users to effectively migrate their sequential/CPU-parallel codes to GPGPU architectures and achieve optimal performance.

international parallel and distributed processing symposium | 2014

A Validation Testsuite for OpenACC 1.0

Cheng Wang; Rengan Xu; Sunita Chandrasekaran; Barbara M. Chapman; Oscar R. Hernandez

Directive-based programming models provide high-level of abstraction thus hiding complex low-level details of the underlying hardware from the programmer. One such model is OpenACC that is also a portable programming model allowing programmers to write applications that offload portions of work from a host CPU to an attached accelerator (GPU or a similar device). The model is gaining popularity and being used for accelerating many types of applications, ranging from molecular dynamics codes to particle physics models. It is critical to evaluate the correctness of the OpenACC implementations and determine its conformance to the specification. In this paper, we present a robust and scalable testing infrastructure that serves this purpose. We worked very closely with three main vendors that offer compiler support for OpenACC and assisted them in identifying and resolving compiler bugs helping them improve the quality of their compilers. The testsuite also aims to identify and resolve ambiguities within the OpenACC specification. This testsuite has been integrated into the harness infrastructure of the TITAN machine at Oak Ridge National Lab and is being used for production. The testsuite consists of test cases for all the directives and clauses of OpenACC, both for C and Fortran languages. The testsuite discussed in this paper focuses on the OpenACC 1.0 feature set. The framework of the testsuite is robust enough to create test cases for 2.0 and future releases. This work is in progress.

languages, compilers, and tools for embedded systems | 2013

Portable mapping of openMP to multicore embedded systems using MCA APIs

Cheng Wang; Sunita Chandrasekaran; Peng Sun; Barbara M. Chapman; Jim Holt

Multicore embedded systems are being widely used in telecommunication systems, robotics, medical applications and more.While they offer a high-performance with low-power solution, programming in an efficient way is still a challenge. In order to exploit the capabilities that the hardware offers, software developers are expected to handle many of the low-level details of programming including utilizing DMA, ensuring cache coherency, and inserting synchronization primitives explicitly. The state-of-the-art involves solutions where the software toolchain is too vendor-specific thus tying the software to a particular hardware leaving no room-for portability. In this paper we present a runtime system to explore mapping a high-level programming model, OpenMP, on to multicore embedded systems. A key feature of our scheme is that unlike the existing approaches that largely rely on POSIX threads, our approach leverages the Multicore Association (MCA) APIs as an OpenMP translation layer. The MCA APIs is a set of low-level APIs handling resource management, inter-process communications and task scheduling for multicore embedded systems. By deploying the MCA APIs, our runtime is able to effectively capture the characteristics of multicore embedded systems compared with the POSIX threads. Furthermore, the MCA layer enables our runtime implementation to be portable across various architectures. Thus programmers only need to maintain a single OpenMP code base which is compatible by various compilers, while on the other hand, the code is portable across different possible types of platforms. We have evaluated our runtime system using several embedded benchmarks. The experiments demonstrate promising and competitive performance compared to the native approach for the platform.

Explore More