Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where K. V. R. Murthy is active.

Publication


Featured research published by K. V. R. Murthy.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Effective sampling-driven performance tools for GPU-accelerated supercomputers

Milind Chabbi; K. V. R. Murthy; Mike Fagan; John M. Mellor-Crummey

Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with the sampling-based methods employed on CPUs. In addition, we introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Some highlights of our case studies: 1) we improved the performance of LULESH 1.0 by 30%, 2) we identified a hardware performance problem on Keeneland, 3) we identified a scaling problem in LAMMPS that stems from CUDA initialization, and 4) we identified a performance problem caused by GPU synchronization operations that suffer delays due to blocking system calls.
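
The fourth case-study finding, delays from blocking system calls in GPU synchronization, can be illustrated outside the paper's tooling. The sketch below is not the sampling infrastructure described above; it is a minimal host-side C++ program using the CUDA runtime (assuming a CUDA-capable node) that selects the host's wait strategy and times a synchronized transfer.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // Choose the host's wait strategy before the CUDA context is created:
    // cudaDeviceScheduleBlockingSync makes the waiting host thread block in a
    // system call; cudaDeviceScheduleSpin would make it busy-wait instead.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const size_t bytes = 64 << 20;            // 64 MiB test buffer
    void *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, bytes);             // pinned host memory
    cudaMalloc(&dev, bytes);

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();                  // the wait whose cost depends on the flag
    auto t1 = std::chrono::steady_clock::now();

    std::printf("synchronized copy took %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```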


International Parallel and Distributed Processing Symposium | 2013

Managing Asynchronous Operations in Coarray Fortran 2.0

Chaoran Yang; K. V. R. Murthy; John M. Mellor-Crummey

As the gap between processor speed and network latency continues to increase, avoiding exposed communication latency is critical for high performance on modern supercomputers. One can hide communication latency by overlapping it with computation using non-blocking data transfers, or avoid exposing communication latency by moving computation to the location of the data it manipulates. Coarray Fortran 2.0 (CAF 2.0), a partitioned global address space language, provides a rich set of asynchronous operations for avoiding exposed latency, including asynchronous copies, function shipping, and asynchronous collectives. CAF 2.0 provides event variables to manage completion of asynchronous operations that use explicit completion. This paper describes CAF 2.0's finish and cofence synchronization constructs, which enable one to manage implicit completion of asynchronous operations. finish ensures global completion of a set of asynchronous operations across the members of a team. Because of CAF 2.0's SPMD model, its semantics and implementation of finish differ significantly from those of finish in X10 and Habanero-C. cofence controls local data completion of implicitly synchronized asynchronous operations. Together these constructs provide the ability to tune a program's performance by exploiting the difference between local data completion, local operation completion, and global completion of asynchronous operations, while hiding network latency. We explore subtle interactions between cofence, finish, events, asynchronous copies and collectives, and function shipping, and we justify their presence in a relaxed memory model for CAF 2.0. We demonstrate the utility of these constructs in the context of two benchmarks: Unbalanced Tree Search (UTS) and HPC Challenge RandomAccess. We achieve 74-77% parallel efficiency on 4K-32K cores for UTS using the T1WL specification, which demonstrates scalable performance using our synchronization constructs. Our cofence micro-benchmark shows that, for a producer-consumer scenario, using local data completion rather than local operation completion yields superior performance.
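
The distinction between local completion and global completion is not unique to CAF 2.0. The following sketch is an MPI analogy, not CAF 2.0 code: MPI_Wait on a nonblocking send gives local completion (the buffer may be reused even though delivery is not yet guaranteed), while a communicator-wide barrier after all receives have completed plays a role loosely analogous to finish.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> buf(1024, static_cast<double>(rank));

    if (rank == 0 && size > 1) {
        MPI_Request req;
        MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, 1, 0,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   // local completion: buf is reusable,
                                             // but delivery is not yet guaranteed
        std::fill(buf.begin(), buf.end(), 42.0);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // data delivered on the receiver
    }

    MPI_Barrier(MPI_COMM_WORLD);             // team-wide completion point
    MPI_Finalize();
    return 0;
}
```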


IEEE Transactions on Parallel and Distributed Systems | 2016

MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Ashwin M. Aji; Lokendra S. Panwar; Feng Ji; K. V. R. Murthy; Milind Chabbi; Pavan Balaji; Keith R. Bisset; James Dinan; Wu-chun Feng; John M. Mellor-Crummey; Xiaosong Ma; Rajeev Thakur

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, leaving applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on the accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimal contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.
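
To make the problem statement concrete, the sketch below contrasts the traditional staged path with the direct path that an accelerator-aware MPI enables. It uses generic CUDA-runtime and MPI calls only; the MPI-ACC API itself is not shown, and the function names are illustrative.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Traditional path: without accelerator-aware MPI, GPU data must be staged
// through host memory before it can be communicated.
void send_gpu_buffer_staged(const double* d_buf, int n, int dst, MPI_Comm comm) {
    std::vector<double> h_buf(n);
    cudaMemcpy(h_buf.data(), d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf.data(), n, MPI_DOUBLE, dst, 0, comm);
}

// Direct path: with an accelerator-aware MPI (e.g. MPI-ACC or a CUDA-aware MPI
// build), the device pointer can be handed to MPI directly; the library
// pipelines the transfer internally instead of the application staging it.
void send_gpu_buffer_direct(const double* d_buf, int n, int dst, MPI_Comm comm) {
    MPI_Send(const_cast<double*>(d_buf), n, MPI_DOUBLE, dst, 0, comm);
}
```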


Irregular Applications: Architectures and Algorithms | 2016

Optimized distributed work-stealing

Vivek A. Kumar; K. V. R. Murthy; Vivek Sarkar; Yili Zheng

Work-stealing is a popular approach for dynamic load balancing of task-parallel programs. However, as has been widely studied, the use of classical work-stealing algorithms on massively parallel and distributed supercomputers introduces several performance issues. One such issue is the overhead of failed steals (communicating with a victim that has no work), which is far more severe in the distributed context than within a single SMP node. Because of the cost of inter-node communication, it is critical to reduce the number of failed steals in a distributed context. Prior work has demonstrated that load-aware victim processor selection can reduce the number of failed steals, but it cannot eliminate them completely. In this paper, we present two load-aware implementations of a distributed work-stealing algorithm in the HabaneroUPC++ PGAS library: BaselineWS and SuccessOnlyWS. BaselineWS follows prior work in implementing a distributed work-stealing strategy. SuccessOnlyWS implements a novel distributed work-stealing strategy that completely eliminates failed inter-node steal attempts by introducing a new policy for moving work from busy to idle processors. This strategy also avoids querying the same processor multiple times with failed steals. We evaluate both BaselineWS and SuccessOnlyWS on up to 12,288 cores of Edison, a Cray XC30 supercomputer, using dynamic irregular applications, as exemplified by the UTS and NQueens benchmarks. We demonstrate that SuccessOnlyWS provides performance improvements of up to 7% over BaselineWS.
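
A minimal shared-memory sketch of the success-only policy idea follows. It is not the HabaneroUPC++ implementation, and the names (Worker, Scheduler, push_surplus) are invented for illustration; it only shows how registering idle workers and pushing surplus work to them avoids failed steal probes.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> local;     // private work queue
    std::mutex inbox_mutex;
    std::deque<Task> inbox;     // tasks pushed here by busy workers
};

struct Scheduler {
    std::vector<Worker> workers;
    std::mutex idle_mutex;
    std::deque<int> idle_list;  // ids of workers that announced they are idle

    explicit Scheduler(int n) : workers(n) {}

    // Called by a worker that ran out of local work: announce idleness once
    // instead of repeatedly probing victims that may have nothing to give.
    void announce_idle(int id) {
        std::lock_guard<std::mutex> g(idle_mutex);
        idle_list.push_back(id);
    }

    // Called by a busy worker that has surplus tasks: hand one directly to a
    // registered idle worker. Returns false only if nobody is idle, so every
    // attempted work transfer to a worker succeeds.
    bool push_surplus(Task t) {
        int target;
        {
            std::lock_guard<std::mutex> g(idle_mutex);
            if (idle_list.empty()) return false;
            target = idle_list.front();
            idle_list.pop_front();
        }
        std::lock_guard<std::mutex> g(workers[target].inbox_mutex);
        workers[target].inbox.push_back(std::move(t));
        return true;
    }
};
```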


International Scholarly Research Notices | 2011

Rare Earth Doped Alkali Earth Sulfide Phosphors for White-Light LEDs

Kuthuru Suresh; K. V. R. Murthy; Ch. Atchyutha Rao; N.V. Poornachandra Rao

CaS:Eu and SrS:Eu phosphors were synthesized by solid-state reaction, and the effects of doping concentration on the luminescent properties of the phosphors were investigated. The samples were excited using an electroluminescent blue light-emitting diode (460 nm) to examine them as potential coating phosphors for white-light LEDs. The excitation and emission spectra of these phosphors are broadband, which can be viewed as the typical emission of Eu2+.


9th International Conference on Partitioned Global Address Space Programming Models | 2015

A Compiler Transformation to Overlap Communication with Dependent Computation

K. V. R. Murthy; John M. Mellor-Crummey

Hiding communication latency is essential for achieving scalable performance on current and future parallel systems. In this extended abstract, we present a novel compiler transformation that overlaps communication with computation to hide communication latency. Unlike prior work, we achieve this overlap even in the presence of an overlap-inhibiting data dependence between the communication and the computation. We do so by transforming the data dependence into an overlap-amenable one. To achieve this overlap, the Maunam compiler transforms the code by employing array expansion, partial loop peeling, loop alignment, and array contraction. This transformation is useful for optimizing systolic, communication-avoiding algorithms.
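
The sketch below is a hand-written illustration of the kind of overlap such a transformation produces, not Maunam output: the boundary work is peeled and its result sent with a nonblocking MPI call, and the independent interior computation proceeds while the message is in flight. The stencil and function names are illustrative.

```cpp
#include <mpi.h>
#include <vector>

// Simple 1D relaxation of one row of a row-major nrows x ncols array.
void relax_row(std::vector<double>& a, int row, int ncols) {
    for (int j = 1; j + 1 < ncols; ++j)
        a[row * ncols + j] = 0.5 * (a[row * ncols + j - 1] + a[row * ncols + j + 1]);
}

void step_with_overlap(std::vector<double>& a, int nrows, int ncols,
                       int neighbor, MPI_Comm comm) {
    relax_row(a, 0, ncols);                       // peeled: boundary row computed first

    MPI_Request req;
    MPI_Isend(&a[0], ncols, MPI_DOUBLE, neighbor, 0, comm, &req);

    for (int i = 1; i < nrows; ++i)               // interior overlaps the send
        relax_row(a, i, ncols);

    MPI_Wait(&req, MPI_STATUS_IGNORE);            // complete before reusing row 0
}
```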


International Conference on Parallel Architectures and Compilation Techniques | 2015

Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems

K. V. R. Murthy; John M. Mellor-Crummey

Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms was developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a first step towards automating the development of communication-avoiding libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global-view sketch of a .5D algorithm that is expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wrap-around communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about the communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 supercomputer achieves 59 TFLOPS (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.
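
As a small illustration of the wrap-around pattern mentioned above, the sketch below performs a CSHIFT-style circular shift across MPI ranks using modulo arithmetic for the neighbor indices. It is illustrative C++ with MPI, not code generated by Maunam.

```cpp
#include <mpi.h>
#include <vector>

// Circular shift: every rank sends its block one rank "up" and receives the
// block from one rank "down", with wrap-around expressed via modulo.
void circular_shift(std::vector<double>& block, MPI_Comm comm) {
    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int dst = (rank + 1) % size;             // wrap-around neighbor via modulo
    int src = (rank - 1 + size) % size;

    std::vector<double> incoming(block.size());
    MPI_Sendrecv(block.data(), static_cast<int>(block.size()), MPI_DOUBLE, dst, 0,
                 incoming.data(), static_cast<int>(incoming.size()), MPI_DOUBLE, src, 0,
                 comm, MPI_STATUS_IGNORE);
    block.swap(incoming);
}
```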


High Performance Distributed Computing | 2013

On the efficacy of GPU-integrated MPI for scientific applications

Ashwin M. Aji; Lokendra S. Panwar; Feng Ji; Milind Chabbi; K. V. R. Murthy; Pavan Balaji; Keith R. Bisset; James Dinan; Wu-chun Feng; John M. Mellor-Crummey; Xiaosong Ma; Rajeev Thakur


Radiation Protection Dosimetry | 2006

Compact fluorescent lamp phosphors in accidental radiation monitoring.

K. V. R. Murthy; S. P. Pallavi; Rahul Ghildiyal; Manish C Parmar; Y. S. Patel; V. Ravi Kumar; A. S. Sai Prasad; Vishnu Natarajan


Advanced Materials Letters | 2014

Photoluminescence Properties Of Eu3+, Ce3+ Doped LaPO4 Phosphors

Niyaz Shaik; N. Rao; K. V. R. Murthy

Collaboration


Dive into K. V. R. Murthy's collaborations.

Top Co-Authors

Feng Ji

North Carolina State University


Pavan Balaji

Argonne National Laboratory

Rajeev Thakur

Argonne National Laboratory
