Publication


Featured research published by Akshay Venkatesh.


International Conference on Parallel Processing | 2013

Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs

Sreeram Potluri; Khaled Hamidouche; Akshay Venkatesh; Devendar Bureddy; Dhabaleswar K. Panda

GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement continues to be a critical bottleneck in harnessing the full potential of a GPU. Data in the GPU memory has to be moved into the host memory before it can be sent over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using techniques like pipelining. GPUDirect RDMA is a feature introduced in CUDA 5.0 that allows third-party devices like network adapters to directly access data in GPU device memory over the PCIe bus. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand clusters. In this paper, we evaluate the first version of GPUDirect RDMA for InfiniBand and propose designs in the MVAPICH2 MPI library to efficiently take advantage of this feature. We highlight the limitations posed by current-generation architectures in effectively using GPUDirect RDMA and address these issues through novel designs in MVAPICH2. To the best of our knowledge, this is the first work to demonstrate a solution for internode GPU-to-GPU MPI communication using GPUDirect RDMA. Results show that the proposed designs improve the latency of internode GPU-to-GPU communication using MPI_Send/MPI_Recv by 69% and 32% for 4-byte and 128-KByte messages, respectively. The designs boost the uni-directional bandwidth achieved using 4-KByte and 64-KByte messages by 2x and 35%, respectively. We demonstrate the impact of the proposed designs using two end applications: LBMGPU and AWP-ODC. They improve the communication times in these applications by up to 35% and 40%, respectively.
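
The practical upshot of a GPUDirect-RDMA-capable MPI such as MVAPICH2 is that device pointers can be handed directly to MPI calls. Below is a minimal sketch of internode GPU-to-GPU communication under that assumption (a CUDA-aware MPI build, one GPU per node; buffer size and tag are illustrative):

```c
/* Minimal GPU-to-GPU message between two MPI ranks.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR), which accepts
 * device pointers directly in MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const size_t nbytes = 128 * 1024;   /* 128-KByte message, as in the evaluation above */
    void *d_buf = NULL;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    cudaSetDevice(0);                   /* one GPU per node assumed */
    cudaMalloc(&d_buf, nbytes);
    cudaMemset(d_buf, rank, nbytes);

    if (rank == 0) {
        /* The device pointer goes straight to MPI: the library moves the data
         * via GPUDirect RDMA or internal pipelining, with no explicit cudaMemcpy. */
        MPI_Send(d_buf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```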


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters

Sreeram Potluri; Devendar Bureddy; Khaled Hamidouche; Akshay Venkatesh; Krishna Chaitanya Kandalla; Hari Subramoni; Dhabaleswar K. Panda

Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of performance on a single chip while providing x86_64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low-latency path for internode communication. However, drawbacks in state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes. They improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
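
The latency improvements quoted above are typically reported from ping-pong style micro-benchmarks; which path is exercised (host-to-Phi, Phi-to-Phi on different nodes) is decided purely by where the two ranks are placed by the launcher. A rough sketch of such a measurement, with iteration counts chosen for illustration only:

```c
/* OSU-style ping-pong latency sketch between ranks 0 and 1. The code is
 * ordinary MPI; the measured path depends on rank placement. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    enum { ITERS = 1000, SKIP = 100 };   /* warm-up iterations excluded from timing */
    const int nbytes = 4 * 1024;
    char *buf = malloc(nbytes);
    double start = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERS + SKIP; i++) {
        if (i == SKIP) start = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (MPI_Wtime() - start) * 1e6 / (2.0 * ITERS));

    free(buf);
    MPI_Finalize();
    return 0;
}
```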


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Efficient Intra-node Communication on Intel-MIC Clusters

Sreeram Potluri; Akshay Venkatesh; Devendar Bureddy; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda

Accelerators and coprocessors have become a key component in modern supercomputing systems due to the superior performance per watt that they offer. Intel's Xeon Phi coprocessor packs up to 1 TFLOP of double-precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. This makes it an attractive choice for accelerating HPC applications. The Xeon Phi provides several channels for communication between MPI processes running on the coprocessor and the host. While supporting POSIX shared memory within the coprocessor, it exposes a low-level API called the Symmetric Communication Interface (SCIF) that gives direct control of the DMA engine to the user. SCIF can also be used for communication between the coprocessor and the host. Xeon Phi also provides an implementation of the InfiniBand (IB) Verbs interface that enables a direct communication link with the InfiniBand adapter for communication between the coprocessor and the host. In this paper, we propose and evaluate design alternatives for efficient communication on a node with a Xeon Phi coprocessor. We incorporate our designs in the popular MVAPICH2 MPI library. We use shared memory, IB Verbs and SCIF to design a hybrid solution that improves the MPI communication latency from Xeon Phi to the host by 70% for 4-MByte messages, compared to an out-of-the-box version of MVAPICH2. Our solution delivers more than 6x improvement in peak uni-directional bandwidth from Xeon Phi to the host and more than 3x improvement in bi-directional bandwidth. Through our designs, we are able to improve the performance of 16-process Gather, Alltoall and Allgather collective operations by 70%, 85% and 80%, respectively, for 4-MByte messages. We further evaluate our designs using application benchmarks and show improvements of up to 18% with a 3D Stencil kernel and up to 11.5% with the P3DFFT library.
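
The uni-directional bandwidth figures cited above are conventionally measured by keeping a window of non-blocking sends in flight between two ranks. A minimal sketch of that pattern follows; the window size and message size are illustrative, and which intra-node path (host to coprocessor, coprocessor to coprocessor) is measured again depends only on rank placement:

```c
/* Uni-directional bandwidth sketch: rank 0 posts a window of non-blocking
 * sends to rank 1, which pre-posts matching receives; a short ack closes
 * each window. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64
#define NBYTES (4 * 1024 * 1024)   /* 4-MByte messages, as in the abstract */
#define ITERS  20

int main(int argc, char **argv)
{
    char *buf = malloc((size_t)NBYTES);
    char ack = 0;
    MPI_Request req[WINDOW];
    double start, elapsed;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, NBYTES, MPI_BYTE, 1, w, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, NBYTES, MPI_BYTE, 0, w, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 999, MPI_COMM_WORLD);
        }
    }
    elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n",
               (double)NBYTES * WINDOW * ITERS / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```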


EuroMPI'12 Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

OMB-GPU: a micro-benchmark suite for evaluating MPI libraries on GPU clusters

Devendar Bureddy; Hao Wang; Akshay Venkatesh; Sreeram Potluri; Dhabaleswar K. Panda

General-Purpose Graphics Processing Units (GPGPUs) are becoming a common component of modern supercomputing systems. Many MPI applications are being modified to take advantage of the superior compute potential offered by GPUs. To facilitate this process, many MPI libraries are being extended to support MPI communication from GPU device memory. However, there is a lack of a standardized benchmark suite that helps users evaluate common communication models on GPU clusters and make a fair comparison between different MPI libraries. In this paper, we extend the widely used OSU Micro-Benchmarks (OMB) suite with benchmarks that evaluate the performance of point-to-point, multi-pair and collective MPI communication for different GPU cluster configurations. Benefits of the proposed benchmarks for the MVAPICH2 and OpenMPI libraries are illustrated.
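
The communication models such a suite covers differ mainly in where each buffer lives: host memory ("H") or GPU device memory ("D"). A small sketch of how a benchmark might switch buffer placement while passing the resulting pointer to the same MPI calls, assuming a CUDA-aware MPI library; the helper names are illustrative, not the suite's API:

```c
/* Buffer-placement helpers in the spirit of OMB-GPU's H/D options:
 * 'H' allocates host memory, 'D' allocates GPU device memory via cudaMalloc.
 * Either pointer is then passed to MPI_Send/MPI_Recv in the benchmark loop. */
#include <cuda_runtime.h>
#include <stdlib.h>

static void *alloc_msg_buffer(char placement, size_t nbytes)
{
    void *buf = NULL;
    if (placement == 'D') {
        if (cudaMalloc(&buf, nbytes) != cudaSuccess)
            return NULL;
        cudaMemset(buf, 0, nbytes);      /* initialize device-resident buffer */
    } else {                             /* 'H': plain host memory */
        buf = calloc(1, nbytes);
    }
    return buf;
}

static void free_msg_buffer(char placement, void *buf)
{
    if (placement == 'D')
        cudaFree(buf);
    else
        free(buf);
}

int main(void)
{
    void *d = alloc_msg_buffer('D', 1 << 20);   /* device-resident message buffer */
    void *h = alloc_msg_buffer('H', 1 << 20);   /* host-resident message buffer */
    /* ... hand either pointer to the point-to-point or collective test ... */
    free_msg_buffer('D', d);
    free_msg_buffer('H', h);
    return 0;
}
```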


High Performance Interconnects | 2013

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Krishna Chaitanya Kandalla; Akshay Venkatesh; Khaled Hamidouche; Sreeram Potluri; Devendar Bureddy; Dhabaleswar K. Panda

The emergence of co-processors such as Intel Many Integrated Cores (MICs) is changing the landscape of supercomputing. The MIC is a memory-constrained environment and its processors operate at slower clock rates. Furthermore, the communication characteristics between MIC processes differ from those between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvement in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
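
From the application side, the collectives tuned by such a framework are invoked exactly as on a homogeneous cluster; the MIC-aware algorithm selection happens inside the MPI library. A minimal timed sketch of the two operations studied (message size is illustrative):

```c
/* Timing MPI_Bcast and MPI_Allreduce; any architecture-aware algorithm
 * selection is internal to the MPI library and invisible at this level. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;                 /* 1M doubles, illustrative */
    double *buf = malloc(count * sizeof(double));
    double *sum = malloc(count * sizeof(double));
    double t0, t1, t2;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < count; i++) buf[i] = rank + i * 1e-6;

    MPI_Barrier(MPI_COMM_WORLD);               /* align ranks before timing */
    t0 = MPI_Wtime();
    MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    MPI_Allreduce(buf, sum, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t2 = MPI_Wtime();

    if (rank == 0)
        printf("Bcast: %.3f ms, Allreduce: %.3f ms\n",
               (t1 - t0) * 1e3, (t2 - t1) * 1e3);

    free(buf);
    free(sum);
    MPI_Finalize();
    return 0;
}
```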


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

A case for application-oblivious energy-efficient MPI runtime

Akshay Venkatesh; Abhinav Vishnu; Khaled Hamidouche; Nathan R. Tallent; Dhabaleswar K. Panda; Darren J. Kerbyson; Adolfy Hoisie

Power has become a major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides a potential for energy and power savings if an appropriate power-reduction technique such as core idling or Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, the increasing use of adaptive and data-dependent workloads, combined with system factors (OS noise, congestion), negates this assumption. This paper proposes and implements Energy Aware MPI (EAM), an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance-degradation limits. Each power lever incurs a time overhead, which must be amortized over slack to minimize degradation. When the predicted communication time exceeds a lever's overhead, the lever is used as soon as possible, to maximize energy efficiency. When a misprediction occurs, the lever(s) are applied automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.
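
The core runtime decision described above can be stated compactly: a power lever is only worth applying when the slack it would overlap exceeds the lever's own transition overhead. The sketch below captures that check only; the function names and the overhead constant are assumptions for illustration, not EAM's implementation:

```c
/* Sketch of the slack-vs-overhead test described in the abstract. The names
 * and the overhead constant are hypothetical; apply_power_lever() and
 * release_power_lever() stand in for core idling or a DVFS transition that a
 * real runtime would drive through cpufreq/idle states. */
#include <stdbool.h>

#define LEVER_OVERHEAD_SEC 50e-6        /* assumed cost of one lever transition */

static void apply_power_lever(void)   { /* e.g., request a lower core frequency (stub) */ }
static void release_power_lever(void) { /* restore the nominal frequency (stub) */ }

/* Called before a blocking MPI call with the runtime's prediction of how long
 * the call will wait (the slack). Returns whether the lever was engaged, so
 * the caller can release it when the call completes. */
static bool maybe_reduce_power(double predicted_slack_sec)
{
    if (predicted_slack_sec > LEVER_OVERHEAD_SEC) {
        /* Enough slack to amortize the transition: engage as soon as possible. */
        apply_power_lever();
        return true;
    }
    /* Not enough predicted slack: stay at full power. Mispredictions are
     * handled by re-evaluating at intervals while the call is still blocked. */
    return false;
}

int main(void)
{
    /* Example: 200 us of predicted slack comfortably exceeds the assumed 50 us overhead. */
    bool engaged = maybe_reduce_power(200e-6);
    if (engaged)
        release_power_lever();
    return 0;
}
```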


OpenSHMEM 2014 Proceedings of the First Workshop on OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools - Volume 8356 | 2014

A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters

Jithin Jose; Jie Zhang; Akshay Venkatesh; Sreeram Potluri; Dhabaleswar K. Panda

OpenSHMEM is an open standard that brings together several long-standing, vendor-specific SHMEM implementations and allows applications to use SHMEM in a platform-independent fashion. Several implementations of OpenSHMEM have become available on clusters interconnected by InfiniBand, which has gradually become the de facto standard for high performance network interconnects. In this paper, we present a detailed comparison and analysis of the performance of different OpenSHMEM implementations, using micro-benchmarks and application kernels. This study, conducted on the TACC Stampede system using up to 4,096 cores, provides a useful guide for application developers to understand and contrast the various implementations and to select the one that works best for their applications.
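
The micro-benchmarks in such a comparison exercise the basic one-sided OpenSHMEM operations. A minimal put example against a statically allocated symmetric buffer is sketched below, using the modern OpenSHMEM API (shmem_init/shmem_finalize; implementations contemporary with this paper used start_pes instead); sizes are illustrative:

```c
/* Minimal OpenSHMEM example: each PE writes its ID into the symmetric
 * array on its right-hand neighbour with a one-sided put. */
#include <shmem.h>
#include <stdio.h>

#define N 16

static long dest[N];   /* static global => symmetric across all PEs */

int main(void)
{
    long src[N];

    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < N; i++) src[i] = me;

    /* One-sided transfer of N longs to the symmetric buffer on PE (me+1) % npes. */
    shmem_long_put(dest, src, N, (me + 1) % npes);
    shmem_barrier_all();   /* ensure all puts have completed and are visible */

    printf("PE %d received data from PE %ld\n", me, dest[0]);

    shmem_finalize();
    return 0;
}
```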


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL

Akshay Venkatesh; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda

The energy consumed by modern supercomputing systems continues to grow at an alarming rate. The Message Passing Interface (MPI) has been the de facto programming model for parallel applications, and MPI libraries have been designed to achieve the best communication performance on modern architectures. However, the performance and energy trade-offs of these designs have not been studied. Hence, it is critical to understand the energy-consumption characteristics of MPI routines and the performance-energy trade-offs of the various protocols and designs that are used in MPI libraries. The first hurdle in achieving this objective is to design a framework that can be used to measure the energy consumption of various components during communication operations. The RAPL interface allows users to measure energy across various domains on the Intel Sandy Bridge processor in a low-overhead, non-intrusive manner. However, this interface has certain limitations and cannot be directly used to measure energy profiles of MPI operations in a fine-grained manner. In this paper, we propose a novel methodology to address these limitations. We propose a new shared-memory-window-based solution to accurately measure the aggregate energy consumed by all processes engaged in MPI operations. Using our proposed framework, we demonstrate the impact of various communication protocols and progress mechanisms on energy consumption. Our evaluations demonstrate that kernel-based solutions can potentially lead to lower energy consumption for intra-node communication operations. Further, our framework also reveals possible energy bottlenecks in scaling important collective operations, such as MPI_Allreduce. In addition, we use our proposed framework to study the energy-consumption characteristics of MPI calls in the NAS IS benchmark and infer that the choice of progress mechanism can lead to about 6% energy savings for the processors.
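
RAPL counters can be sampled around a communication region to attribute energy to it. A minimal sketch follows, reading the package-domain counter through the Linux powercap sysfs interface before and after an MPI_Allreduce; the sysfs path and domain index are assumptions that vary by system, and the paper's framework adds the shared-memory aggregation across processes that this sketch does not:

```c
/* Read an Intel RAPL package-energy counter (microjoules) via the Linux
 * powercap sysfs interface around an MPI_Allreduce. The path below is one
 * common layout and may differ, or require extra permissions, on a given node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long long read_energy_uj(void)
{
    const char *path = "/sys/class/powercap/intel-rapl:0/energy_uj";
    FILE *f = fopen(path, "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1) uj = -1;
        fclose(f);
    }
    return uj;   /* -1 if the counter is unavailable */
}

int main(int argc, char **argv)
{
    const int count = 1 << 20;
    double *in  = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < count; i++) in[i] = 1.0;

    long long e0 = read_energy_uj();
    MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    long long e1 = read_energy_uj();

    if (rank == 0 && e0 >= 0 && e1 >= 0)
        printf("package energy over Allreduce: %lld uJ (per socket; counter wraps periodically)\n",
               e1 - e0);

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```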


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on InfiniBand clusters

Akshay Venkatesh; Hari Subramoni; Khaled Hamidouche; Dhabaleswar K. Panda

Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB), is allowing streaming applications to scale rapidly. A frequently used operation in the execution of multi-node streaming applications is the broadcast operation, where data from a single source, typically a live data site, is transmitted to multiple sinks. Although high performance networks like IB offer novel features such as hardware-based multicast to speed up the broadcast operation, their benefits have been limited to host-based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access the memory of GPGPUs. This poses a significant performance bottleneck for high performance streaming applications that rely heavily on broadcast operations from GPU memories. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to/from GPU memory (bypassing host memory). Thus, it presents an attractive alternative for designing high performance broadcast operations for GPGPU-based streaming applications. In this work, we propose a novel method for fully utilizing GPUDirect RDMA and hardware multicast features in tandem to design a high performance broadcast operation for streaming applications. Experiments conducted with the proposed design show up to a 60% decrease in latency and a 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.
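
From the application's point of view, a design like this is exercised through ordinary broadcasts issued on GPU buffers. A sketch of a streaming loop in which the root repeatedly broadcasts a device-resident chunk, assuming a CUDA-aware MPI library (whether the multicast/GPUDirect RDMA path is taken underneath is a library property, not visible in the code):

```c
/* Streaming broadcast sketch: the root (the live data source) pushes
 * successive device-resident chunks to all sinks with MPI_Bcast. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const size_t chunk = 1 << 20;   /* 1 MByte per broadcast, illustrative */
    const int nchunks = 100;
    void *d_chunk = NULL;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);
    cudaMalloc(&d_chunk, chunk);

    for (int i = 0; i < nchunks; i++) {
        if (rank == 0)
            cudaMemset(d_chunk, i & 0xff, chunk);   /* stand-in for producing a frame */
        /* The root's GPU buffer is broadcast directly into every sink's GPU buffer. */
        MPI_Bcast(d_chunk, (int)chunk, MPI_BYTE, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("streamed %d chunks\n", nchunks);

    cudaFree(d_chunk);
    MPI_Finalize();
    return 0;
}
```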


High Performance Distributed Computing | 2014

MIC-Check: A distributed checkpointing framework for the Intel Many Integrated Cores architecture

Raghunath Rajachandrasekar; Sreeram Potluri; Akshay Venkatesh; Khaled Hamidouche; Md. Wasi-ur-Rahman; Dhabaleswar K. Panda

The advent of many-core architectures like Intel MIC is enabling the design of increasingly capable supercomputers within reasonable power budgets. Fault-tolerance is becoming more important with the increased number of components and the complexity in these heterogeneous clusters. Checkpoint-restart mechanisms have been traditionally used to enhance the dependability of applications, and to enable dynamic task rescheduling in the face of system failures. Naive checkpointing protocols, which are predominantly I/O-intensive, face severe performance bottlenecks on the Xeon Phi architecture due to several inherent and acquired limitations. Consequently, existing checkpointing frameworks are not capable of serving distributed MPI applications that leverage heterogeneous hardware architectures. This paper discusses the I/O limitations on the Xeon Phi system, and describes the architecture and design of a novel distributed checkpointing framework, namely MIC-Check, for HPC applications running on it.
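
At its simplest, a coordinated application-level checkpoint on such a system is each rank serializing its state to a file around a globally consistent point. The skeleton below shows only that pattern (the file naming and the "state" blob are illustrative); the paper's contribution lies in making the I/O path efficient on the Xeon Phi, which this sketch does not attempt to capture:

```c
/* Skeleton of a coordinated, per-rank application checkpoint: quiesce
 * communication, write local state, then agree that the checkpoint set is
 * complete. Filenames and the state contents are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int write_checkpoint(int rank, const void *state, size_t nbytes)
{
    char path[256];
    snprintf(path, sizeof(path), "ckpt_rank_%d.bin", rank);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(state, 1, nbytes, f);
    fclose(f);
    return written == nbytes ? 0 : -1;
}

int main(int argc, char **argv)
{
    const size_t nbytes = 1 << 20;          /* 1 MByte of "state" per rank */
    char *state = calloc(1, nbytes);
    int rank, ok, all_ok;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... application computes and fills `state` ... */

    MPI_Barrier(MPI_COMM_WORLD);            /* no application messages in flight */
    ok = (write_checkpoint(rank, state, nbytes) == 0);
    MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    if (rank == 0)
        printf("checkpoint %s\n", all_ok ? "complete" : "failed");

    free(state);
    MPI_Finalize();
    return 0;
}
```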

Collaboration


Dive into Akshay Venkatesh's collaborations.

Top Co-Authors

Bracy Elton

Dynamics Research Corporation

Jie Zhang

Ohio State University
