Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Devendar Bureddy is active.

Publication


Featured research published by Devendar Bureddy.


International Conference on Parallel Processing | 2013

Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs

Sreeram Potluri; Khaled Hamidouche; Akshay Venkatesh; Devendar Bureddy; Dhabaleswar K. Panda

GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement continues to be a critical bottleneck in harnessing the full potential of a GPU. Data in the GPU memory has to be moved into the host memory before it can be sent over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using techniques like pipelining. GPUDirect RDMA is a feature introduced in CUDA 5.0 that allows third-party devices like network adapters to directly access data in GPU device memory over the PCIe bus. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand clusters. In this paper, we evaluate the first version of GPUDirect RDMA for InfiniBand and propose designs in the MVAPICH2 MPI library to efficiently take advantage of this feature. We highlight the limitations posed by current-generation architectures in effectively using GPUDirect RDMA and address these issues through novel designs in MVAPICH2. To the best of our knowledge, this is the first work to demonstrate a solution for internode GPU-to-GPU MPI communication using GPUDirect RDMA. Results show that the proposed designs improve the latency of internode GPU-to-GPU communication using MPI_Send/MPI_Recv by 69% and 32% for 4Byte and 128KByte messages, respectively. The designs boost the uni-directional bandwidth achieved using 4KByte and 64KByte messages by 2x and 35%, respectively. We demonstrate the impact of the proposed designs using two end-applications: LBMGPU and AWP-ODC. They improve the communication times in these applications by up to 35% and 40%, respectively.
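
From the application's point of view, this kind of GPU-aware MPI support means CUDA device pointers can be passed directly to MPI calls. The following is a minimal illustrative sketch of that usage with a CUDA-aware MPI library such as MVAPICH2, not code from the paper; the message size and tag are arbitrary.

```c
/* Minimal sketch: inter-node GPU-to-GPU transfer with a CUDA-aware MPI
 * library (e.g., MVAPICH2 with GPUDirect RDMA support).  Device pointers
 * are passed straight to MPI; the library handles staging or RDMA.
 * Illustrative only -- message size and tag are arbitrary. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 128 * 1024;      /* 128 KByte message */
    void *d_buf;
    cudaMalloc(&d_buf, nbytes);            /* buffer lives in GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

With MVAPICH2, CUDA support of this kind is typically enabled at run time (for example via the MV2_USE_CUDA environment variable); the GPUDirect RDMA path additionally depends on the NVIDIA/Mellanox driver support described above.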


International Parallel and Distributed Processing Symposium | 2012

Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication

Sreeram Potluri; Hao Wang; Devendar Bureddy; Ashish Kumar Singh; Carlos Rosales; Dhabaleswar K. Panda

Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in and out of GPUs continues to be a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of the IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows 74% improvement in latency for 4MByte messages, compared to the existing Send/Receive based implementation. Our benchmark using Get and Passive synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of the Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.
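
The CUDA IPC mechanism these designs build on can be sketched as follows: one process exports a handle to its device allocation, and another process on the same node opens it for direct device-to-device copies. This is an illustrative sketch of the underlying CUDA API, not the MVAPICH2 implementation; the handle exchange over MPI and the buffer size are assumptions.

```c
/* Sketch of the CUDA IPC mechanism: rank 0 exports a handle to its device
 * allocation, rank 1 (on the same node) opens it and copies GPU-to-GPU
 * without staging through host memory.  Exchanging the handle with
 * MPI_Send/MPI_Recv is only for illustration. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 4 * 1024 * 1024;     /* 4 MByte buffer */
    void *d_buf;
    cudaMalloc(&d_buf, nbytes);

    if (rank == 0) {                           /* exporter */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                    /* importer on the same node */
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *d_peer;
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
        /* Direct device-to-device copy from rank 0's buffer. */
        cudaMemcpy(d_buf, d_peer, nbytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_peer);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```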


IEEE Transactions on Parallel and Distributed Systems | 2014

GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation

Hao Wang; Sreeram Potluri; Devendar Bureddy; Carlos Rosales; Dhabaleswar K. Panda

Designing high-performance and scalable applications on GPU clusters requires tackling several challenges. The key challenge is the separation of host memory and device memory, which requires programmers to use multiple programming models, such as CUDA and MPI, to operate on data in different memory spaces. This challenge becomes more difficult to tackle when non-contiguous data in multidimensional structures is used by real-world applications. These challenges limit programming productivity and application performance. We propose GPU-Aware MPI to support data communication from GPU to GPU using standard MPI. It unifies the separate memory spaces and avoids explicit CPU-GPU data movement and CPU/GPU buffer management. It supports all MPI datatypes on device memory with two algorithms: a GPU datatype vectorization algorithm and a vector-based GPU kernel data pack and unpack algorithm. A pipeline is designed to overlap the non-contiguous data packing and unpacking on GPUs, the data movement over the PCIe bus, and the RDMA data transfer on the network. We incorporate our design with the open-source MPI library MVAPICH2 and optimize a production application: the multiphase 3D LBM. Besides the increase in programming productivity, we observe up to 19.9 percent improvement in application-level performance on 64 GPUs of the Oakley supercomputer.
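
Because the datatype processing happens inside the library, the application continues to describe non-contiguous GPU-resident data with ordinary MPI derived datatypes. Below is a hedged sketch of that usage; the grid dimensions and column layout are assumptions, and the pack/unpack pipeline described above stays hidden inside the MPI library.

```c
/* Sketch: sending a non-contiguous column of a 2D array that resides in
 * GPU memory using a standard MPI derived datatype.  The GPU pack/unpack
 * kernels and pipelining described in the paper are internal to the MPI
 * library; the application only sees this interface. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nx = 1024, ny = 1024;            /* illustrative grid size */
    double *d_grid;
    cudaMalloc((void **)&d_grid, (size_t)nx * ny * sizeof(double));

    /* One column of the row-major ny-by-nx grid: ny blocks of 1 double,
     * stride nx doubles -- non-contiguous in device memory. */
    MPI_Datatype column;
    MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(d_grid, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_grid, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}
```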


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

MVAPICH-PRISM: A Proxy-Based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters

Sreeram Potluri; Devendar Bureddy; Khaled Hamidouche; Akshay Venkatesh; Krishna Chaitanya Kandalla; Hari Subramoni; Dhabaleswar K. Panda

Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of performance on a single chip while providing x86_64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low-latency path for internode communication. However, drawbacks in state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes. They improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Efficient Intra-node Communication on Intel-MIC Clusters

Sreeram Potluri; Akshay Venkatesh; Devendar Bureddy; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda

Accelerators and coprocessors have become a key component in modern supercomputing systems due to the superior performance per watt that they offer. Intel's Xeon Phi coprocessor packs up to 1 TFLOP of double-precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. This makes it an attractive choice for accelerating HPC applications. The Xeon Phi provides several channels for communication between MPI processes running on the coprocessor and the host. While supporting POSIX shared memory within the coprocessor, it exposes a low-level API called the Symmetric Communication Interface (SCIF) that gives direct control of the DMA engine to the user. SCIF can also be used for communication between the coprocessor and the host. The Xeon Phi also provides an implementation of the InfiniBand (IB) Verbs interface that enables a direct communication link with the InfiniBand adapter for communication between the coprocessor and the host. In this paper, we propose and evaluate design alternatives for efficient communication on a node with a Xeon Phi coprocessor. We incorporate our designs in the popular MVAPICH2 MPI library. We use shared memory, IB Verbs and SCIF to design a hybrid solution that improves the MPI communication latency from the Xeon Phi to the host by 70% for 4MByte messages, compared to an out-of-the-box version of MVAPICH2. Our solution delivers more than 6x improvement in peak uni-directional bandwidth from the Xeon Phi to the host and more than 3x improvement in bi-directional bandwidth. Through our designs, we are able to improve the performance of 16-process Gather, Alltoall and Allgather collective operations by 70%, 85% and 80%, respectively, for 4MB messages. We further evaluate our designs using application benchmarks and show improvements of up to 18% with a 3D Stencil kernel and up to 11.5% with the P3DFFT library.
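
Of the channels mentioned above, the POSIX shared-memory path is the simplest to illustrate. The sketch below shows a generic shared segment between two processes on the same card or host; it is not the MVAPICH2 channel implementation, and the segment name and size are made up.

```c
/* Sketch of a POSIX shared-memory channel for intra-coprocessor (or
 * intra-host) communication: one process creates and maps a named
 * segment, a peer process on the same card maps the same name and reads
 * the data without a copy through the kernel.  Generic illustration only;
 * segment name and size are arbitrary. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/mic_shm_demo"   /* hypothetical segment name */
#define SHM_SIZE (1 << 20)         /* 1 MiB */

int main(void)
{
    /* Writer side: create, size and map the segment, then fill it. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;
    ftruncate(fd, SHM_SIZE);
    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;
    strcpy(buf, "payload visible to the peer process");

    /* A reader would shm_open(SHM_NAME, O_RDWR, 0600), mmap it the same
     * way and see the data directly. */

    munmap(buf, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```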


International Parallel and Distributed Processing Symposium | 2013

Extending OpenSHMEM for GPU Computing

Sreeram Potluri; Devendar Bureddy; Hao Wang; Hari Subramoni; Dhabaleswar K. Panda

Graphics Processing Units (GPUs) are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. In order to maximize utilization, it is imperative that applications running on these clusters have low synchronization and communication overheads. Partitioned Global Address Space (PGAS) models provide an attractive approach for developing parallel scientific applications. Such models simplify programming through the abstraction of a shared memory address space, while their one-sided communication primitives allow for efficient implementation of applications with minimum synchronization. OpenSHMEM is a library-based programming model that is gaining popularity. However, the current OpenSHMEM standard does not support direct communication from GPU device buffers. It requires data to be copied to the host memory before OpenSHMEM calls can be made. Similarly, data has to be moved to the GPU explicitly by remote processes. This severely limits the programmability and performance of GPU applications. In this paper, we provide extensions to the OpenSHMEM model which allow communication calls to be made directly on GPU memory. The proposed extensions are interoperable with the two most popular GPU programming frameworks: CUDA and OpenCL. We present designs for an efficient OpenSHMEM runtime that transparently provides high-performance communication between GPUs in different inter-node and intra-node configurations. To the best of our knowledge, this is the first work that enables GPU-GPU communication using the OpenSHMEM model for both the CUDA and OpenCL computing frameworks. The proposed extensions to OpenSHMEM, coupled with the high-performance runtime, improve the latency of the GPU-GPU shmem_getmem operation by 90%, 40% and 17% for intra-IOH (I/O Hub), inter-IOH and inter-node configurations, respectively. They improve the performance of OpenSHMEM atomics by up to 55% and 52% for intra-IOH and inter-node GPU configurations, respectively. The proposed enhancements improve the performance of the Stencil2D kernel by 65% on a cluster of 192 GPUs and the performance of the BFS kernel by 12% on a cluster of 96 GPUs.
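
The abstract does not spell out the extended API, so the sketch below uses standard OpenSHMEM calls on host buffers to show the one-sided shmem_getmem pattern that the proposed extensions carry over to GPU device memory; the buffer size and neighbour exchange are assumptions.

```c
/* Sketch of the one-sided OpenSHMEM pattern the extensions target:
 * shmem_getmem pulls data from a symmetric buffer on a remote PE without
 * the remote side posting a matching call.  Standard OpenSHMEM on host
 * buffers is shown; the paper's extensions allow the symmetric buffer to
 * live in GPU memory instead.  Sizes are illustrative. */
#include <shmem.h>
#include <stdlib.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    const size_t nbytes = 1 << 20;                 /* 1 MiB */
    char *sym = shmem_malloc(nbytes);              /* symmetric allocation */
    char *local = malloc(nbytes);

    shmem_barrier_all();                           /* sym is ready everywhere */

    /* Every PE pulls the buffer of its right neighbour, one-sided. */
    int peer = (me + 1) % npes;
    shmem_getmem(local, sym, nbytes, peer);

    shmem_barrier_all();
    free(local);
    shmem_free(sym);
    shmem_finalize();
    return 0;
}
```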


EuroMPI'12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

OMB-GPU: A Micro-Benchmark Suite for Evaluating MPI Libraries on GPU Clusters

Devendar Bureddy; Hao Wang; Akshay Venkatesh; Sreeram Potluri; Dhabaleswar K. Panda

General-Purpose Graphics Processing Units (GPGPUs) are becoming a common component of modern supercomputing systems. Many MPI applications are being modified to take advantage of the superior compute potential offered by GPUs. To facilitate this process, many MPI libraries are being extended to support MPI communication from GPU device memory. However, there is a lack of a standardized benchmark suite that helps users evaluate common communication models on GPU clusters and make a fair comparison between different MPI libraries. In this paper, we extend the widely used OSU Micro-Benchmarks (OMB) suite with benchmarks that evaluate the performance of point-to-point, multi-pair and collective MPI communication for different GPU cluster configurations. The benefits of the proposed benchmarks for the MVAPICH2 and OpenMPI libraries are illustrated.
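
The core measurement in such a suite is an osu_latency-style ping-pong, here sketched for device-resident ("D D") buffers. This is not the OMB-GPU code itself; the iteration count and message size are arbitrary, and warm-up iterations are omitted for brevity.

```c
/* Rough sketch of an osu_latency-style ping-pong on GPU device buffers,
 * the kind of measurement OMB-GPU standardizes.  Not the actual benchmark
 * code; iteration count and message size are arbitrary, and warm-up /
 * skip iterations are omitted. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t size = 4;                  /* 4-byte messages */
    const int iters = 1000;
    void *d_buf;
    cudaMalloc(&d_buf, size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(d_buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency in microseconds */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```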


High Performance Interconnects | 2013

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Krishna Chaitanya Kandalla; Akshay Venkatesh; Khaled Hamidouche; Sreeram Potluri; Devendar Bureddy; Dhabaleswar K. Panda

The emergence of co-processors such as Intel Many Integrated Cores (MICs) is changing the landscape of supercomputing. The MIC is a memory-constrained environment and its processors also operate at slower clock rates. Furthermore, the communication characteristics between MIC processes are different compared to communication between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvements in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
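
The abstract does not give the algorithms, but collectives for such systems are commonly structured hierarchically: a local phase within each node (or coprocessor) and a leader phase across nodes. The sketch below shows that generic two-level structure for a broadcast using standard MPI-3 calls; it is not the MIC-specific design proposed in the paper, and it assumes the root is global rank 0.

```c
/* Generic sketch of a hierarchical broadcast: split the job into per-node
 * communicators, broadcast among node leaders, then have each leader
 * broadcast locally.  This only illustrates the two-level structure that
 * topology-aware collectives typically use; it is not the MIC-specific
 * algorithm from the paper. */
#include <mpi.h>

/* Assumes the broadcast root is global rank 0 of comm. */
void hierarchical_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Ranks sharing a node (shared memory) form node_comm. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Local rank 0 of every node joins the leader communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    /* Phase 1: inter-node broadcast among leaders. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Phase 2: intra-node broadcast from each leader. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = (rank == 0) ? 42 : 0;   /* payload lives at global rank 0 */
    hierarchical_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
    /* every rank now holds 42 */

    MPI_Finalize();
    return 0;
}
```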


2013 Extreme Scaling Workshop (XSW 2013) | 2013

MVAPICH2-MIC: A High Performance MPI Library for Xeon Phi Clusters with InfiniBand

Sreeram Potluri; Khaled Hamidouche; Devendar Bureddy; Dhabaleswar K. Panda

Intel's Xeon Phi coprocessor, based on the Many Integrated Core architecture, packs more than 1 TFLOP of performance on a single chip and offers x86 compatibility. While MPI libraries can run out-of-the-box on Xeon Phi coprocessors, it is critical to tune them for the new architecture and to redesign them using any new system-level features offered, in order to deliver performance. In this paper, we discuss the tuning and redesign of the MVAPICH2 MPI library for efficient intra-node and inter-node point-to-point communication on Xeon Phi clusters with InfiniBand. We evaluate the designs using micro-benchmarks and application kernels. The results show significant improvements in the performance of intra-MIC, intra-node and inter-node communication. For the inter-node MIC-MIC path, the latency of 4MByte messages is reduced by 65% and the bandwidth for the same message size is improved by 5 times. The designs show 50% and 16% improvement in the performance of a 3D Stencil communication kernel and the P3DFFT library on 32 and 8 nodes, respectively. We discuss the challenges involved in providing a further optimized MVAPICH2 MPI library for Xeon Phi clusters.


International Conference on Cluster Computing | 2013

Design of Network-Topology-Aware Scheduling Services for Large InfiniBand Clusters

Hari Subramoni; Devendar Bureddy; Krishna Chaitanya Kandalla; Karl W. Schulz; Bill Barth; Jonathan L. Perkins; Mark Daniel Arnold; Dhabaleswar K. Panda

The goal of any scheduler is to satisfy users' demands for computation and achieve good performance in overall system utilization by efficiently assigning jobs to resources. However, current state-of-the-art scheduling techniques do not intelligently balance node allocation based on the total bandwidth available between switches, which leads to oversubscription. Additionally, poor placement of processes can lead to network congestion and poor performance. In this paper, we explore the design of a network-topology-aware plugin for the SLURM job scheduler for modern InfiniBand-based clusters. We present designs to enhance the performance of applications with varying communication characteristics. Through our techniques, we are able to considerably reduce the amount of network contention observed during Alltoall / FFT operations. The results of our experimental evaluation indicate that our proposed technique is able to deliver up to a 9% improvement in the communication time of P3DFFT at 512 processes. We also see that our techniques are able to increase the performance of micro-benchmarks that rely on point-to-point operations by up to 40% across message sizes. Our techniques were also able to improve the throughput of a 512-core cluster by up to 8%.
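
For context, SLURM's tree topology plugin learns the switch hierarchy from a topology.conf file. A hedged configuration sketch follows; the switch and node names are invented and the two-level layout is only an example.

```
# slurm.conf: enable the tree topology plugin
TopologyPlugin=topology/tree

# topology.conf: hypothetical two-level fat-tree description
SwitchName=leaf1 Nodes=node[001-018]
SwitchName=leaf2 Nodes=node[019-036]
SwitchName=leaf3 Nodes=node[037-054]
SwitchName=spine Switches=leaf[1-3]
```

Given such a hierarchy, a topology-aware scheduler can prefer packing a job's nodes under as few leaf switches as possible, which is the kind of bandwidth-aware allocation the plugin described above builds on.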

Collaboration


Dive into Devendar Bureddy's collaborations.

Top Co-Authors

Carlos Rosales (University of Texas at Austin)
Bill Barth (University of Texas at Austin)