Publications


Featured research published by Sreeram Potluri.


Computer Science - Research and Development | 2011

MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters

Hao Wang; Sreeram Potluri; Miao Luo; Ashish Kumar Singh; Sayantan Sur; Dhabaleswar K. Panda

Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de facto parallel programming standard. Currently, data movement with CUDA and MPI libraries is not integrated and is not as efficient as it could be. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner. In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with one MPI interface that can communicate to and from GPUs. Data movement between the GPU and the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37% and 30% improvement for the Scatter, Gather and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs. We observe up to 45% improvement for Put and Get operations.
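
The programming model described above can be sketched in a few lines: with a CUDA-aware MPI library such as MVAPICH2-GPU, a pointer returned by cudaMalloc is passed directly to MPI calls and the library handles the staging and pipelining internally. The buffer size and tag below are arbitrary illustration values, and the sketch assumes a CUDA-aware build of the library.

```c
/* Sketch of GPU-to-GPU point-to-point communication with a CUDA-aware MPI
 * (e.g., MVAPICH2-GPU). Assumes the library accepts device pointers directly;
 * buffer size and tag are arbitrary illustration values. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;            /* 1M doubles */
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* Device pointer passed straight to MPI; no cudaMemcpy to host needed. */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without such support, the send side would need an explicit cudaMemcpy into a host buffer before MPI_Send, and the receive side a copy back to the device afterwards.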


international conference on parallel processing | 2013

Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs

Sreeram Potluri; Khaled Hamidouche; Akshay Venkatesh; Devendar Bureddy; Dhabaleswar K. Panda

GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement continues to be a critical bottleneck in harnessing the full potential of a GPU. Data in the GPU memory has to be moved into the host memory before it can be sent over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using techniques like pipelining. GPUDirect RDMA is a feature introduced in CUDA 5.0 that allows third-party devices like network adapters to directly access data in GPU device memory over the PCIe bus. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand clusters. In this paper, we evaluate the first version of GPUDirect RDMA for InfiniBand and propose designs in the MVAPICH2 MPI library to efficiently take advantage of this feature. We highlight the limitations posed by current generation architectures in effectively using GPUDirect RDMA and address these issues through novel designs in MVAPICH2. To the best of our knowledge, this is the first work to demonstrate a solution for internode GPU-to-GPU MPI communication using GPUDirect RDMA. Results show that the proposed designs improve the latency of internode GPU-to-GPU communication using MPI Send/MPI Recv by 69% and 32% for 4Byte and 128KByte messages, respectively. The designs boost the uni-directional bandwidth achieved using 4KByte and 64KByte messages by 2x and 35%, respectively. We demonstrate the impact of the proposed designs using two end applications: LBMGPU and AWP-ODC. They improve the communication times in these applications by up to 35% and 40%, respectively.
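
Latency results like the ones quoted here are typically gathered with an OSU-style ping-pong microbenchmark over device buffers. The sketch below is a minimal version of such a test, assuming a CUDA-aware MPI; whether GPUDirect RDMA is used is a build/runtime option of the library rather than something selected in application code, and the message size and iteration count are illustrative.

```c
/* Illustrative ping-pong latency loop over GPU device buffers between two
 * ranks on different nodes. Assumes a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg = 4;                   /* 4-byte messages, as in the paper */
    const int iters = 1000;
    char *d_buf;
    cudaMalloc((void **)&d_buf, msg);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(d_buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```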


international parallel and distributed processing symposium | 2012

Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication

Sreeram Potluri; Hao Wang; Devendar Bureddy; Ashish Kumar Singh; Carlos Rosales; Dhabaleswar K. Panda

Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in and out of GPUs remains a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of the IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and active synchronization shows a 74% improvement in latency for 4MByte messages, compared to the existing Send/Receive based implementation. Our benchmark using Get and passive synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of the Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node using CUDA IPC.
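
The mechanism these designs build on can be seen in isolation: CUDA IPC lets one process export a handle to a device allocation, and another process on the same node map that allocation into its own address space, after which an ordinary device-to-device copy (driven by the GPU DMA engine) moves the data without host staging. A minimal sketch, assuming exactly two ranks on one node, with error checking omitted:

```c
/* Sketch of the CUDA IPC mechanism underlying the intra-node designs: rank 0
 * exports a handle to its device allocation, rank 1 maps it and copies from it
 * with a device-to-device cudaMemcpy (no host staging). Two ranks assumed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t bytes = 4 << 20;            /* 4 MByte, as in the paper's results */
    float *d_src = NULL;

    if (rank == 0) {
        cudaMalloc((void **)&d_src, bytes);
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_src);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *d_peer;
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
        float *d_dst;
        cudaMalloc((void **)&d_dst, bytes);
        cudaMemcpy(d_dst, d_peer, bytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_peer);
        cudaFree(d_dst);
    }

    MPI_Barrier(MPI_COMM_WORLD);             /* peer mapping closed before owner frees */
    if (rank == 0)
        cudaFree(d_src);

    MPI_Finalize();
    return 0;
}
```

Inside an MPI library, the same handle exchange and mapping can be hidden behind ordinary Send/Recv or Put/Get calls on device buffers, which is the approach the paper takes.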


IEEE Transactions on Parallel and Distributed Systems | 2014

GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation

Hao Wang; Sreeram Potluri; Devendar Bureddy; Carlos Rosales; Dhabaleswar K. Panda

Designing high-performance and scalable applications on GPU clusters requires tackling several challenges. The key challenge is the separate host memory and device memory, which requires programmers to use multiple programming models, such as CUDA and MPI, to operate on data in different memory spaces. This challenge becomes more difficult to tackle when non-contiguous data in multidimensional structures is used by real-world applications. These challenges limit programming productivity and application performance. We propose GPU-Aware MPI to support data communication from GPU to GPU using standard MPI. It unifies the separate memory spaces, and avoids explicit CPU-GPU data movement and CPU/GPU buffer management. It supports all MPI datatypes on device memory with two algorithms: a GPU datatype vectorization algorithm and a vector-based GPU kernel data pack and unpack algorithm. A pipeline is designed to overlap the non-contiguous data packing and unpacking on GPUs, the data movement on the PCIe bus, and the RDMA data transfer on the network. We incorporate our design into the open-source MPI library MVAPICH2 and optimize a production application: the multiphase 3D LBM. Besides the increase in programming productivity, we observe up to 19.9 percent improvement in application-level performance on 64 GPUs of the Oakley supercomputer.
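
The datatype support can be illustrated with standard MPI derived datatypes over device memory. The sketch below sends one column of a row-major matrix that lives in GPU memory using MPI_Type_vector; it assumes a CUDA-aware MPI with GPU datatype processing such as the design described above, and the matrix size is an illustrative choice.

```c
/* Sending one column of a row-major N x N matrix resident in GPU memory,
 * described with MPI_Type_vector. With a GPU-aware datatype engine, packing
 * runs in a GPU kernel and is pipelined with the PCIe/RDMA transfer; the
 * application code stays ordinary MPI. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;                     /* illustrative matrix size */
    double *d_matrix;
    cudaMalloc((void **)&d_matrix, (size_t)N * N * sizeof(double));

    /* One column: N blocks of 1 double, stride of N doubles between blocks. */
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_matrix);
    MPI_Finalize();
    return 0;
}
```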


ieee international conference on high performance computing data and analytics | 2012

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes

Hari Subramoni; Sreeram Potluri; Krishna Chaitanya Kandalla; Bill Barth; Jérôme Vienne; Jeff Keasler; Karen Tomko; Karl W. Schulz; Adam Moody; Dhabaleswar K. Panda

Over the last decade, InfiniBand has become an increasingly popular interconnect for deploying modern supercomputing systems. However, there exists no detection service that can discover the underlying network topology in a scalable manner and expose this information to runtime libraries and users of high performance computing systems in a convenient way. In this paper, we design a novel and scalable method to detect the InfiniBand network topology using Neighbor Joining (NJ) techniques. To the best of our knowledge, this is the first instance where the neighbor joining algorithm has been applied to solve the problem of detecting InfiniBand network topology. We also design a network-topology-aware MPI library that takes advantage of the network topology service. The library places processes taking part in the MPI job in a network-topology-aware manner with the dual aim of increasing intra-node communication and reducing long-distance inter-node communication across the InfiniBand fabric.
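
At the application or runtime level, topology information of the kind this service exposes can be used to group ranks that share a leaf switch. The sketch below illustrates the idea with MPI_Comm_split; the switch identifier is a placeholder computed from the rank, standing in for whatever the topology service would actually report, so it should be read as an illustration of topology-aware grouping rather than the paper's placement scheme.

```c
/* Illustrative use of network-topology information: split MPI_COMM_WORLD so
 * that ranks under the same leaf switch share a communicator and keep heavy
 * traffic inside it. The switch_id computation is a placeholder; the real
 * value would come from the InfiniBand topology detection service. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder: pretend each leaf switch hosts 16 consecutive ranks. */
    int switch_id = rank / 16;

    MPI_Comm switch_comm;
    MPI_Comm_split(MPI_COMM_WORLD, switch_id, rank, &switch_comm);

    /* Collectives and neighbor exchanges issued on switch_comm stay under one
     * leaf switch, reducing long-distance traffic across the fabric. */

    MPI_Comm_free(&switch_comm);
    MPI_Finalize();
    return 0;
}
```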


ieee international conference on high performance computing data and analytics | 2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for intel MIC clusters

Sreeram Potluri; Devendar Bureddy; Khaled Hamidouche; Akshay Venkatesh; Krishna Chaitanya Kandalla; Hari Subramoni; Dhabaleswar K. Panda

Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of performance on a single chip while providing x86_64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low-latency path for internode communication. However, drawbacks in state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes. They improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
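
The proxy idea can be illustrated at the application level, independently of the actual MVAPICH-PRISM internals: a rank on the Xeon Phi, whose direct path to the HCA is bandwidth-limited, hands its message to a host-resident rank that forwards it over InfiniBand. The sketch below is only a conceptual illustration of this pattern using plain MPI; the real framework performs the forwarding transparently inside the library using SCIF and IB channels.

```c
/* Conceptual sketch of the proxy pattern; rank layout and message size are
 * illustrative, and MVAPICH-PRISM performs this forwarding inside the MPI
 * library rather than in application code.
 *   rank 0: process on the Xeon Phi (source)
 *   rank 1: proxy process on the same node's host CPU
 *   rank 2: destination process on a remote node            */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    char *buf = malloc(n);

    if (rank == 0) {
        /* MIC rank: send to the host proxy over the fast intra-node path. */
        MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Host proxy: receive from the MIC, then forward over InfiniBand. */
        MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_CHAR, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {
        MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```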


international supercomputing conference | 2013

Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models

Jithin Jose; Sreeram Potluri; Karen Tomko; Dhabaleswar K. Panda

MPI has been the de facto programming model for scientific parallel applications. However, it is hard to extract the maximum performance for irregular, data-driven applications using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. The lower overhead of one-sided communication and the global view of data in PGAS models have the potential to increase performance at scale. In this study, we take up the ‘Concurrent Search’ kernel of Graph500, a highly data-driven, irregular benchmark, and redesign it using both MPI and OpenSHMEM constructs. We also implement load balancing in Graph500. Our performance evaluations using MVAPICH2-X (Unified MPI+PGAS Communication Runtime over InfiniBand) indicate a 59% reduction in execution time for the hybrid design, compared to the best performing MPI-based design at 8,192 cores.
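
The one-sided OpenSHMEM constructs that the hybrid design relies on look roughly as follows. This is a generic OpenSHMEM sketch of a ring-style one-sided update, not the Graph500 kernel itself, and it uses the shmem_init/shmem_malloc entry points of current OpenSHMEM rather than the older interfaces of that era.

```c
/* Generic OpenSHMEM one-sided update: each PE deposits a value into its right
 * neighbor's symmetric buffer without the neighbor posting a matching receive.
 * This is the style of communication used for irregular, data-driven
 * exchanges; it is not the Graph500 benchmark code itself. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: same layout on every PE. */
    long *inbox = shmem_malloc(sizeof(long));
    *inbox = -1;
    shmem_barrier_all();

    int right = (me + 1) % npes;
    shmem_long_p(inbox, (long)me, right);    /* one-sided put, no recv on the target */

    shmem_barrier_all();                     /* ensure delivery before reading */
    printf("PE %d received %ld\n", me, *inbox);

    shmem_free(inbox);
    shmem_finalize();
    return 0;
}
```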


ieee/acm international symposium cluster, cloud and grid computing | 2013

Efficient Intra-node Communication on Intel-MIC Clusters

Sreeram Potluri; Akshay Venkatesh; Devendar Bureddy; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda

Accelerators and coprocessors have become a key component in modern supercomputing systems due to the superior performance per watt that they offer. Intel's Xeon Phi coprocessor packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. This makes it an attractive choice for accelerating HPC applications. The Xeon Phi provides several channels for communication between MPI processes running on the coprocessor and the host. While supporting POSIX shared memory within the coprocessor, it exposes a low-level API called the Symmetric Communication Interface (SCIF) that gives direct control of the DMA engine to the user. SCIF can also be used for communication between the coprocessor and the host. Xeon Phi also provides an implementation of the InfiniBand (IB) Verbs interface that enables a direct communication link with the InfiniBand adapter for communication between the coprocessor and the host. In this paper, we propose and evaluate design alternatives for efficient communication on a node with a Xeon Phi coprocessor. We incorporate our designs in the popular MVAPICH2 MPI library. We use shared memory, IB Verbs and SCIF to design a hybrid solution that improves the MPI communication latency from the Xeon Phi to the host by 70% for 4MByte messages, compared to an out-of-the-box version of MVAPICH2. Our solution delivers more than 6x improvement in peak uni-directional bandwidth from the Xeon Phi to the host and more than 3x improvement in bi-directional bandwidth. Through our designs, we are able to improve the performance of 16-process Gather, Alltoall and Allgather collective operations by 70%, 85% and 80%, respectively, for 4MB messages. We further evaluate our designs using application benchmarks and show improvements of up to 18% with a 3D Stencil kernel and up to 11.5% with the P3DFFT library.
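
Bandwidth figures such as these are commonly measured with a window-based microbenchmark: the sender posts a batch of non-blocking sends, waits for all of them, and the receiver returns a short acknowledgement per window. The sketch below, with illustrative window size, message size and iteration count, shows the pattern for two ranks (for example, one on the Xeon Phi and one on the host); it is not the exact benchmark used in the paper.

```c
/* OSU-style uni-directional bandwidth sketch between two ranks. Window size,
 * message size and iteration count are illustrative parameters only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64
#define ITERS  100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg = 4 << 20;                 /* 4 MByte messages */
    char *buf = malloc(msg);
    char ack;
    MPI_Request reqs[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, msg, MPI_CHAR, 1, w, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, msg, MPI_CHAR, 0, w, MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 999, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("bandwidth: %.2f MB/s\n",
               (double)msg * WINDOW * ITERS / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```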


international conference on supercomputing | 2010

Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application

Sreeram Potluri; Ping Lai; Karen Tomko; Sayantan Sur; Yifeng Cui; Mahidhar Tatineni; Karl W. Schulz; William L. Barth; Amitava Majumdar; Dhabaleswar K. Panda

AWM-Olsen is a widely used ground motion simulation code based on a parallel finite difference solution of the 3-D velocity-stress wave equation. This application runs on tens of thousands of cores and consumes several million CPU hours on the TeraGrid clusters every year. A significant portion of its run-time (37% in a 4,096-process run) is spent in MPI communication routines. Hence, it demands an optimized communication design coupled with a low-latency, high-bandwidth network and an efficient communication subsystem for good performance. In this paper, we analyze the performance bottlenecks of the application with regard to the time spent in MPI communication calls. We find that much of this time can be overlapped with computation using MPI non-blocking calls. We use both two-sided and MPI-2 one-sided communication semantics to redesign the communication in AWM-Olsen. We find that with our new design, using MPI-2 one-sided communication semantics, the entire application can be sped up by 12% at 4K processes and by 10% at 8K processes on a state-of-the-art InfiniBand cluster, Ranger at the Texas Advanced Computing Center (TACC).
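
The overlap the redesign exploits can be sketched with MPI-2 one-sided calls: expose a halo buffer through a window, issue the Put, compute on the interior while the transfer is in flight, and only then close the epoch. The code below is a generic halo-exchange sketch under those assumptions, not the AWM-Olsen implementation; buffer sizes and the ring-neighbor pattern are illustrative.

```c
/* Generic sketch of communication/computation overlap with MPI-2 one-sided
 * semantics: Put halo data into the neighbor's window, overlap with interior
 * computation, then complete the epoch with a fence. */
#include <mpi.h>
#include <stdlib.h>

static void compute_interior(double *grid, int n)
{
    (void)grid; (void)n;                         /* placeholder for the stencil update */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int halo = 1024;                       /* illustrative halo size */
    double *grid = calloc(4 * halo, sizeof(double));
    double *recv_halo = calloc(halo, sizeof(double));

    MPI_Win win;
    MPI_Win_create(recv_halo, halo * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int right = (rank + 1) % size;

    MPI_Win_fence(0, win);                       /* open the access epoch */
    MPI_Put(grid, halo, MPI_DOUBLE, right, 0, halo, MPI_DOUBLE, win);
    compute_interior(grid, 4 * halo);            /* overlap with the transfer */
    MPI_Win_fence(0, win);                       /* complete the epoch; halos usable */

    MPI_Win_free(&win);
    free(grid);
    free(recv_halo);
    MPI_Finalize();
    return 0;
}
```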


international parallel and distributed processing symposium | 2013

Extending OpenSHMEM for GPU Computing

Sreeram Potluri; Devendar Bureddy; Hao Wang; Hari Subramoni; Dhabaleswar K. Panda

Graphics Processing Units (GPUs) are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. In order to maximize utilization, it is imperative that applications running on these clusters have low synchronization and communication overheads. Partitioned Global Address Space (PGAS) models provide an attractive approach for developing parallel scientific applications. Such models simplify programming through the abstraction of a shared memory address space, while their one-sided communication primitives allow for efficient implementation of applications with minimum synchronization. OpenSHMEM is a library-based programming model that is gaining popularity. However, the current OpenSHMEM standard does not support direct communication from GPU device buffers. It requires data to be copied to the host memory before OpenSHMEM calls can be made. Similarly, data has to be moved to the GPU explicitly by remote processes. This severely limits the programmability and performance of GPU applications. In this paper, we provide extensions to the OpenSHMEM model which allow communication calls to be made directly on GPU memory. The proposed extensions are interoperable with the two most popular GPU programming frameworks: CUDA and OpenCL. We present designs for an efficient OpenSHMEM runtime which transparently provides high-performance communication between GPUs in different inter-node and intra-node configurations. To the best of our knowledge, this is the first work that enables GPU-GPU communication using the OpenSHMEM model for both the CUDA and OpenCL computing frameworks. The proposed extensions to OpenSHMEM, coupled with the high-performance runtime, improve the latency of the GPU-GPU shmem_getmem operation by 90%, 40% and 17% for intra-IOH (I/O Hub), inter-IOH and inter-node configurations, respectively. They improve the performance of OpenSHMEM atomics by up to 55% and 52% for intra-IOH and inter-node GPU configurations, respectively. The proposed enhancements improve the performance of the Stencil2D kernel by 65% on a cluster of 192 GPUs and the performance of the BFS kernel by 12% on a cluster of 96 GPUs.
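
As a point of reference for the limitation described above, the sketch below shows the staging that standard OpenSHMEM requires today: device data is copied to a host-resident symmetric buffer, exchanged with shmem_putmem, and copied back to the device. The proposed extensions remove the explicit cudaMemcpy steps by letting the communication calls operate on GPU buffers directly; the buffer size and ring exchange here are illustrative.

```c
/* Baseline workflow the paper identifies as the limitation: GPU data must be
 * staged through host-resident symmetric buffers before and after OpenSHMEM
 * calls. The proposed extensions eliminate the explicit cudaMemcpy steps. */
#include <shmem.h>
#include <cuda_runtime.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    const size_t bytes = 1 << 20;
    float *d_buf;                                 /* data produced by a GPU kernel */
    cudaMalloc((void **)&d_buf, bytes);

    /* Host-resident symmetric buffers required by the current standard. */
    float *h_send = shmem_malloc(bytes);
    float *h_recv = shmem_malloc(bytes);

    cudaMemcpy(h_send, d_buf, bytes, cudaMemcpyDeviceToHost);   /* stage out */
    shmem_putmem(h_recv, h_send, bytes, (me + 1) % npes);       /* ring exchange */
    shmem_barrier_all();
    cudaMemcpy(d_buf, h_recv, bytes, cudaMemcpyHostToDevice);   /* stage back in */

    shmem_free(h_send);
    shmem_free(h_recv);
    cudaFree(d_buf);
    shmem_finalize();
    return 0;
}
```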

Collaboration


Dive into Sreeram Potluri's collaboration.

Top Co-Authors

Karen Tomko

Ohio Supercomputer Center
