
Publications


Featured research published by Manjunath Gorentla Venkata.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2011

Cheetah: A Framework for Scalable Hierarchical Collective Operations

Richard L. Graham; Manjunath Gorentla Venkata; Joshua S. Ladd; Pavel Shamis; Ishai Rabinovitz; Vasily Filipov; Gilad Shainer

Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts, and a growing number of data-access mechanisms with varying characteristics are supported within a single system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy, reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations MPI_Barrier() and MPI_Bcast(), and run experiments using up to 49,152 processes on a Cray XT5 and a small InfiniBand-based cluster. At 49,152 processes our barrier implementation outperforms the optimized native implementation by 75%. 32-byte and one-megabyte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.
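
The two collectives evaluated here are standard MPI operations. As a rough illustration, a minimal MPI program that invokes them might look like the sketch below; a hierarchical implementation such as Cheetah is selected inside the MPI library at run time, so the application code itself does not change (the buffer size and root rank are illustrative).

```c
/* Minimal MPI program exercising the two collectives benchmarked above.
 * A hierarchical implementation such as Cheetah is chosen by the MPI
 * library at run time; nothing Cheetah-specific appears here. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;          /* one-megabyte broadcast payload */
    char *buf = malloc(count);
    if (rank == 0)
        memset(buf, 1, count);          /* root fills the payload */

    MPI_Barrier(MPI_COMM_WORLD);                        /* global synchronization */
    MPI_Bcast(buf, count, MPI_BYTE, 0, MPI_COMM_WORLD); /* rank 0 broadcasts */

    free(buf);
    MPI_Finalize();
    return 0;
}
```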


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications

Manjunath Gorentla Venkata; Richard L. Graham; Joshua S. Ladd; Pavel Shamis; Ishai Rabinovitz; Vasily Filipov; Gilad Shainer

This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully offloads collective operations and employs only user-supplied buffers. For a 64-rank communicator, the latency of the CORE-Direct based hierarchical algorithm is better than that of production-grade Message Passing Interface (MPI) implementations: 150% better than the default Open MPI algorithm and 115% better than the shared-memory-optimized MVAPICH implementation for a one-kilobyte (KB) message, and 48% and 64% better, respectively, for an eight-megabyte (MB) message. The flat-topology broadcast achieves 99.9% overlap in a polling-based communication-computation test and 95.1% overlap in a wait-based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.
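
For context, the communication-computation overlap pattern measured above can be expressed with the standard MPI-3 nonblocking broadcast, as in the hedged sketch below. The paper's CORE-Direct implementation predates MPI_Ibcast and uses Cheetah-internal interfaces, so this shows only the usage pattern, not the paper's API; do_independent_work is a placeholder.

```c
/* Overlap pattern for a nonblocking broadcast, written against the
 * standard MPI-3 MPI_Ibcast; the offloaded CORE-Direct implementation
 * described above sits inside the MPI library, not in application code. */
#include <mpi.h>

/* Placeholder for application computation overlapped with the broadcast. */
static void do_independent_work(void) { /* ... application kernel ... */ }

void overlapped_bcast(void *buf, int count, int root, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ibcast(buf, count, MPI_BYTE, root, comm, &req);

    int done = 0;
    while (!done) {
        do_independent_work();                     /* overlapped work */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* polling-based progress */
    }
    /* A wait-based variant would instead call
       MPI_Wait(&req, MPI_STATUS_IGNORE) after the compute phase. */
}
```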


High Performance Interconnects | 2015

UCX: An Open Source Framework for HPC Network APIs and Beyond

Pavel Shamis; Manjunath Gorentla Venkata; M. Graham Lopez; Matthew B. Baker; Oscar R. Hernandez; Yossi Itigin; Mike Dubman; Gilad Shainer; Richard L. Graham; Liran Liss; Yiftah Shahar; Sreeram Potluri; Davide Rossetti; Donald Becker; Duncan Poole; Christopher Lamb; Sameer Kumar; Craig B. Stunkel; George Bosilca; Aurelien Bouteiller

This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as the Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O-bound applications. To evaluate the design, we implement the APIs and protocols and measure the performance of overhead-critical network primitives fundamental to implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 μs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any publicly known network stack on this hardware.
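
As a rough illustration of the API surface, the sketch below brings up a UCP context and worker, the objects on which an MPI or OpenSHMEM runtime would build its endpoints. It is a minimal example written against the public UCP API as understood here, not code from the paper, and it omits endpoint creation and real error handling.

```c
/* Minimal UCP bring-up sketch: create a context requesting tag-matching
 * and RMA features, then a single-threaded worker that drives progress.
 * Illustrative only; endpoint setup and communication calls are omitted. */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA,
    };
    ucp_context_h context;
    ucs_status_t status = ucp_init(&params, NULL, &context);
    if (status != UCS_OK) {
        fprintf(stderr, "ucp_init failed: %s\n", ucs_status_string(status));
        return 1;
    }

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    ucp_worker_h worker;
    status = ucp_worker_create(context, &wparams, &worker);
    if (status != UCS_OK) {
        ucp_cleanup(context);
        return 1;
    }

    /* ... endpoints would be created here and transfers issued, with
       ucp_worker_progress(worker) driving completion ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```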


Proceedings of the First Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Tools (OpenSHMEM 2014), Volume 8356 | 2014

Designing a High Performance OpenSHMEM Implementation Using Universal Common Communication Substrate as a Communication Middleware

Pavel Shamis; Manjunath Gorentla Venkata; Stephen W. Poole; Aaron Welch; Tony Curtis

OpenSHMEM is an effort to standardize the well-known SHMEM parallel programming library. The project aims to produce an open-source and portable SHMEM API and is led by ORNL and UH. In this paper, we optimize the current OpenSHMEM reference implementation, based on GASNet, to achieve higher performance. To achieve the desired performance characteristics, we have redesigned an important component of the OpenSHMEM implementation, the network layer, to leverage UCCS, a low-level communication library designed for implementing parallel programming models. In particular, UCCS provides an interface and semantics, such as native atomic operations and remote memory operations, that better support PGAS programming models, including OpenSHMEM. Through the use of microbenchmarks, we evaluate this new OpenSHMEM implementation on various network metrics, including the latency of point-to-point and collective operations. Furthermore, we compare the performance of our OpenSHMEM implementation with the state-of-the-art SGI SHMEM. Our results show that the atomic operations of our OpenSHMEM implementation outperform SGI's SHMEM implementation by 3%. Its RMA operations outperform both SGI's SHMEM and the original OpenSHMEM reference implementation by as much as 18% and 12% for gets, and by as much as 83% and 53% for puts.
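
The operations being measured are ordinary OpenSHMEM calls; the sketch below exercises the put, get, and atomic primitives that such microbenchmarks target. It uses current OpenSHMEM API names (shmem_malloc, shmem_long_atomic_inc, shmem_long_atomic_fetch); the 2014-era reference implementation spells some of these differently (e.g., shmalloc, shmem_long_inc).

```c
/* Sketch of the OpenSHMEM primitives exercised by the microbenchmarks:
 * a remote put, a remote get, and an atomic increment on a symmetric
 * counter. API names follow the current OpenSHMEM specification. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int peer = (me + 1) % npes;

    long *data = shmem_malloc(sizeof(long));  /* symmetric data word      */
    long *ctr  = shmem_malloc(sizeof(long));  /* symmetric atomic counter */
    *data = -1;
    *ctr  = 0;
    shmem_barrier_all();

    long val = me;
    shmem_long_put(data, &val, 1, peer);      /* RMA put to the neighbor   */
    shmem_long_atomic_inc(ctr, peer);         /* remote atomic increment   */
    shmem_quiet();                            /* complete outstanding puts */
    shmem_barrier_all();

    long observed;
    shmem_long_get(&observed, data, 1, peer); /* RMA get from the neighbor */
    long count = shmem_long_atomic_fetch(ctr, me);

    printf("PE %d: peer %d holds %ld, local counter %ld\n", me, peer, observed, count);

    shmem_free(data);
    shmem_free(ctr);
    shmem_finalize();
    return 0;
}
```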


High Performance Interconnects | 2012

Performance Evaluation of Open MPI on Cray XE/XK Systems

Samuel K. Gutierrez; Nathan Hjelm; Manjunath Gorentla Venkata; Richard L. Graham

Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of Open MPI, however, lack support for the Cray XE6 and XK6 architectures, both of which use the Gemini System Interconnect. In this paper, we present extensions to natively support these architectures within Open MPI, describe and propose solutions for performance and scalability bottlenecks, and provide an extensive evaluation of our implementation, which is the first completely open-source MPI implementation for the Cray XE/XK system families demonstrated at 49,152 processes. Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to those of the vendor-supplied MPI. Micro-benchmark results show 1-byte and 1,024-byte message latencies of 1.20 μs and 4.13 μs, which are 10.00% and 39.71% better than the vendor-supplied MPI, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size. Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores, where it exhibited performance and scaling characteristics similar to those of the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, which is on par with the parallel efficiency achieved by the vendor-supplied MPI.
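
The point-to-point latency figures quoted above come from standard microbenchmarks; a minimal ping-pong sketch of the kind used to obtain such numbers is shown below (this is not the authors' benchmark code; the message size and iteration count are illustrative).

```c
/* Minimal ping-pong latency microbenchmark: ranks 0 and 1 bounce a
 * fixed-size message and report half the average round-trip time.
 * Illustrative sketch, not the benchmark used in the paper. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[1024];                     /* 1,024-byte message, as in the text */
    memset(buf, 0, sizeof(buf));
    int msg = (int)sizeof(buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}
```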


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems

Cong Xu; Manjunath Gorentla Venkata; Richard L. Graham; Yandong Wang; Zhuo Liu; Weikuan Yu

Scientific applications use collective communication operations in the Message Passing Interface (MPI) for global synchronization and data exchange. Alltoall and AlltoallV are two important collective operations used by MPI jobs to exchange messages among all MPI processes; AlltoallV is a generalization of Alltoall that supports messages of varying sizes. However, the existing MPI AlltoallV implementation has linear complexity, i.e., each process has to send messages to all other processes in the job. Such linear complexity can result in suboptimal scalability of MPI applications when they are deployed on millions of cores. To address this challenge, in this paper we introduce a new Scalable LOgarithmic AlltoallV algorithm, named SLOAV, for the MPI AlltoallV collective operation. SLOAV aims to achieve a global exchange of small messages of different sizes in a logarithmic number of rounds. Furthermore, given the prevalence of multicore systems with shared memory, we design a hierarchical AlltoallV algorithm based on SLOAV, referred to as SLOAVx, that leverages the advantages of shared memory. Compared to SLOAV, SLOAVx significantly reduces inter-node communication, improving overall system performance and mitigating the impact of message latency. We have implemented and embedded both algorithms in Open MPI. Our evaluation on large-scale computer systems shows that for an 8-byte, 1,024-process MPI AlltoallV operation, SLOAV reduces latency by as much as 86.4% compared to the state of the art, and SLOAVx further reduces message latency by up to 83.1% over SLOAV on multicore systems. In addition, experiments with the NAS Parallel Benchmarks (NPB) demonstrate that our algorithms are effective for real-world applications.
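
For reference, the AlltoallV interface that SLOAV and SLOAVx implement takes per-destination counts and displacements; a minimal call is sketched below. The algorithms themselves sit beneath this interface inside Open MPI, so nothing SLOAV-specific appears in application code (the message sizes here are illustrative).

```c
/* Minimal MPI_Alltoallv call: each process sends (destination rank + 1)
 * integers to every other process, illustrating the per-peer counts and
 * displacements that distinguish AlltoallV from Alltoall. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *scounts = malloc(size * sizeof(int));
    int *rcounts = malloc(size * sizeof(int));
    int *sdispls = malloc(size * sizeof(int));
    int *rdispls = malloc(size * sizeof(int));

    int stotal = 0, rtotal = 0;
    for (int p = 0; p < size; p++) {
        scounts[p] = p + 1;            /* send p+1 ints to process p      */
        rcounts[p] = rank + 1;         /* every peer sends us rank+1 ints */
        sdispls[p] = stotal;  stotal += scounts[p];
        rdispls[p] = rtotal;  rtotal += rcounts[p];
    }

    int *sendbuf = calloc(stotal, sizeof(int));
    int *recvbuf = calloc(rtotal, sizeof(int));

    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_INT,
                  recvbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    free(scounts); free(rcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```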


International Conference on Parallel Processing | 2012

Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

Manjunath Gorentla Venkata; Richard L. Graham; Joshua S. Ladd; Pavel Shamis

The all-to-all collective communication operation is used by many scientific applications and is one of the most time-consuming and challenging collective operations to optimize. Algorithms for all-to-all operations typically fall into two classes, logarithmic- and linear-scaling algorithms, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation as well as supporting truly asynchronous implementations of it. This paper presents several new enhancements to the small-data Bruck algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation; these include the RDMA, SR-RNR, and SR-RTR algorithms. In addition, nonblocking implementations of these collective operations are presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs best. For a 64-process, 128-byte-per-process all-to-all, the RDMA algorithm performs 27% better than the Bruck algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. The nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
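
As with the broadcast work above, the overlap measurement pattern can be illustrated with the standard MPI-3 nonblocking all-to-all. The sketch below is not the paper's code; the offloaded RDMA, SR-RNR, and SR-RTR algorithms live inside the MPI library, and compute_chunk is a placeholder.

```c
/* Overlap pattern for a nonblocking all-to-all using the standard MPI-3
 * interface; the offloaded algorithms described above progress in the HCA
 * while the host computes. */
#include <mpi.h>

/* Placeholder for application computation overlapped with the collective. */
static void compute_chunk(void) { /* ... application kernel ... */ }

void overlapped_alltoall(const void *sendbuf, void *recvbuf,
                         int bytes_per_proc, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ialltoall(sendbuf, bytes_per_proc, MPI_BYTE,
                  recvbuf, bytes_per_proc, MPI_BYTE, comm, &req);

    int done = 0;
    while (!done) {
        compute_chunk();                           /* overlapped work */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* host-side polling */
    }
}
```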


Workshop on OpenSHMEM and Related Technologies | 2016

OpenSHMEM-UCX: Evaluation of UCX for Implementing the OpenSHMEM Programming Model

Matthew B. Baker; Ferrol Aderholdt; Manjunath Gorentla Venkata; Pavel Shamis

The OpenSHMEM reference implementation was developed with the goal of providing an open-source, high-performing OpenSHMEM implementation. To achieve portability and performance across various networks, it uses GASNet and UCCS for network operations. Recently, new network layers have emerged that promise high performance, scalability, and portability for HPC applications. In this paper, we adapt the OpenSHMEM reference implementation to use the UCX framework for network operations and evaluate its performance and scalability on Cray XK systems to understand UCX's suitability for implementing the OpenSHMEM programming model. Further, we develop a benchmark called SHOMS for evaluating OpenSHMEM implementations. Our experimental results show that OpenSHMEM-UCX outperforms the vendor-supplied OpenSHMEM implementation in most cases on the Cray XK system, by up to 40% in message rate and by up to 70% for the execution of application kernels.
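
A message-rate loop of the kind an OpenSHMEM benchmark such as SHOMS measures can be sketched as follows; this is a generic illustration using standard OpenSHMEM calls and POSIX timing, not code from SHOMS, and the message size and iteration count are arbitrary.

```c
/* Generic small-message put-rate loop: issue many 8-byte puts to a peer,
 * complete them with shmem_quiet(), and report messages per second.
 * Illustrative sketch only, not the SHOMS benchmark. */
#include <shmem.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define MSG   8

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int peer = (me + npes / 2) % npes;   /* pair PEs across halves of the job */

    char *remote = shmem_malloc(MSG);    /* symmetric destination buffer */
    char local[MSG] = {0};

    shmem_barrier_all();
    double t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        shmem_putmem(remote, local, MSG, peer);  /* small put to the peer */
    shmem_quiet();                               /* complete outstanding puts */
    double t1 = now_sec();

    if (me == 0)
        printf("PE 0: %.2f million messages/s\n", ITERS / (t1 - t0) / 1e6);

    shmem_free(remote);
    shmem_finalize();
    return 0;
}
```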


Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies (OpenSHMEM 2015), Volume 9397 | 2015

Exploring OpenSHMEM Model to Program GPU-based Extreme-Scale Systems

Sreeram Potluri; Davide Rossetti; Donald Becker; Duncan Poole; Manjunath Gorentla Venkata; Oscar R. Hernandez; Pavel Shamis; M. Graham Lopez; Matthew B. Baker; Wendy Poole

Extreme-scale systems with compute accelerators such as Graphics Processing Units (GPUs) have become popular for executing scientific applications. These systems are typically programmed using MPI and CUDA (for NVIDIA-based GPUs). However, there are many drawbacks to the MPI+CUDA approach: the orchestration required between the compute and communication phases of application execution, and the constraint that communication can only be initiated from serial portions on the Central Processing Unit (CPU), lead to scaling bottlenecks. To address these drawbacks, we explore the viability of using OpenSHMEM for programming these systems. In this paper, we first make a case for supporting GPU-initiated communication and for the suitability of the OpenSHMEM programming model. Second, we present NVSHMEM, a prototype implementation of the proposed programming approach; port the Stencil and Transpose benchmarks, which are representative of many scientific applications, from the MPI+CUDA model to OpenSHMEM; and evaluate the design and implementation of NVSHMEM. Finally, we discuss the opportunities and challenges of using OpenSHMEM to program these systems and propose extensions to OpenSHMEM to achieve the full potential of this programming approach.


Proceedings of the First Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Tools (OpenSHMEM 2014), Volume 8356 | 2014

OpenSHMEM Extensions and a Vision for Its Future Direction

Stephen W. Poole; Pavel Shamis; Aaron Welch; Swaroop Pophale; Manjunath Gorentla Venkata; Oscar R. Hernandez; Gregory A. Koenig; Tony Curtis; Chung-Hsing Hsu

The Extreme Scale Systems Center (ESSC) at Oak Ridge National Laboratory (ORNL), together with the University of Houston, led the effort to standardize the SHMEM API with input from the vendor and user communities. In 2012, OpenSHMEM specification 1.0 was finalized and released to the OpenSHMEM community for comments. As we move to future HPC systems, there are several shortcomings in the current specification that we need to address to ensure scalability, higher degrees of concurrency, locality, thread safety, fault tolerance, parallel I/O capabilities, and more. In this paper we discuss an immediate set of extensions that we propose to the current API and our vision for a future API, OpenSHMEM Next-Generation (NG), that targets future exascale systems. We also explain our rationale for the proposed extensions and highlight lessons learned from other PGAS languages and communication libraries.

Collaboration


Dive into Manjunath Gorentla Venkata's collaborations.

Top Co-Authors

Pavel Shamis
Oak Ridge National Laboratory

Richard L. Graham
Oak Ridge National Laboratory

Neena Imam
Oak Ridge National Laboratory

Joshua S. Ladd
Oak Ridge National Laboratory

Matthew B. Baker
Oak Ridge National Laboratory

Ferrol Aderholdt
Tennessee Technological University

M. Graham Lopez
Oak Ridge National Laboratory

Stephen W. Poole
Oak Ridge National Laboratory

Swen Boehm
Oak Ridge National Laboratory