Amith R. Mamidala | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Amith R. Mamidala is active.

Explore More

Publication

Featured researches published by Amith R. Mamidala.

international parallel and distributed processing symposium | 2012

PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer

Sameer Kumar; Amith R. Mamidala; Daniel Faraj; Brian E. Smith; Michael Blocksome; Bob Cernohous; Douglas Miller; Jeffrey J. Parker; Joseph D. Ratterman; Philip Heidelberger; Dong Chen; Burkhard Steinmacher-Burrow

The Blue Gene/Q machine is the next generation in the line of IBM massively parallel supercomputers, designed to scale to 262144 nodes and sixteen million threads. With each BG/Q node having 68 hardware threads, hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, are ideal and will enable applications to achieve high throughput on BG/Q. With such unprecedented massive parallelism and scale, this paper is a groundbreaking effort to explore the design challenges for designing a communication library that can match and exploit such massive parallelism In particular, we present the Parallel Active Messaging Interface (PAMI) library as our BG/Q library solution to the many challenges that come with a machine at such scale. PAMI provides (1) novel techniques to partition the application communication overhead into many contexts that can be accelerated by communication threads, (2) client and context objects to support multiple and different programming paradigms, (3) lockless algorithms to speed up MPI message rate, and (4) novel techniques leveraging the new BG/Q architectural features such as the scalable atomic primitives implemented in the L2 cache, the highly parallel hardware messaging unit that supports both point-to-point and collective operations, and the collective hardware acceleration for operations such as broadcast, reduce, and all reduce. We experimented with PAMI on 2048 BG/Q nodes and the results show high messaging rates as well as low latencies and high throughputs for collective communication operations.

international conference on supercomputing | 2009

MPI collective communications on the blue gene/p supercomputer: algorithms and optimizations

Ahmad Faraj; Sameer Kumar; Brian E. Smith; Amith R. Mamidala; John A. Gunnels; Philip Heidelberger

The IBM Blue Gene/P (BG/P) system is a massively parallel supercomputer succeeding BG/L, and it comes with many machine design enhancements and new architectural features at the hardware and software levels. This paper presents techniques leveraging such features to deliver high performance MPI collective communication primitives. In particular, we exploit BG/P rich set of network hardware in exploring three classes of collective algorithms: global algorithms on global interrupt and collective networks for MPI COMM WORLD; rectangular algorithms for rectangular communicators on the torus network; and binomial algorithms for irregular communicators over the torus point-to-point network. We also utilize various forms of data movements including the direct memory access (DMA) engine, collective network, and shared memory, to implement synchronous and asynchronous algorithms of different objectives and performance characteristics. Our performance study on BG/P hardware with up to 16K nodes demonstrates the efficiency and scalability of the algorithms and optimizations.

high performance interconnects | 2009

MPI Collective Communications on The Blue Gene/P Supercomputer: Algorithms and Optimizations

Ahmad Faraj; Sameer Kumar; Brian E. Smith; Amith R. Mamidala; John A. Gunnels

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Optimizing MPI Collectives Using Efficient Intra-node Communication Techniques over the Blue Gene/P Supercomputer

Amith R. Mamidala; Daniel Faraj; Sameer Kumar; Douglas Miller; Michael Blocksome; Thomas Gooding; Philiph Heidelberger; Gabor Dozsa

The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Out of these, a 3D torus equipped with direct memory access (DMA) engine is the primary network. BG/P also features a collective network which supports hardware accelerated collective operations such as broadcast and all reduce. One of the operating modes on BG/P is the virtual node mode where the four cores can be active MPI tasks, performing inter-node and intra-node communication. This paper proposes software techniques to enhance MPI Collective communication primitives, MPI Bcast and MPI Allreduce in virtual node mode by using cache coherent memory subsystem as the communication method within the node. The paper describes techniques leveraging atomic operations to design concurrent data structures such as broadcast-FIFOs to enable efficient collectives. Such mechanisms are important as we expect the core counts to rise in the future and having such data structures makes programming easier and efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access the peers memory by specialized system calls. Apart from cutting down the copy costs, such techniques allow for seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using light weight synchronizing structures and atomic operations. Further, we demonstrate that shared address techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. When compared to current approaches on the 3D torus, our optimizations provide performance up to almost 3 folds for MPI Bcast and a 33% performance gain for MPI Allreduce(in virtual node mode). We also see improvements up to 44% for MPI Bcast using the collective tree network.

ieee international conference on high performance computing data and analytics | 2014

Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

Sameer Kumar; Amith R. Mamidala; Philip Heidelberger; Dong Chen; Daniel Faraj

The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms to optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collective networks. We achieve an 8 byte double-sum MPI_Allreduce latency of 10.25 ms on 1,572,864 MPI ranks. We accelerate summing of network packets with local buffers by the use of the Quad Processing SIMD unit in the BG/Q cores and executing the sums on multiple communication threads supported by the optimized communication libraries. The achieved net gain is a peak throughput of 6.3 GB/s for double-sum allreduce. We also achieve over 90% of network peak for MPI_Alltoall with 65,536 MPI ranks.

ieee international conference on high performance computing data and analytics | 2010

Architecture of the Component Collective Messaging Interface

Sameer Kumar; Ahmad Faraj; Amith R. Mamidala; Brian E. Smith; Gabor Dozsa; Bob Cernohous; John A. Gunnels; Douglas Miller; Joseph D. Ratterman; Philip Heidelberger

Different programming paradigms utilize a variety of collective communication operations, often with different semantics. We present the component collective messaging interface (CCMI) that can support asynchronous non-blocking collectives and is extensible to different programming paradigms and architectures. CCMI is designed with components written in the C++ programming language, allowing it to be reusable and extendible. Collective algorithms are embodied in topological schedules and executors that execute them. Portability across architectures is enabled by the multisend data movement component. CCMI includes a programming language adaptor used to implement different APIs with different semantics for different paradigms. We study the effectiveness of CCMI on 16K nodes of Blue Gene/P machine and evaluate its performance for the barrier, broadcast, and allreduce collective operations and several application benchmarks. We also present the performance of the barrier collective on the Abe Infiniband cluster.

Proceedings of the 23rd European MPI Users' Group Meeting on | 2016

Optimization of Message Passing Services on POWER8 InfiniBand Clusters

Sameer Kumar; Robert S. Blackmore; Sameh S. Sharkawi; K A Nysal Jan; Amith R. Mamidala; T. J. Chris Ward

We present scalability and performance enhancements to MPI libraries on POWER8 InfiniBand clusters. We explore optimizations in the Parallel Active Messaging Interface (PAMI) libraries. We bypass IB VERBS via low level inline calls resulting in low latencies and high message rates. MPI is enabled on POWER8 by extension of both MPICH and Open MPI to call PAMI libraries. The IBM POWER8 nodes have GPU accelerators to optimize floating throughput of the node. We explore optimized algorithms for GPU-to-GPU communication with minimal processor involvement. We achieve a peak MPI message rate of 186 million messages per second. We also present scalable performance in the QBOX and AMG applications.

ieee international conference on high performance computing data and analytics | 2012

Looking under the hood of the IBM blue gene/Q network

Dong Chen; Noel A. Eisley; Philip Heidelberger; Sameer Kumar; Amith R. Mamidala; Fabrizio Petrini; Robert M. Senger; Yutaka Sugawara; R. Walkup; Anamitra R. Choudhury; Yogish Sabharwal; Swati Singhal; Burkhard Steinmacher-Burow; Jeffrey J. Parker

Archive | 2010