
Publication


Featured research published by Gopalakrishnan Santhanaraman.


Cluster Computing and the Grid | 2006

Design of High Performance MVAPICH2: MPI2 over InfiniBand

Wei Huang; Gopalakrishnan Santhanaraman; Hyun-Wook Jin; Qi Gao; Dhabaleswar K. Panda

MPICH2 provides a layered architecture for implementing MPI-2. In this paper, we provide a new design for implementing MPI-2 over InfiniBand by extending the MPICH2 ADI3 layer. Our new design aims to achieve high performance by providing a multi-communication method framework that can utilize appropriate communication channels/devices to attain optimal performance without compromising on scalability and portability. We also present the performance comparison of the new design with our previous design based on the MPICH2 RDMA channel. We show significant performance improvements in micro-benchmarks and NAS Parallel Benchmarks.


International Parallel and Distributed Processing Symposium | 2009

Designing multi-leader-based Allgather algorithms for multi-core clusters

Krishna Chaitanya Kandalla; Hari Subramoni; Gopalakrishnan Santhanaraman; Matthew J. Koop; Dhabaleswar K. Panda

The increasing demand for computational cycles is being met by the use of multi-core processors. Having a large number of cores per node necessitates multi-core-aware designs to extract the best performance. The Message Passing Interface (MPI) is the dominant parallel programming model on modern high performance computing clusters. MPI collective operations take a significant portion of the communication time of an application. Existing optimizations for collectives exploit shared memory for intra-node communication to improve performance, but they still do not scale well as the number of cores per node increases. In this work, we propose a novel and scalable multi-leader-based hierarchical Allgather design. This design allows better cache sharing on Non-Uniform Memory Access (NUMA) machines and makes better use of the network speed available with high performance interconnects such as InfiniBand. The new multi-leader-based scheme achieves a performance improvement of up to 58% for small messages and 70% for medium-sized messages.
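
As a concrete illustration of the hierarchical idea, the minimal sketch below gathers data to a single per-node leader, performs an Allgather among the leaders, and broadcasts the result within each node. It is not the paper's implementation: it uses one leader per node rather than several, relies on the MPI-3 MPI_Comm_split_type call to discover node-local ranks, and assumes every node hosts the same number of consecutively numbered ranks.

```c
/* Hierarchical Allgather sketch: gather to a per-node leader, Allgather
 * among leaders, then broadcast within the node.  Compile with an
 * MPI-3 capable mpicc. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allgather(const int *sendbuf, int count, int *recvbuf,
                            MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Communicator of the ranks sharing a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Communicator containing one leader (node_rank == 0) per node. */
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank,
                   &leader_comm);

    /* Step 1: gather contributions from all ranks on the node to the leader. */
    int *node_buf = NULL;
    if (node_rank == 0)
        node_buf = malloc((size_t)node_size * count * sizeof(int));
    MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT, 0, node_comm);

    /* Step 2: leaders exchange the per-node blocks. */
    if (node_rank == 0)
        MPI_Allgather(node_buf, node_size * count, MPI_INT,
                      recvbuf, node_size * count, MPI_INT, leader_comm);

    /* Step 3: each leader broadcasts the full result within its node. */
    MPI_Bcast(recvbuf, size * count, MPI_INT, 0, node_comm);

    if (node_rank == 0) {
        free(node_buf);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
}
```

The multi-leader scheme in the paper generalizes this single-leader structure so that several leaders per node share the inter-node traffic.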


Cluster Computing and the Grid | 2009

Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand

Gopalakrishnan Santhanaraman; Pavan Balaji; Karthik Gopalakrishnan; Rajeev Thakur; William Gropp; Dhabaleswar K. Panda

As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, Global Arrays) have provided many programming constructs that enable implicit communication using one-sided communication operations. While MPI is the most widely used communication model for scientific computing, the usage of one-sided communication is restricted; this is mainly owing to the inefficiencies in current MPI implementations that internally rely on synchronization between processes even during one-sided communication, thus losing the potential of such constructs. In our previous work, we had utilized native one-sided communication primitives offered by high-speed networks such as InfiniBand (IB) to allow for true one-sided communication in MPI. In this paper, we extend this work to natively take advantage of one-sided atomic operations on cache-coherent multi-core/multi-processor architectures while still utilizing the benefits of networks such as IB. Specifically, we present a sophisticated hybrid design that uses locks that migrate between IB hardware atomics and multi-core CPU atomics to take advantage of both. We demonstrate the capability of our proposed design with a wide range of experiments illustrating its benefits in performance as well as its potential to avoid explicit synchronization.
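
For reference, the sketch below shows the kind of passive-target epoch (lock, accumulate, unlock) whose internal synchronization the paper offloads to InfiniBand hardware atomics and CPU atomics. Only standard MPI-2 calls appear here; the hybrid migrating-lock machinery lives inside the MPI library and is not visible at this level. Run with at least two ranks.

```c
/* Passive-target one-sided sketch: rank 0 atomically increments a counter
 * exposed by rank 1 without rank 1 making any matching call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long counter = 0;
    MPI_Win win;
    /* Every rank exposes one long; rank 1's copy is the shared counter. */
    MPI_Win_create(&counter, sizeof(long), sizeof(long), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {
        long one = 1;
        /* Passive-target epoch: the target (rank 1) is not involved. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Accumulate(&one, 1, MPI_LONG, 1, 0, 1, MPI_LONG, MPI_SUM, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        /* Read the local window copy inside its own lock/unlock epoch. */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("counter on rank 1 = %ld\n", counter);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```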


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008

Lock-Free Asynchronous Rendezvous Design for MPI Point-to-Point Communication

Rahul Kumar; Amith R. Mamidala; Matthew J. Koop; Gopalakrishnan Santhanaraman; Dhabaleswar K. Panda

The Message Passing Interface (MPI) is the most commonly used method for programming distributed-memory systems. Most MPI implementations use a rendezvous protocol for transmitting large messages. One of the features desired in an MPI implementation is the ability to asynchronously progress the rendezvous protocol, which gives applications the potential for good computation/communication overlap. Several designs proposed in previous work provide asynchronous progress; they typically use progress helper threads, with support from the network hardware, to make progress on the communication. However, most of these designs use locking to protect the shared data structures in the critical communication path, multiple interrupts may be necessary to make progress, and there is no mechanism to selectively ignore the events generated during communication. In this paper, we propose an enhanced asynchronous rendezvous protocol which overcomes these limitations. Specifically, our design does not require locks in the communication path. In our approach, the main application thread makes progress on the rendezvous transfer with the help of an additional thread, and the two threads communicate via system signals. The new design can achieve near total overlap of communication with computation without degrading the performance of non-overlapped communication. We have also experimented with different thread scheduling policies of the Linux kernel and found that the round-robin policy provides the best performance. With the new design we achieve a 20% reduction in time for a matrix multiplication kernel with the MPI+OpenMP paradigm on 256 cores.
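
A minimal sketch of the usage pattern the protocol targets: a large, rendezvous-sized message posted with nonblocking calls, independent computation, then a wait. How much of the transfer actually overlaps the compute loop depends on the library's asynchronous progress, which is what the helper-thread design improves; the message size and loop count here are arbitrary.

```c
/* Computation/communication overlap sketch.  Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* large enough to use rendezvous */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);
    MPI_Request req;

    if (rank == 0)
        MPI_Isend(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

    /* Independent computation that we hope overlaps with the transfer. */
    double acc = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        acc += (double)i * 1e-9;

    if (rank <= 1)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("checksum %.3f\n", acc);   /* keep the compute loop live */

    free(buf);
    MPI_Finalize();
    return 0;
}
```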


Lecture Notes in Computer Science | 2004

Zero-Copy MPI Derived Datatype Communication over InfiniBand

Gopalakrishnan Santhanaraman; Jiesheng Wu; Dhabaleswar K. Panda

This paper presents a new scheme, Send Gather Receive Scatter (SGRS), to perform zero-copy datatype communication over InfiniBand. This scheme leverages the gather/scatter feature provided by InfiniBand channel semantics. It takes advantage of the capability of processing non-contiguity on both send and receive sides in the Send Gather and Receive Scatter operations. In this paper, we describe the design, implementation and evaluation of this new scheme. Compared to the existing Multi-W zero-copy datatype scheme, the SGRS scheme can overcome the drawbacks of low network utilization and high startup costs. Our experimental results show significant improvement in both point-to-point and collective datatype communication. The latency of a vector datatype can be reduced by up to 62% and the bandwidth can be increased by up to 400%. The Alltoall collective benchmark shows a performance benefit of up to 23% reduction in latency.
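
For context, the sketch below builds the kind of non-contiguous derived datatype that schemes such as SGRS map onto InfiniBand send-gather/receive-scatter: a strided matrix column described with MPI_Type_vector. The matrix dimensions are arbitrary and only standard MPI calls are used. Run with at least two ranks.

```c
/* Non-contiguous transfer with a derived datatype: send one column of a
 * row-major matrix as a single message. */
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 6

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double a[ROWS][COLS];

    /* One element per row, separated by a stride of COLS elements. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = 10.0 * i + j;
        /* Send column 2 of the matrix as one non-contiguous message. */
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; i++)
            printf("a[%d][2] = %.1f\n", i, a[i][2]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```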


International Parallel and Distributed Processing Symposium | 2008

Designing passive synchronization for MPI-2 one-sided communication to maximize overlap

Gopalakrishnan Santhanaraman; Sundeep Narravula; Dhabaleswar K. Panda

Scientific computing has seen immense growth in recent years, and MPI has become the de facto parallel programming model for distributed memory systems. The MPI-2 standard expanded MPI to include one-sided communication. Overlapping computation and communication is an important goal for one-sided applications. While the passive synchronization mechanism of MPI-2 one-sided communication allows for good overlap, the overlap actually achieved is often limited by the design of both the MPI library and the application. In this paper we aim to improve the performance of MPI-2 one-sided communication. In particular, we focus on the following aspects: (i) designing one-sided passive synchronization (direct passive) support using InfiniBand atomic operations to handle both exclusive and shared locks; (ii) enhancing one-sided communication progress to provide scope for better overlap that one-sided applications can leverage; and (iii) studying the overlap potential of passive synchronization and its impact on applications. We demonstrate the benefits of our approaches on the MPI-2 SPLASH LU application benchmark, with results showing an improvement of up to 87% for a 64-process run over the existing design.
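
The sketch below illustrates the exclusive and shared passive-target locks this design supports, using only standard MPI-2 calls; it is not the paper's benchmark, and because the writer and readers are unordered, a reader may observe either the old or the new value. Independent computation placed between lock and unlock is where overlap can occur, to the extent the library makes asynchronous progress.

```c
/* Passive synchronization sketch: rank 0 updates its exposed value under an
 * exclusive lock; the other ranks read it under shared locks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = (rank == 0) ? 3.14 : 0.0;
    MPI_Win win;
    MPI_Win_create(&value, sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Writer: exclusive lock on its own exposed memory. */
        double newval = 2.71;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&newval, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(0, win);
    } else {
        /* Readers: shared lock, so they can access rank 0 concurrently. */
        double copy = 0.0;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(&copy, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
        /* ... independent computation could be placed here ... */
        MPI_Win_unlock(0, win);
        printf("rank %d read %.2f from rank 0\n", rank, copy);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```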


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

Supporting MPI-2 one sided communication on multi-rail InfiniBand clusters: design challenges and performance benefits

Abhinav Vishnu; Gopalakrishnan Santhanaraman; Wei Huang; Hyun-Wook Jin; Dhabaleswar K. Panda

In cluster computing, InfiniBand has emerged as a popular high performance interconnect, with MPI as the de facto programming model. However, even with InfiniBand, bandwidth can become a bottleneck for clusters executing communication-intensive applications. Multi-rail cluster configurations with MPI-1 have been proposed to alleviate this problem. Recently, MPI-2 with support for one-sided communication has been gaining significance. In this paper, we take on the challenge of designing high performance MPI-2 one-sided communication for multi-rail InfiniBand clusters. We propose a unified MPI-2 design for different multi-rail configurations (multiple ports, multiple HCAs, and combinations of both). We present various issues associated with one-sided communication, such as multiple synchronization messages, scheduling of RDMA (Read, Write) operations, and ordering relaxation, and discuss their implications for our design. Our performance results show that multi-rail networks can significantly improve MPI-2 one-sided communication performance. Using PCI-Express with two ports, we achieve a peak MPI_Put bidirectional bandwidth of 2620 MB/s, compared to 1910 MB/s for the single-rail implementation. For PCI-X with two HCAs, we can almost double the throughput and halve the latency for large messages.
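
A rough sketch of how an MPI_Put bandwidth figure of this kind is typically measured: a batch of puts timed across one fence epoch. The message size and iteration count are arbitrary, and the multi-rail striping happens inside the MPI library, so it is not visible in the code. Run with at least two ranks.

```c
/* MPI_Put bandwidth sketch: puts issued inside one fence epoch, timed at
 * the origin. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (1 << 20)   /* 1 MB per put */
#define ITERS     64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *src = malloc(MSG_BYTES);
    char *dst = malloc(MSG_BYTES);
    memset(src, 1, MSG_BYTES);
    MPI_Win win;
    MPI_Win_create(dst, MSG_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    double t0 = MPI_Wtime();
    if (rank == 0)
        for (int i = 0; i < ITERS; i++)
            MPI_Put(src, MSG_BYTES, MPI_CHAR, 1, 0, MSG_BYTES, MPI_CHAR, win);
    MPI_Win_fence(0, win);          /* completes all puts in the epoch */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Put bandwidth: %.1f MB/s\n",
               (double)ITERS * MSG_BYTES / (t1 - t0) / 1e6);

    MPI_Win_free(&win);
    free(src);
    free(dst);
    MPI_Finalize();
    return 0;
}
```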


International Conference on Parallel Processing | 2007

High Performance MPI over iWARP: Early Experiences

Sundeep Narravula; Amith R. Mamidala; Abhinav Vishnu; Gopalakrishnan Santhanaraman; Dhabaleswar K. Panda

Modern interconnects and the corresponding high performance MPI implementations have been feeding the surge in the popularity of compute clusters and cluster applications. Recently, with the introduction of the iWARP (Internet Wide Area RDMA Protocol) standard, RDMA and zero-copy data transfer capabilities have been introduced and standardized for Ethernet networks. While traditional Ethernet networks have largely been limited to kernel-based TCP/IP stacks and their inherent limitations, the iWARP capabilities of newer GigE and 10 GigE adapters break this barrier and expose the available performance potential. In order to enable applications to harness the performance benefits of iWARP, and to study the quantitative extent of such improvements, we present MPI-iWARP, a high performance MPI implementation over the OpenFabrics verbs. Our preliminary results with Chelsio T3B adapters show an improvement of up to 37% in bandwidth, 75% in latency, and 80% in MPI Allreduce as compared to MPICH2 over TCP/IP. To the best of our knowledge, this is the first design, implementation and evaluation of a high performance MPI over the iWARP standard.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2007

On using connection-oriented vs. connection-less transport for performance and scalability of collective and one-sided operations: trade-offs and impact

Amith R. Mamidala; Sundeep Narravula; Abhinav Vishnu; Gopalakrishnan Santhanaraman; Dhabaleswar K. Panda

The communication subsystem plays a pivotal role in achieving scalable performance in clusters. The communication semantics employed are dictated by the programming model used by the application, such as MPI or UPC. Among communication primitives, collective and one-sided operations are especially significant and have to be designed by harnessing the capabilities and features exposed by the underlying networks. In some cases, there is a direct match between the semantics of the operations and the underlying network primitives. InfiniBand provides two transport modes: (i) connection-oriented Reliable Connection (RC), supporting memory and channel semantics, and (ii) connection-less Unreliable Datagram (UD), supporting channel semantics. Achieving good performance and scalability requires careful analysis and design of communication primitives based on these options. In this paper, we evaluate the scalability and performance trade-offs between the RC and UD transport modes. We study the semantic advantages of mapping collective and one-sided operations onto the memory and channel semantics of InfiniBand (IBA). We take AlltoAll as a case study to demonstrate the benefits of RDMA over Send/Recv and to show the performance/memory trade-offs over the IB transports. Our experimental results show that the UD-based AlltoAll performs 38% better than Bruck's algorithm for short messages and up to two times better than the direct AlltoAll over RC. Since InfiniBand does not provide RDMA over UD in hardware, we emulate it in our study. Our results show a performance dip of up to a factor of three for emulated RDMA Read latency as compared to RC, highlighting the need for hardware-based RDMA operations over UD.
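
As a point of reference for the AlltoAll case study, the sketch below implements a direct (pairwise-exchange) AlltoAll over two-sided MPI_Sendrecv, i.e. the channel-semantics baseline; it is not the paper's RDMA or UD design, and it assumes the communicator size is a power of two so that the XOR pairing covers every peer exactly once.

```c
/* Direct pairwise-exchange AlltoAll over send/recv channel semantics. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void pairwise_alltoall(const char *sendbuf, char *recvbuf, int block,
                       MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Our own block is copied locally. */
    memcpy(recvbuf + (size_t)rank * block, sendbuf + (size_t)rank * block,
           block);

    /* In step i, exchange one block with partner rank ^ i
     * (size assumed to be a power of two for this pairing). */
    for (int i = 1; i < size; i++) {
        int peer = rank ^ i;
        MPI_Sendrecv(sendbuf + (size_t)peer * block, block, MPI_CHAR, peer, 0,
                     recvbuf + (size_t)peer * block, block, MPI_CHAR, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block = 1024;                 /* arbitrary block size */
    char *sendbuf = malloc((size_t)size * block);
    char *recvbuf = malloc((size_t)size * block);
    memset(sendbuf, rank, (size_t)size * block);

    pairwise_alltoall(sendbuf, recvbuf, block, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```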


International Parallel and Distributed Processing Symposium | 2005

Scheduling of MPI-2 one sided operations over InfiniBand

Wei Huang; Gopalakrishnan Santhanaraman; Hyun-Wook Jin; Dhabaleswar K. Panda

MPI-2 provides interfaces for one-sided communication, which is becoming increasingly important in scientific applications. MPI-2 semantics provide the flexibility to reorder one-sided operations within an access epoch. Based on this flexibility, in this paper we improve the performance of one-sided communication by scheduling one-sided operations. We have developed several re-ordering and aggregation schemes to achieve better network utilization and have evaluated them on both PCI-X and PCI-Express platforms. With the re-ordering scheme, we see improvements of up to 76% in throughput and up to 40% in latency. With the aggregation scheme, we observe improvements of 44% and 42% in MPI_Put and MPI_Get latency, respectively, on the PCI-Express platform.
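
The sketch below shows the semantic freedom these scheduling schemes exploit: several puts of mixed sizes issued in one fence epoch, with completion guaranteed only at the closing fence, so the MPI library may legally reorder or aggregate them internally. Buffer sizes and layout are arbitrary; run with at least two ranks.

```c
/* Access-epoch sketch: multiple MPI_Put operations between two fences. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { WIN_BYTES = 1 << 20 };
    char *local = malloc(WIN_BYTES);
    char *exposed = malloc(WIN_BYTES);
    MPI_Win win;
    MPI_Win_create(exposed, WIN_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int sizes[] = { 64, 512 * 1024, 1024, 128 * 1024 };  /* mixed sizes */

    MPI_Win_fence(0, win);                 /* open the access epoch   */
    if (rank == 0) {
        int offset = 0;
        for (int i = 0; i < 4; i++) {
            MPI_Put(local + offset, sizes[i], MPI_CHAR,
                    1, offset, sizes[i], MPI_CHAR, win);
            offset += sizes[i];
        }
    }
    MPI_Win_fence(0, win);                 /* all puts complete here  */

    MPI_Win_free(&win);
    free(local);
    free(exposed);
    MPI_Finalize();
    return 0;
}
```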

Collaboration


Dive into Gopalakrishnan Santhanaraman's collaborations.

Top Co-Authors

Wei Huang, Ohio State University
Abhinav Vishnu, Pacific Northwest National Laboratory
Jarek Nieplocha, Pacific Northwest National Laboratory
Vinod Tipparaju, Oak Ridge National Laboratory