Publications


Featured research published by Sayantan Sur.


International Conference on Parallel Processing | 2011

Memcached Design on High Performance RDMA Capable Interconnects

Jithin Jose; Hari Subramoni; Miao Luo; Minjia Zhang; Jian Huang; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hao Wang; Sayantan Sur; Dhabaleswar K. Panda

Memcached is a key-value distributed memory object caching system. It is widely used in data-center environments for caching the results of database calls, API calls, or other data. Using Memcached, spare memory in data-center servers can be aggregated to speed up lookups of frequently accessed information. The performance of Memcached is directly related to the underlying networking technology, as workloads are often latency sensitive. The existing Memcached implementation is built upon the BSD Sockets interface. Sockets offers byte-stream oriented semantics; therefore, using Sockets requires a conversion between Memcached's memory-object semantics and Sockets' byte-stream semantics, imposing an overhead. This is in addition to any extra memory copies in the Sockets implementation within the OS. Over the past decade, high performance interconnects have employed Remote Direct Memory Access (RDMA) technology to provide excellent performance for the scientific computing domain. In addition to its high raw performance, the memory-based semantics of RDMA fits very well with Memcached's memory-object model. While the Sockets interface can be ported to use RDMA, it is not very efficient when compared with low-level RDMA APIs. In this paper, we describe a novel design of Memcached for RDMA-capable networks. Our design extends the existing open-source Memcached software and makes it RDMA capable. We provide a detailed performance comparison of our Memcached design against unmodified Memcached using Sockets over RDMA and a 10 Gigabit Ethernet network with hardware-accelerated TCP/IP. Our performance evaluation reveals that the latency of a 4 KB Memcached Get can be brought down to 12 µs using ConnectX InfiniBand QDR adapters. The latency of the same operation using older-generation DDR adapters is about 20 µs. These numbers are about a factor of four better than the performance obtained using 10GigE with TCP Offload. In addition, these Get latencies over a range of message sizes are better by a factor of five to ten compared to IP over InfiniBand and Sockets Direct Protocol over InfiniBand. Further, the throughput of small Get operations can be improved by a factor of six when compared to Sockets over the 10 Gigabit Ethernet network. A similar factor-of-six improvement in throughput is observed over Sockets Direct Protocol using ConnectX QDR adapters. To the best of our knowledge, this is the first such Memcached design on high performance RDMA capable interconnects.
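The abstract contrasts Sockets' byte-stream model with RDMA's memory semantics. As a rough illustration only, the sketch below (not code from the paper) shows how a client could fetch a cached value with a one-sided RDMA Read through the ibverbs API, assuming the queue pair is already connected and the server's buffer address and rkey were advertised out of band; the helper name rdma_read_value and its parameters are hypothetical.

```c
/* Minimal sketch (not the paper's actual code): fetching a remote value with
 * a one-sided RDMA Read via the ibverbs API.  Assumes the queue pair is
 * already connected and that the server advertised the remote buffer
 * address and rkey out of band (e.g., during connection setup). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read_value(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the value lands locally */
        .length = len,
        .lkey   = local_mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read, no server CPU involvement */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;        /* server-side object address */
    wr.wr.rdma.rkey        = rkey;               /* remote memory key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```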


Computer Science - Research and Development | 2011

MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters

Hao Wang; Sreeram Potluri; Miao Luo; Ashish Kumar Singh; Sayantan Sur; Dhabaleswar K. Panda

Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de-facto parallel programming standard. Currently, data movement with CUDA and MPI libraries is not integrated and is not as efficient as it could be. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner. In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with one MPI interface that can communicate to and from GPUs. Data movement between the GPU and the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37% and 30% improvement for the Scatter, Gather and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs. We observe up to 45% improvement for Put and Get operations.
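As a hedged illustration of the interface change the paper describes, the sketch below contrasts manual host staging with passing a device pointer straight to MPI_Send. Whether direct device pointers are accepted depends on the MPI library and how it was built (an assumption here), and the helper send_gpu_buffer is hypothetical.

```c
/* Sketch: sending a GPU-resident buffer with and without a CUDA-aware MPI.
 * The "integrated" path assumes the MPI library (e.g., an MVAPICH2 build
 * with CUDA support) accepts device pointers directly; the staging path
 * works with any MPI but offers no copy/transfer overlap. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void send_gpu_buffer(void *d_buf, size_t bytes, int dest, MPI_Comm comm,
                     int mpi_is_cuda_aware)
{
    if (mpi_is_cuda_aware) {
        /* Device pointer handed straight to MPI; the library can pipeline
         * the device-to-host copy with the network transfer internally. */
        MPI_Send(d_buf, (int)bytes, MPI_BYTE, dest, 0, comm);
    } else {
        /* Manual staging: explicit cudaMemcpy, then MPI_Send from host memory. */
        void *h_buf = malloc(bytes);
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, (int)bytes, MPI_BYTE, dest, 0, comm);
        free(h_buf);
    }
}
```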


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

Sayantan Sur; Hyun-Wook Jin; Lei Chai; Dhabaleswar K. Panda

Message Passing Interface (MPI) is a popular parallel programming model for scientific applications. Most high-performance MPI implementations use a Rendezvous Protocol for efficient transfer of large messages. This protocol can be designed using either RDMA Write or RDMA Read; usually, it is implemented using RDMA Write. The RDMA Write based protocol requires a two-way handshake between the sending and receiving processes. At the same time, to achieve low latency, MPI implementations often provide a polling-based progress engine, and the two-way handshake requires the polling progress engine to discover multiple control messages. This in turn places a restriction on MPI applications: they must call into the MPI library to make progress. Compute- or I/O-intensive applications cannot do so, and most communication progress is made only after the computation or I/O is over. This severely hampers computation/communication overlap, which can have a detrimental impact on overall application performance. In this paper, we propose several mechanisms that exploit RDMA Read and selective interrupt-based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters. Our evaluations reveal that it is possible to achieve nearly complete computation/communication overlap using our RDMA Read with Interrupt based protocol. Additionally, our schemes yield around 50% better communication progress rate when computation is overlapped with communication. Further, our application evaluation with Linpack (HPL) and NAS-SP (Class C) reveals that MPI_Wait time is reduced by around 30% and 28%, respectively, for a 32-node InfiniBand cluster. We observe that the gains in MPI_Wait time increase as the system size increases, indicating that our designs have a strong positive impact on the scalability of parallel applications.
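The protocol flow described above can be summarized as follows; the message layouts are illustrative only and do not reflect MVAPICH's actual wire format.

```c
/* Illustrative control messages for the two rendezvous variants discussed
 * above (field names are hypothetical, not MVAPICH's wire format).
 *
 * RDMA Write based (two-way handshake):
 *   sender -> receiver : RTS (request to send)
 *   receiver -> sender : CTS (carries receive-buffer address + rkey)
 *   sender             : RDMA Write of the payload into the receive buffer
 *   sender -> receiver : FIN (data has arrived)
 *
 * RDMA Read based (one handshake message):
 *   sender -> receiver : RTS (carries send-buffer address + rkey)
 *   receiver           : RDMA Read of the payload from the send buffer
 *   receiver -> sender : FIN (send buffer may be reused)
 */
#include <stdint.h>

struct rndv_rts {        /* sent by the sender in both variants           */
    uint64_t src_addr;   /* used only by the RDMA Read variant            */
    uint32_t src_rkey;   /* used only by the RDMA Read variant            */
    uint32_t msg_len;
    int32_t  tag;
};

struct rndv_cts {        /* needed only by the RDMA Write variant         */
    uint64_t dst_addr;   /* receive-buffer address                        */
    uint32_t dst_rkey;
};
```

Because the receiver drives the transfer in the RDMA Read variant, the sender does not need to poll for a CTS, which is what allows progress to be made with a single interrupt instead of repeated library calls.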


International Conference on Parallel Processing | 2005

LiMIC: support for high-performance MPI intra-node communication on Linux cluster

Hyun-Wook Jin; Sayantan Sur; Lei Chai; Dhabaleswar K. Panda

High performance intra-node communication support for MPI applications is critical for achieving the best performance from clusters of SMP workstations. Present-day MPI stacks cannot make use of operating system kernel support for intra-node communication, primarily due to the lack of an efficient, portable, stable and MPI-friendly interface to access the kernel functions. In this paper, we address the design challenges of implementing such a high-performance and portable kernel module interface. We implement a kernel module interface called LiMIC and integrate it with MVAPICH, an open-source MPI over InfiniBand. Our performance evaluation reveals that point-to-point latency can be reduced by 71% and bandwidth improved by 405% for a 64 KB message size. In addition, LiMIC can improve the HPCC effective bandwidth and NAS IS Class B benchmarks by 12% and 8%, respectively, on an 8-node dual-SMP InfiniBand cluster.
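LiMIC's kernel module interface is not reproduced here. As a loosely related illustration of the same single-copy idea, the sketch below uses the later Linux syscall process_vm_readv to copy bytes directly from another process's address space, avoiding a shared-memory staging buffer; this is a different mechanism than LiMIC, the helper intra_node_recv is hypothetical, and the call requires appropriate ptrace permissions.

```c
/* Sketch of single-copy intra-node transfer using process_vm_readv(2).
 * This is NOT LiMIC's interface (LiMIC is a dedicated kernel module); it is
 * a more recent Linux syscall that illustrates the same idea: the kernel
 * copies bytes directly from the sender's address space into the receiver's,
 * without an intermediate shared-memory staging buffer. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>
#include <stddef.h>

/* Copy 'len' bytes from 'remote_addr' in process 'sender_pid' into 'local_buf'.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t intra_node_recv(pid_t sender_pid, void *remote_addr,
                        void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(sender_pid, &local, 1, &remote, 1, 0);
}
```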


Conference on High Performance Computing (Supercomputing) | 2006

High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis

Sayantan Sur; Matthew J. Koop; Dhabaleswar K. Panda

InfiniBand is an emerging HPC interconnect being deployed in very large scale clusters, with even larger InfiniBand-based clusters expected to be deployed in the near future. The Message Passing Interface (MPI) is the programming model of choice for scientific applications running on these large scale clusters. Thus, it is critical that the MPI implementation used be based on a scalable, high-performance design. We analyze the performance and scalability of MVAPICH, a popular open-source MPI implementation on InfiniBand, from an application standpoint. We analyze the performance and memory requirements of the MPI library while executing several well-known applications and benchmarks, such as NAS, SuperLU, NAMD, and HPL, on a 64-node InfiniBand cluster. Our analysis reveals that the latest design of MVAPICH requires an order of magnitude less internal MPI memory (average per process) while still delivering the best possible performance. Further, we observe that for the benchmarks and applications evaluated, the internal memory requirement of MVAPICH remains nearly constant at around 5-10 MB as the number of processes increases, indicating that the MVAPICH design is highly scalable.


International Conference on Supercomputing | 2007

High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters

Matthew J. Koop; Sayantan Sur; Qi Gao; Dhabaleswar K. Panda

High-performance clusters have been growing rapidly in scale. Most of these clusters deploy a high-speed interconnect, such as InfiniBand, to achieve higher performance, and most scientific applications executing on them use the Message Passing Interface (MPI) as the parallel programming model. Thus, the MPI library has a key role in achieving application performance by consuming as few resources as possible while enabling scalable performance. State-of-the-art MPI implementations over InfiniBand primarily use the Reliable Connection (RC) transport due to its good performance and attractive features. However, the RC transport requires a connection between every pair of communicating processes, with each connection requiring several KB of memory. As clusters continue to scale, memory requirements in RC-based implementations increase. The connection-less Unreliable Datagram (UD) transport is an attractive alternative, which eliminates the need to dedicate memory to each pair of processes. In this paper we present a high-performance UD-based MPI design. We implement our design and compare its performance and resource usage with the RC-based MVAPICH. We evaluate NPB, SMG2000, Sweep3D, and sPPM up to 4K processes on a 9216-core InfiniBand cluster. For SMG2000, our prototype shows a 60% speedup and a seven-fold reduction in memory for 4K processes. Additionally, based on our model, our design has an estimated 30-fold reduction in memory over MVAPICH at 16K processes when all connections are created. To the best of our knowledge, this is the first research work to present a high-performance MPI design over InfiniBand that is completely based on UD and can achieve near-identical or better application performance than RC.
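To make the memory argument concrete, here is a small back-of-the-envelope sketch; the per-connection cost is an assumed placeholder (the text only says "several KB"), not a figure from the paper.

```c
/* Back-of-the-envelope illustration of RC vs. UD memory scaling.
 * The per-connection cost below is an assumed placeholder; the exact value
 * depends on queue depths and library configuration. */
#include <stdio.h>

int main(void)
{
    const double kb_per_rc_connection = 8.0;   /* assumption, not from the paper */
    const int    procs[] = { 1024, 4096, 16384 };

    for (int i = 0; i < 3; i++) {
        int n = procs[i];
        /* RC: one connection per peer, so state grows linearly with job size. */
        double rc_mb_per_proc = (n - 1) * kb_per_rc_connection / 1024.0;
        printf("%6d processes: ~%.1f MB of RC connection state per process; "
               "UD needs a single connectionless endpoint.\n", n, rc_mb_per_proc);
    }
    return 0;
}
```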


International Parallel and Distributed Processing Symposium | 2006

Shared receive queue based scalable MPI design for InfiniBand clusters

Sayantan Sur; Lei Chai; Hyun-Wook Jin; Dhabaleswar K. Panda

Clusters of several thousand nodes interconnected with InfiniBand, an emerging high-performance interconnect, have already appeared in the Top 500 list. The next-generation InfiniBand clusters are expected to be even larger, with tens of thousands of nodes. A high-performance, scalable MPI design is crucial for MPI applications to exploit the massive potential for parallelism in these very large clusters. MVAPICH is a popular implementation of MPI over InfiniBand based on its reliable connection-oriented model. This model's requirement that communication buffers be available for each connection imposes a memory scalability problem. To mitigate this issue, the latest InfiniBand standard includes a new feature called the Shared Receive Queue (SRQ), which allows communication buffers to be shared across multiple connections. In this paper, we propose a novel MPI design that efficiently utilizes SRQs and provides very good performance. Our analytical model reveals that the proposed design requires only one-tenth of the memory of the original design on a cluster of 16,000 nodes. Performance evaluation on our 8-node cluster shows that the new design provides the same performance as the existing design while requiring much less memory. In comparison to tuned existing designs, our design showed a 20% and 5% improvement in the execution time of the NAS Benchmarks (Class A) LU and SP, respectively. High Performance Linpack was able to execute a much larger problem size using our new design, whereas the existing design ran out of memory.
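As a hedged sketch of the mechanism (not MVAPICH's code), the fragment below creates a shared receive queue with ibverbs and posts a pool of receive buffers to it once, rather than replicating buffers per connection; the helper setup_srq and its parameters are illustrative, and a protection domain and registered buffer pool are assumed to exist already.

```c
/* Sketch: creating a shared receive queue (SRQ) and posting receive buffers
 * to it.  Buffers posted to an SRQ are shared by every queue pair attached
 * to it, instead of being replicated per connection. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct ibv_srq *setup_srq(struct ibv_pd *pd, struct ibv_mr *mr,
                          char *pool, int nbufs, int buf_size)
{
    struct ibv_srq_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = nbufs;   /* outstanding receive work requests */
    attr.attr.max_sge = 1;

    struct ibv_srq *srq = ibv_create_srq(pd, &attr);
    if (!srq)
        return NULL;

    /* Post each buffer from the shared pool exactly once. */
    for (int i = 0; i < nbufs; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(pool + (size_t)i * buf_size),
            .length = (uint32_t)buf_size,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = i;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        if (ibv_post_srq_recv(srq, &wr, &bad_wr))
            return NULL;
    }
    return srq;
}
```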


Computer Science - Research and Development | 2011

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

Krishna Chaitanya Kandalla; Hari Subramoni; Karen Tomko; Dmitry Pekurovsky; Sayantan Sur; Dhabaleswar K. Panda

Three-dimensional FFT is an important component of many scientific computing applications, ranging from fluid dynamics to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package that uses the Message Passing Interface (MPI) programming model. The performance and scalability of parallel 3D FFT is limited by the time spent in Alltoall Personalized Exchange (MPI_Alltoall) operations, so hiding the latency of the MPI_Alltoall operation is critical to scaling P3DFFT. The newest revision of MPI, MPI-3, is widely expected to provide support for non-blocking collective communication to enable latency hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface; such an interface can be leveraged to design non-blocking collective operations. In this paper, we design a scalable, non-blocking Alltoall Personalized Exchange algorithm based on this network offload technology. To the best of our knowledge, this is the first paper to propose high performance non-blocking algorithms for dense collective operations by leveraging InfiniBand's network offload features. We also re-design the P3DFFT library and a sample application kernel to overlap the Alltoall operations with application-level computation. We are able to scale our implementation of the non-blocking Alltoall operation to more than 512 processes and achieve near perfect computation/communication overlap (99%). We also see an improvement of about 23% in the overall run-time of our modified P3DFFT compared to the default blocking version, and an improvement of about 17% compared to host-based non-blocking Alltoall schemes.
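The host-side overlap pattern that non-blocking collectives enable looks roughly like the sketch below, using the standard MPI-3 MPI_Ialltoall call; the ConnectX-2 offload machinery sits inside the MPI library and is not visible at this level. The function fft_transpose_step is hypothetical.

```c
/* Sketch of the overlap pattern enabled by non-blocking collectives: start
 * the all-to-all exchange, do independent computation, then wait.  Uses the
 * standard MPI-3 MPI_Ialltoall interface; any collective offload happens
 * inside the MPI library and is not shown here. */
#include <mpi.h>

void fft_transpose_step(const double *sendbuf, double *recvbuf,
                        int count_per_rank, MPI_Comm comm,
                        void (*independent_compute)(void))
{
    MPI_Request req;

    /* Initiate the personalized exchange; returns immediately. */
    MPI_Ialltoall(sendbuf, count_per_rank, MPI_DOUBLE,
                  recvbuf, count_per_rank, MPI_DOUBLE, comm, &req);

    /* Application-level computation that does not depend on recvbuf
     * proceeds while the exchange is in flight. */
    independent_compute();

    /* Block only when the exchanged data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```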


High Performance Interconnects | 2007

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Sayantan Sur; Matthew J. Koop; Lei Chai; Dhabaleswar K. Panda

InfiniBand is an emerging networking technology that is gaining rapid acceptance in the HPC domain. Currently, several systems in the Top500 list use InfiniBand as their primary interconnect, with more planned for the near future. The fundamental architecture of these systems is undergoing a sea change due to the advent of commodity multi-core computing. Due to the increase in the number of processes in each compute node, the network interface is expected to handle more communication traffic than in older dual or quad SMP systems. Thus, the network architecture should provide scalable performance as the number of processing cores increases. ConnectX is the fourth-generation InfiniBand adapter from Mellanox Technologies; its novel architecture enhances the scalability and performance of InfiniBand on multi-core clusters. In this paper, we carry out an in-depth performance analysis of the ConnectX architecture, comparing it with the third-generation InfiniHost III architecture on the Intel Bensley platform with dual Clovertown processors. Our analysis reveals that the aggregate bandwidth for small and medium sized messages can be increased by a factor of 10 compared to the third-generation InfiniHost III adapters. Similarly, RDMA-Write and RDMA-Read latencies for 1-byte messages can be reduced by a factor of 6 and 3, respectively, even when all cores are communicating simultaneously. Evaluation with the Halo communication kernel reveals a performance benefit of a factor of 2 to 5. Finally, the performance of LAMMPS, a molecular dynamics simulator, is improved by 10% for the in.rhodo benchmark.
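As a rough sketch of the "all cores communicating simultaneously" scenario (not the benchmark used in the paper), the program below pairs every rank on one node with a partner rank on the other node and streams fixed-size messages between them.

```c
/* Sketch of a multi-pair bandwidth test in the spirit of the evaluation:
 * every rank on one node streams messages to a partner rank on the other
 * node at the same time, stressing the adapter with all cores active.
 * Assumes an even number of ranks split evenly across two nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (16 * 1024)
#define ITERS    1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;                       /* ranks [0,half) on node A */
    int peer = (rank < half) ? rank + half : rank - half;
    char *buf = calloc(1, MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < half)
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("per-pair bandwidth: %.1f MB/s (aggregate grows with pairs)\n",
               (double)MSG_SIZE * ITERS / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```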


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

High performance RDMA based all-to-all broadcast for InfiniBand clusters

Sayantan Sur; Uday Bondhugula; Amith R. Mamidala; Hyun-Wook Jin; Dhabaleswar K. Panda

The All-to-all broadcast collective operation is essential for many parallel scientific applications. In the context of MPI, this collective operation is called MPI_Allgather. Contemporary MPI software stacks implement this collective on top of MPI point-to-point calls, leading to several performance overheads. In this paper, we propose a design of All-to-all broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high performance interconnect. Our RDMA-based design eliminates the overheads associated with existing designs. Our results indicate that the latency of the All-to-all broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design can improve the latency by a factor of 4.75 under no-buffer-reuse conditions for the same process count and message size. Further, our design improves the performance of a parallel matrix multiplication algorithm by 37% on eight processes while multiplying a 256x256 matrix.
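For readers unfamiliar with the collective, the minimal example below shows what MPI_Allgather computes at the application level; the RDMA-based design operates inside the MPI library, so the calling code is unchanged.

```c
/* Minimal example of the collective discussed above: MPI_Allgather, the MPI
 * form of All-to-all broadcast.  Each process contributes one block and ends
 * up with the blocks of all processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int contribution = rank * rank;            /* this process's block        */
    int *all = malloc(size * sizeof(int));     /* one block per process       */

    MPI_Allgather(&contribution, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("block from rank %d: %d\n", i, all[i]);

    free(all);
    MPI_Finalize();
    return 0;
}
```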

Collaboration


Dive into Sayantan Sur's collaborations.

Top Co-Authors

Lei Chai (Ohio State University)
Miao Luo (Ohio State University)
Karen Tomko (Ohio Supercomputer Center)