Sundeep Narravula
Ohio State University
Publications
Featured research published by Sundeep Narravula.
international symposium on performance analysis of systems and software | 2004
Pavan Balaji; Sundeep Narravula; Karthikeyan Vaidyanathan; Savitha Krishnamoorthy; Jiesheng Wu; Dhabaleswar K. Panda
The Sockets Direct Protocol (SDP) was recently proposed to enable sockets-based applications to take advantage of the enhanced features provided by the InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP using a detailed suite of micro-benchmarks. Next, we evaluate it in two real application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP provides up to 2.7 times better bandwidth than the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP achieves considerably higher performance (improvements of up to 2.4 times) than IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers in spite of currently being limited by a high connection setup time. However, this limitation is entirely implementation-specific; as InfiniBand software and hardware products mature, we expect it to be overcome. Based on this, we show that the projected performance of SDP, excluding the connection setup time, can outperform IPoIB for small message transfers as well.
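Because SDP slots in beneath the standard sockets API, the same micro-benchmark binary can be run over IPoIB or SDP and compared directly. Below is a minimal C sketch of such a bandwidth micro-benchmark (the message size, iteration count, and port are illustrative assumptions, and a matching receiver that drains the stream is assumed on the server side):

    /* Minimal sketch of a sockets bandwidth micro-benchmark; not the
     * paper's actual harness. SDP's sockets transparency means this
     * same code runs unmodified over IPoIB or SDP. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define MSG_SIZE (64 * 1024)   /* assumed message size */
    #define ITERS    1000

    int main(int argc, char **argv) {
        if (argc < 2) return 1;    /* usage: ./bw <server-ip> */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port = htons(5001) };
        inet_pton(AF_INET, argv[1], &srv.sin_addr);
        connect(fd, (struct sockaddr *)&srv, sizeof(srv));

        char *buf = malloc(MSG_SIZE);
        memset(buf, 'x', MSG_SIZE);

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++)
            send(fd, buf, MSG_SIZE, 0);   /* stream MSG_SIZE bytes per iteration */
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("bandwidth: %.2f MB/s\n", (double)MSG_SIZE * ITERS / secs / 1e6);
        close(fd);
        return 0;
    }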
knowledge discovery and data mining | 2003
Matthew Eric Otey; Srinivasan Parthasarathy; Amol Ghoting; G. Li; Sundeep Narravula; Dhabaleswar K. Panda
We present and evaluate a NIC-based network intrusion detection system. Intrusion detection at the NIC makes the system potentially tamper-proof and naturally extensible to a distributed setting. Simple anomaly-detection and signature-detection models are implemented on the NIC firmware, which has its own processor and memory. We empirically evaluate such systems from the perspective of quality and performance (bandwidth of acceptable messages) under varying conditions of host load. The preliminary results we obtain are very encouraging and lead us to believe that such NIC-based security schemes could well be a crucial part of next-generation network security systems.
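To make the NIC-side approach concrete, here is a minimal C sketch of the kind of signature-detection pass that could run on NIC firmware; the rule table and the nic_inspect() helper are purely illustrative assumptions, not the paper's actual firmware code:

    #include <string.h>
    #include <stddef.h>

    struct rule { const char *pattern; size_t len; };

    /* Illustrative signatures; a real rule set would be far larger. */
    static const struct rule rules[] = {
        { "/bin/sh", 7 },          /* example shellcode fragment */
        { "GET /etc/passwd", 15 }, /* example HTTP probe */
    };

    /* Return 1 (drop) if any signature occurs in the payload. */
    int nic_inspect(const unsigned char *payload, size_t len) {
        for (size_t r = 0; r < sizeof(rules) / sizeof(rules[0]); r++) {
            if (rules[r].len > len) continue;
            for (size_t i = 0; i + rules[r].len <= len; i++)
                if (memcmp(payload + i, rules[r].pattern, rules[r].len) == 0)
                    return 1;  /* signature hit: filter at the NIC */
        }
        return 0;  /* acceptable message: forward to the host */
    }

Filtering at this point keeps attack traffic off the host entirely, which is what makes the scheme both tamper-resistant and cheap in terms of host CPU.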
cluster computing and the grid | 2012
Jithin Jose; Hari Subramoni; Krishna Chaitanya Kandalla; Md. Wasi-ur-Rahman; Hao Wang; Sundeep Narravula; Dhabaleswar K. Panda
Memcached is a general-purpose key-value based distributed memory object caching system. It is widely used in the data-center domain for caching the results of database calls, API calls, or page rendering. An efficient Memcached design is critical to achieving high transaction throughput and scalability. Previous research in the field has shown that the use of high-performance interconnects like InfiniBand can dramatically improve the performance of Memcached. Reliable Connection (RC) is the most commonly used transport model in InfiniBand implementations. However, it has been shown that the RC transport imposes scalability issues due to its high memory consumption per connection. Such a characteristic is not favorable for middleware like Memcached, where the server is required to serve thousands of clients. The Unreliable Datagram (UD) transport offers higher scalability, but has several other limitations that need to be handled efficiently. In this context, we introduce a hybrid transport model that takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single transport. To the best of our knowledge, this is the first effort aimed at studying the impact of a hybrid of multiple transport protocols on Memcached performance. We present a comprehensive performance analysis using micro-benchmarks, application benchmarks, and realistic industry workloads. Our performance evaluations reveal that our hybrid transport delivers performance comparable to that of RC while maintaining a steady memory footprint. Memcached Get latency for 4-byte data is 4.28 μs and 4.86 μs for the RC and hybrid transports, respectively; this represents a factor-of-twelve improvement over the performance of SDP. In evaluations using the Apache Olio benchmark with 1,024 clients, Memcached execution times using the RC, UD, and hybrid transports are 1.61, 1.96, and 1.70 seconds, respectively. Further, our scalability analysis with 4,096 client connections reveals that the proposed hybrid transport achieves good memory scalability.
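One plausible shape of such a hybrid policy is sketched in C below: clients are served over scalable UD by default and are promoted to an RC connection only when their traffic justifies the per-connection memory cost. The thresholds and the pick_transport() helper are our illustrative assumptions, not the paper's implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RC_MSG_THRESHOLD  8192    /* assumed: large messages favor RC */
    #define RC_OPS_THRESHOLD  10000   /* assumed: hot clients justify RC memory */

    enum transport { TRANSPORT_UD, TRANSPORT_RC };

    struct client {
        uint64_t ops;     /* requests observed from this client */
        bool     has_rc;  /* RC queue pair already established? */
    };

    /* Decide which transport carries this client's next message. */
    enum transport pick_transport(struct client *c, size_t msg_len) {
        c->ops++;
        if (c->has_rc)
            return TRANSPORT_RC;   /* reuse the connection we already paid for */
        if (msg_len >= RC_MSG_THRESHOLD || c->ops >= RC_OPS_THRESHOLD) {
            c->has_rc = true;      /* lazily establish RC for this client */
            return TRANSPORT_RC;
        }
        return TRANSPORT_UD;       /* default: near-constant server memory */
    }

Under a policy like this, the thousands of mostly-idle clients stay on UD, so server memory grows slowly, while the hot clients get RC-class latency.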
cluster computing and the grid | 2007
Abhinav Vishnu; Matthew J. Koop; Adam Moody; Amith R. Mamidala; Sundeep Narravula; Dhabaleswar K. Panda
Large-scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP 500 supercomputer rankings. At the same time, the fat tree has become a popular interconnection topology for these clusters, since it makes multiple paths available between a pair of nodes. However, even with a fat tree, hot-spots may occur in the network depending on the route configuration between end nodes and the communication pattern(s) of the application. To make matters worse, the deterministic routing of InfiniBand prevents applications from transparently using multiple paths to avoid hot-spots in the network. Simulation-based studies of congestion control in switches and adapters have been proposed in the literature. However, these studies have focused on providing congestion control along the communication path, not on utilizing multiple network paths for hot-spot avoidance. In this paper, we design an MPI functionality that provides hot-spot avoidance for different communication patterns without a priori knowledge of them. We leverage the LMC (LID Mask Count) mechanism of InfiniBand to create multiple paths in the network and present the design issues (scheduling policies, number of paths, scalability aspects) of our approach. We implement our design and evaluate it with Pallas collective communication and MPI applications. On an InfiniBand cluster with 48 processes, MPI all-to-all personalized communication shows an improvement of 27%. Our evaluation with the NAS parallel benchmarks on 64 processes shows significant improvements in execution time with this functionality.
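The LMC mechanism itself is straightforward: with LMC set to m, a port answers to 2^m consecutive LIDs, each of which the subnet manager may route differently through the fabric. A minimal C sketch of deriving those destination LIDs and round-robin scheduling across them follows (the multipath struct and the round-robin policy are illustrative assumptions, not the paper's scheduler):

    #include <stdint.h>

    struct multipath {
        uint16_t base_lid;  /* destination port's base LID */
        unsigned lmc;       /* LID Mask Count set by the subnet manager */
        unsigned next;      /* round-robin cursor */
    };

    /* Return the destination LID to use for the next message. */
    uint16_t next_path(struct multipath *mp) {
        unsigned npaths = 1u << mp->lmc;   /* 2^LMC distinct paths */
        uint16_t lid = mp->base_lid + (uint16_t)(mp->next % npaths);
        mp->next++;                        /* simple round-robin policy */
        return lid;
    }

Spreading consecutive messages across these LIDs lets traffic traverse different switch links, which is the basic lever the hot-spot-avoidance design exploits.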
high performance computational finance | 2008
Hari Subramoni; Gregory Marsh; Sundeep Narravula; Ping Lai; Dhabaleswar K. Panda
Message-oriented middleware (MOM) is a key technology in financial market data delivery. In this context we study the Advanced Message Queuing Protocol (AMQP), an emerging open standard for MOM communication. We design a basic suite of benchmarks for AMQP's Direct, Fanout, and Topic exchange types. We then evaluate these benchmarks with Apache Qpid, an open-source implementation of AMQP. In order to observe how AMQP performs in a real-life scenario, we also perform evaluations with a simulated stock exchange application. All our evaluations are performed over InfiniBand as well as 1 Gigabit Ethernet networks. Our results indicate that in order to achieve the high scalability demanded by high-performance computational finance applications, we need to use modern communication protocols, like RDMA, which place less processing load on the host. We also find that the centralized architecture of AMQP presents a considerable bottleneck as far as scalability is concerned.
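The shape of such an exchange benchmark is simple: time a large number of publishes against a given exchange type and report the sustained message rate. Below is a hedged C sketch of that loop; amqp_connect() and amqp_publish() are hypothetical stand-ins (stubbed so the sketch compiles), not the Qpid client API, and the endpoint and message size are assumptions:

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/time.h>

    /* Hypothetical stand-ins for the MOM client library, stubbed so the
     * sketch compiles; a real run would call the actual client API. */
    static void *amqp_connect(const char *broker) { (void)broker; return (void *)1; }
    static void  amqp_publish(void *conn, const char *exchange, const char *key,
                              const void *body, size_t len) {
        (void)conn; (void)exchange; (void)key; (void)body; (void)len;
    }

    int main(void) {
        void *conn = amqp_connect("broker-host:5672");  /* assumed endpoint */
        char body[256] = { 0 };                         /* assumed message size */
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < 100000; i++)   /* Direct exchange: exact routing key */
            amqp_publish(conn, "amq.direct", "quotes.IBM", body, sizeof(body));
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("throughput: %.0f msgs/s\n", 100000.0 / secs);
        return 0;
    }

The Fanout and Topic variants differ only in the exchange name and routing-key semantics, so the same timing loop covers all three exchange types.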
cluster computing and the grid | 2006
Sundeep Narravula; Hyun-Wook Jin; Karthikeyan Vaidyanathan; Dhabaleswar K. Panda
Caching has been a very important technique in improving the performance and scalability of web-serving data-centers. The research community has proposed cooperation among caching servers to achieve higher performance. These existing cooperative caching mechanisms often partially duplicate the cached data on multiple servers for higher performance (by optimizing the data-fetch costs for multiple similar requests). With the advent of RDMA-enabled interconnects, these basic data-fetch cost estimates have changed significantly. Further, the effective utilization of the vast resources available across multiple tiers in today's data-centers is of obvious interest. Hence, a systematic study of the various issues involved is of paramount importance. In this paper, we present several cooperative caching schemes designed to benefit in light of the above-mentioned trends. In particular, we design schemes that take advantage of the RDMA capabilities of networks and the multitude of resources available in modern multi-tier data-centers. Our designs are implemented on InfiniBand-based clusters to work in conjunction with Apache-based servers. Our experimental results show that our schemes achieve a throughput improvement of up to 35% over basic cooperative caching schemes and 180% over simple single-node caching schemes. Our experimental results also lead us to a new scheme that delivers good performance in many scenarios.
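The trend that motivates these designs can be captured as a rough fetch-cost ordering: an RDMA read from a remote node's memory takes microseconds and no remote CPU, so it can undercut a local disk read by orders of magnitude, which argues for keeping a single copy across the tier rather than duplicating it everywhere. The C sketch below encodes that ordering with illustrative, unmeasured cost figures (our assumptions, not numbers from the paper):

    #include <stdio.h>

    enum source { LOCAL_MEM, REMOTE_MEM_RDMA, LOCAL_DISK, ORIGIN_SERVER, NSOURCES };

    /* Assumed per-fetch costs in microseconds for a mid-sized object;
     * illustrative only. */
    static const double cost_us[NSOURCES] = {
        [LOCAL_MEM]       = 1.0,
        [REMOTE_MEM_RDMA] = 20.0,    /* one RDMA read; no remote CPU involved */
        [LOCAL_DISK]      = 5000.0,
        [ORIGIN_SERVER]   = 50000.0,
    };

    /* Given which sources currently hold the object, pick the cheapest. */
    enum source cheapest(const int available[NSOURCES]) {
        enum source best = ORIGIN_SERVER;   /* origin is always available */
        for (int s = 0; s < NSOURCES; s++)
            if (available[s] && cost_us[s] < cost_us[best])
                best = s;
        return best;
    }

    int main(void) {
        /* Object cached only in a peer's memory: the RDMA read wins over disk. */
        const int avail[NSOURCES] = { 0, 1, 1, 1 };
        printf("fetch from source %d (cost %.0f us)\n",
               cheapest(avail), cost_us[cheapest(avail)]);
        return 0;
    }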
cluster computing and the grid | 2007
Sundeep Narravula; Amith R. Mamidala; Abhinav Vishnu; Karthikeyan Vaidyanathan; Dhabaleswar K. Panda
There has been a massive increase in the computing requirements of parallel applications. These parallel applications and their supporting cluster services often need to share system-wide resources, and their coordination is typically managed by a distributed lock manager. The performance of the lock manager is extremely critical for application performance. Researchers have shown that the use of two-sided communication protocols, like TCP/IP (used by current-generation lock managers), can significantly limit the scalability of distributed lock managers. In addition, existing one-sided communication-based locking designs support locking in exclusive access mode only, which can pose significant scalability limitations on applications that need both shared and exclusive access modes, like cooperative/file-system caching. Hence the utility of these existing designs in high-performance scenarios can be limited. In this paper, we present a novel protocol for distributed locking services that utilizes the advanced network-level one-sided atomic operations provided by InfiniBand. Our approach augments existing approaches by eliminating the need for two-sided communication protocols in the critical locking path. Further, we demonstrate that our approach provides significantly higher performance in scenarios needing both shared and exclusive mode access to resources. Our experimental results show a 39% improvement in basic locking latencies over traditional send/receive-based implementations. Further, we observe a significant improvement (up to 317% for 16 nodes) over existing RDMA-based distributed queuing schemes for shared-mode locking scenarios.
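One way such a protocol can work (our simplification for illustration, not necessarily the paper's exact wire format) is to keep the lock state in a single 64-bit word on the home node, with the two halves counting exclusive and shared requests, so that one network-level fetch-and-add both registers a request and reveals any prior holders. A C sketch, with remote_fetch_add() stubbed in place of the InfiniBand atomic verb:

    #include <stdint.h>
    #include <stdio.h>

    #define SHARED_ONE     ((uint64_t)1)          /* bump the shared count */
    #define EXCLUSIVE_ONE  ((uint64_t)1 << 32)    /* bump the exclusive count */

    /* Stub: a real implementation posts an InfiniBand fetch-and-add to the
     * lock word on the home node and returns the value seen before the add. */
    static uint64_t remote_fetch_add(uint64_t *lock_word, uint64_t add) {
        uint64_t prev = *lock_word;
        *lock_word += add;
        return prev;
    }

    /* Returns 1 if the shared lock is granted immediately; otherwise the
     * requester waits to be handed the lock on the exclusive holder's release. */
    int acquire_shared(uint64_t *lock_word) {
        uint64_t prev = remote_fetch_add(lock_word, SHARED_ONE);
        return (uint32_t)(prev >> 32) == 0;   /* no exclusive holder or waiter */
    }

    int main(void) {
        uint64_t lock = 0;
        printf("first shared request granted: %d\n", acquire_shared(&lock));
        remote_fetch_add(&lock, EXCLUSIVE_ONE);  /* an exclusive request arrives */
        printf("next shared request granted: %d\n", acquire_shared(&lock));
        return 0;
    }

The point of the design is visible even in this sketch: a shared-mode request costs one one-sided atomic in the common case, with no remote CPU in the critical locking path.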
international conference on parallel processing | 2009
Ping Lai; Hari Subramoni; Sundeep Narravula; Amith R. Mamidala; Dhabaleswar K. Panda
InfiniBand, 10 Gigabit Ethernet/iWARP, and InfiniBand WAN extensions are rapidly gaining momentum for designing high-end computing clusters and data-centers. For typical applications such as data staging, content replication, and remote site backup, FTP has been the most popular method of transferring data within and across these clusters. Although existing sockets-based FTP approaches can be used transparently in these systems through protocols like IPoIB or SDP, their performance and scalability are limited by additional interaction overhead and unoptimized protocol processing. This leads to the challenge of designing more efficient FTP mechanisms that leverage the advanced features of modern interconnects. In this paper we design a new Advanced Data Transfer Service (ADTS), with capabilities such as zero-copy data transfer, memory registration caching, persistent data sessions, and pipelined data transfer, to enable efficient zero-copy data transfers over IB- and iWARP-equipped LANs and WANs. We then utilize ADTS to design a high-performance FTP library (FTP-ADTS). From our experimental results, we observe that our design outperforms existing sockets-based approaches by more than 95% in transferring large volumes of data over a LAN. It also provides significantly better performance at much lower (by up to a factor of 6) CPU utilization in various IB WAN scenarios. These results demonstrate the promise of designing high-performance communication protocols to power the efficiency and scalability of next-generation parallel and distributed environments.
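The pipelining idea is the classic double-buffering pattern: while chunk i is on the wire via a zero-copy send from a pre-registered buffer, chunk i+1 is being read from disk. The C sketch below illustrates this; reg_cache_get(), file_read(), rdma_send_async(), and rdma_wait() are hypothetical helpers standing in for the ADTS internals, so this is a declarative sketch rather than linkable code:

    #include <stddef.h>

    #define CHUNK (1 << 20)   /* assumed 1 MB pipeline chunk */

    /* Hypothetical stand-ins for the registration cache and RDMA send path. */
    void  *reg_cache_get(size_t len);                 /* registered buffer, cached */
    size_t file_read(void *buf, size_t len);          /* next chunk from disk */
    void   rdma_send_async(const void *buf, size_t n);/* zero-copy post */
    void   rdma_wait(void);                           /* completion of prior post */

    void send_file(void) {
        void *buf[2] = { reg_cache_get(CHUNK), reg_cache_get(CHUNK) };
        size_t n = file_read(buf[0], CHUNK);
        for (int i = 0; n > 0; i++) {
            rdma_send_async(buf[i % 2], n);          /* send chunk i ... */
            n = file_read(buf[(i + 1) % 2], CHUNK);  /* ...while reading chunk i+1 */
            rdma_wait();                             /* chunk i is on the wire */
        }
    }

Reusing the two registered buffers through a cache is what keeps the expensive memory-registration cost out of the per-chunk path, which matters most on high-latency WAN links.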
international parallel and distributed processing symposium | 2008
Gopalakrishnan Santhanaraman; Sundeep Narravula; Dhabaleswar K. Panda
Scientific computing has seen immense growth in recent years, and MPI has become the de facto standard parallel programming model for distributed memory systems. The MPI-2 standard expanded MPI to include one-sided communication. Overlapping computation and communication is an important goal for one-sided applications. While the passive synchronization mechanism for MPI-2 one-sided communication allows for good overlap, the actual overlap achieved is often limited by the design of both the MPI library and the application. In this paper we aim to improve the performance of MPI-2 one-sided communication. In particular, we focus on the following important aspects: (i) designing one-sided passive synchronization (direct passive) support using InfiniBand atomic operations to handle both exclusive and shared locks, (ii) enhancing one-sided communication progress to provide scope for better overlap that one-sided applications can leverage, and (iii) studying the overlap potential of passive synchronization and its impact on applications. We demonstrate the possible benefits of our approaches on the MPI-2 SPLASH LU application benchmark. Our results show an improvement of up to 87% for a 64-process run over the existing design.
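For readers unfamiliar with the passive-target model being optimized here, the following self-contained MPI program (run with at least two ranks) shows its basic pattern: the origin locks the target's window, issues a put, and unlocks, all without the target making any matching synchronization call:

    /* Minimal MPI-2 passive-target one-sided example; mpicc hello_rma.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf = rank;
        MPI_Win win;
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            int val = 42;
            /* Passive target: rank 1 makes no matching synchronization call. */
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_unlock(1, win);   /* completes the put at the target */
        }

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1) printf("rank 1 buf = %d\n", buf);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

The lock/unlock pair is exactly the critical path the paper accelerates with InfiniBand atomics, and the window between them is where computation/communication overlap can be won or lost.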
international parallel and distributed processing symposium | 2007
Abhinav Vishnu; Amith R. Mamidala; Sundeep Narravula; Dhabaleswar K. Panda
The high computational power of commodity PCs, combined with the emergence of low-latency, high-bandwidth interconnects, has accelerated the trend toward cluster computing. Clusters with InfiniBand are being deployed, as reflected in the TOP 500 supercomputer rankings. However, the increasing scale of these clusters has reduced the mean time between failures (MTBF) of their components. The network is one such component, where failures of network interface cards (NICs), cables, and/or switches break existing communication paths. InfiniBand provides a hardware mechanism, Automatic Path Migration (APM), which allows user-transparent detection of and recovery from network faults without application restart. In this paper, we design a set of modules that work together to provide network fault tolerance for user-level applications by leveraging the APM feature. Our performance evaluation at the MPI layer shows that APM incurs negligible overhead in the absence of faults in the system. In the presence of network faults, APM also incurs negligible overhead for reasonably long-running applications.
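At the verbs level, arming APM amounts to transitioning the queue pair's path-migration state. The simplified C sketch below shows just the arming step; a complete implementation would first load the alternate path attributes (alt_ah_attr, alt_port_num, and related fields) with the IBV_QP_ALT_PATH mask, which is elided here:

    #include <infiniband/verbs.h>

    /* Ask the HCA to arm the previously loaded alternate path; once armed,
     * failover to that path happens in hardware, and software can later
     * load and re-arm a fresh alternate. Simplified sketch, not the
     * paper's full module set. */
    int arm_apm(struct ibv_qp *qp) {
        struct ibv_qp_attr attr = { 0 };
        attr.path_mig_state = IBV_MIG_REARM;
        return ibv_modify_qp(qp, &attr, IBV_QP_PATH_MIG_STATE);
    }

Because the migration itself is performed by the HCA, the software modules mainly need to detect that a migration occurred and re-arm with a new alternate path, which is why the observed overhead is negligible in the fault-free case.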