Ranjit Noronha
Ohio State University
Publication
Featured research published by Ranjit Noronha.
international conference on cluster computing | 2005
Shuang Liang; Ranjit Noronha; Dhabaleswar K. Panda
Traditionally, operations with memory on other nodes (remote memory) in cluster environments interconnected with technologies like Gigabit Ethernet have been expensive, with latencies several orders of magnitude higher than local memory accesses. Modern RDMA-capable networks such as InfiniBand and Quadrics provide low latency of a few microseconds and high bandwidth of up to 10 Gbps. This has significantly reduced the latency gap between access to local memory and remote memory in modern clusters. Remote idle memory can be exploited to reduce the memory pressure on individual nodes. This is akin to adding an additional level in the memory hierarchy between local memory and the disk, with potentially dramatic performance improvements, especially for memory-intensive applications. In this paper, we take on the challenge of designing a remote paging system for remote memory utilization in InfiniBand clusters. We present the design and implementation of a high performance networking block device (HPBD) over the InfiniBand fabric, which serves as a swap device of the kernel virtual memory (VM) system for efficient page transfer to/from remote memory servers. Our experiments show that using HPBD, quick sort runs only 1.45 times slower than with the local memory system, and up to 21 times faster than with local disk, and our design is completely transparent to user applications. To the best of our knowledge, this is the first remote pager design that uses InfiniBand for remote memory utilization.
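The page-transfer primitive such a design relies on can be illustrated in a few lines of verbs code. The sketch below is not the paper's kernel-level HPBD implementation; it is a minimal user-space illustration of moving a page to and from a remote memory server, assuming a connected queue pair, a registered memory region covering the page, and the remote buffer's address and rkey exchanged during connection setup (all names here are hypothetical).

```c
/* Hypothetical sketch: push a 4 KB page to a remote memory server with an
 * RDMA write, or fetch it back with an RDMA read (libibverbs).
 * Assumes a connected queue pair `qp`, a registered memory region `mr`
 * covering `page`, and the remote address/rkey exchanged out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

static int remote_page_io(struct ibv_qp *qp, struct ibv_mr *mr, void *page,
                          uint64_t remote_addr, uint32_t rkey, int write_out)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)page,
        .length = PAGE_SIZE,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)page;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    /* RDMA write = page-out to remote memory, RDMA read = page-in. */
    wr.opcode              = write_out ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The caller would then poll the completion queue before reusing the page buffer; completion handling is omitted here for brevity.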
international conference on cluster computing | 2005
Pavan Balaji; Wu-chun Feng; Qi Gao; Ranjit Noronha; Weikuan Yu; Dhabaleswar K. Panda
Despite the performance drawbacks of Ethernet, it still possesses a sizable footprint in cluster computing because of its low cost and backward compatibility with existing Ethernet infrastructure. In this paper, we demonstrate that these performance drawbacks can be reduced (and in some cases, arguably eliminated) by coupling TCP offload engines (TOEs) with 10-Gigabit Ethernet (10GigE). Although there exists significant research on individual network technologies such as 10GigE, InfiniBand (IBA), and Myrinet, to the best of our knowledge there has been no work that compares the capabilities and limitations of these technologies with the recently introduced 10GigE TOEs in a homogeneous experimental testbed. Therefore, we present performance evaluations across 10GigE, IBA, and Myrinet (with identical cluster-compute nodes) in order to enable a coherent comparison with respect to the sockets interface. Specifically, we evaluate the network technologies at two levels: (i) a detailed micro-benchmark evaluation and (ii) an application-level evaluation with sample applications from different domains, including a bio-medical image visualization tool known as the Virtual Microscope, an iso-surface oil reservoir simulator, a cluster file system known as the Parallel Virtual File System (PVFS), and a popular cluster management tool known as Ganglia. In addition to 10GigE's advantage with respect to compatibility with wide-area network infrastructures, e.g., in support of grids, our results show that 10GigE also delivers performance that is comparable to traditional high-speed network technologies such as IBA and Myrinet in a system-area network environment to support clusters, and that 10GigE is particularly well-suited for sockets-based applications.
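As a rough illustration of the sockets-level micro-benchmarks such a comparison rests on, the sketch below times a small-message ping-pong over an already connected socket, the same test being run over each interconnect's sockets interface. The function name, buffer limit and timing approach are illustrative, not taken from the paper.

```c
/* Hypothetical sketch of a sockets latency micro-benchmark: time `iters`
 * round trips of a `msg_size`-byte message over connected socket `fd`,
 * where the peer simply echoes every message back. */
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>

#define MAX_MSG 65536

static double pingpong_latency_us(int fd, size_t msg_size, int iters)
{
    static char buf[MAX_MSG];
    struct timeval start, end;

    if (msg_size > MAX_MSG)
        return -1.0;
    memset(buf, 0, msg_size);

    gettimeofday(&start, NULL);
    for (int i = 0; i < iters; i++) {
        if (send(fd, buf, msg_size, 0) != (ssize_t)msg_size)
            return -1.0;
        size_t got = 0;
        while (got < msg_size) {                  /* wait for the full echo */
            ssize_t n = recv(fd, buf + got, msg_size - got, 0);
            if (n <= 0)
                return -1.0;
            got += (size_t)n;
        }
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                  (end.tv_usec - start.tv_usec);
    return usec / (2.0 * iters);                  /* one-way latency estimate */
}
```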
international parallel and distributed processing symposium | 2006
Weikuan Yu; Ranjit Noronha; Shuang Liang; Dhabaleswar K. Panda
Cluster file systems and storage area networks (SAN) make use of network IO to achieve higher IO bandwidth. Effective integration of networking mechanisms is important to their performance. In this paper, we perform an evaluation of a popular cluster file system, Lustre, over two of the leading high speed cluster interconnects: InfiniBand and Quadrics. Our evaluation is performed with both sequential IO and parallel IO benchmarks in order to explore the capacity of Lustre under different communication characteristics. Experimental results show that direct implementations of Lustre over both interconnects improve its performance compared to IP emulation over InfiniBand (IPoIB). The performance of Lustre over Quadrics is comparable to that of Lustre over InfiniBand on the platforms we used. The latest InfiniBand products embrace new technologies, such as PCI-Express and DDR, and provide higher capacity. Our results show that over a Lustre file system with two object storage servers (OSSs), InfiniBand with PCI-Express technology can improve Lustre write performance by 24%. Furthermore, our experimental results indicate that Lustre metadata operations do not scale with an increasing number of OSSs, in spite of using high performance interconnects.
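A sequential IO benchmark of this kind essentially streams large blocks to a file on the mounted Lustre client and times the transfer. The sketch below is a minimal illustration, not the benchmark suite used in the paper; the mount path and block sizes are placeholders.

```c
/* Hypothetical sequential-write throughput test: write `count` blocks of
 * `block` bytes to `path` (a file on the Lustre mount) and report MB/s. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double seq_write_mbps(const char *path, size_t block, size_t count)
{
    char *buf = malloc(block);
    if (!buf) return -1.0;
    memset(buf, 'x', block);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { free(buf); return -1.0; }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (size_t i = 0; i < count; i++) {
        if (write(fd, buf, block) != (ssize_t)block) {
            close(fd); free(buf); return -1.0;
        }
    }
    fsync(fd);                       /* include flush-to-server time */
    gettimeofday(&end, NULL);

    close(fd);
    free(buf);
    double sec = (end.tv_sec - start.tv_sec) +
                 (end.tv_usec - start.tv_usec) / 1e6;
    return (double)(block * count) / (1024.0 * 1024.0) / sec;
}

int main(void)
{
    /* e.g. 1 GB written in 1 MB blocks to a file on the Lustre mount */
    printf("%.1f MB/s\n", seq_write_mbps("/mnt/lustre/testfile", 1 << 20, 1024));
    return 0;
}
```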
international conference on parallel processing | 2007
Ranjit Noronha; Lei Chai; Thomas Talpey; Dhabaleswar K. Panda
NFS has traditionally used TCP or UDP as the underlying transport. However, the overhead of these stacks has limited both the performance and the scalability of NFS. Recently, high-performance networks such as InfiniBand have been deployed. These networks provide low latency of a few microseconds and, for large messages, high bandwidth of up to 20 Gbps. Because of the unique characteristics of the NFS protocols, previous designs of NFS with RDMA were unable to exploit the improved bandwidth of networks such as InfiniBand. They also leave the server open to attacks from malicious clients. In this paper, we discuss the design principles for implementing NFS/RDMA protocols. We propose, implement and evaluate an alternate design for NFS/RDMA on InfiniBand, which can significantly improve the security of the server compared to the previous design. In addition, we evaluate the performance bottlenecks of using RDMA operations in NFS protocols and propose strategies and designs that tackle these overheads. With the best of these strategies and designs, we demonstrate throughput of 700 MB/s with the OpenSolaris NFS/RDMA design and 900 MB/s with the Linux design, and an application-level improvement in performance of up to 50%. We also evaluate the scalability of the RDMA transport in a multi-client setting, with a RAID array of disks. Our design has been integrated into the OpenSolaris kernel.
international parallel and distributed processing symposium | 2007
Ranjit Noronha; Dhabaleswar K. Panda
Modern multi-core architectures have become popular because of the limitations of deep pipelines and because of heat and power concerns. Some of these multi-core architectures, such as the Intel Xeon, can run several threads on a single core. The OpenMP standard for compiler-directive-based shared memory programming gives the developer an easy path to writing multi-threaded programs and is a natural fit for multi-core architectures. The OpenMP standard uses loop parallelism as a basis for work division among multiple threads. These loops usually use arrays in their computation, with different data distributions and access patterns. The performance of accesses to these arrays may be impacted by the underlying page size, depending on the frequency and strides of these accesses. In this paper, we discuss the issues and potential benefits of using large pages for OpenMP applications. We design an OpenMP implementation capable of using large pages and evaluate the impact of the large page support available in most modern processors on the performance and scalability of parallel OpenMP applications. Results show an improvement in performance of up to 25% for some applications. Large pages also help improve the scalability of these applications.
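The paper's support lives inside the OpenMP implementation itself, but the underlying effect can be illustrated at the application level. The sketch below is a hypothetical example that backs an OpenMP work array with 2 MB huge pages via Linux's MAP_HUGETLB, so that strided accesses in the parallel loops incur fewer TLB misses; the array size and the fallback policy are illustrative only.

```c
/* Hypothetical sketch: back an OpenMP work array with huge pages.
 * Requires hugepages to be reserved on the system beforehand,
 * e.g. via /proc/sys/vm/nr_hugepages. */
#define _GNU_SOURCE
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define N (64UL * 1024 * 1024)   /* 64M doubles = 512 MB, a multiple of 2 MB */

int main(void)
{
    double *a = mmap(NULL, N * sizeof(double), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (a == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");          /* fall back to normal pages */
        a = malloc(N * sizeof(double));
        if (!a) return 1;
    }

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        a[i] = 2.0 * i;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g (threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}
```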
international conference on parallel processing | 2008
Ranjit Noronha; Dhabaleswar K. Panda
With the rapid advances in computing technology, there is an explosion in media that needs to be collected, cataloged, stored and accessed. With the speed of disks not keeping pace with improvements in processor and network speed, the ability of network file systems to provide data to demanding applications at an appropriate rate is diminishing. In this paper, we propose to enhance the performance of network file systems by providing an intermediate bank of cache servers (IMCa) between the client and the server. Whenever possible, file system operations from the client are serviced from the cache bank. We evaluate IMCa with a number of different benchmarks. The results of these experiments demonstrate that the intermediate cache architecture can reduce the latency of certain operations by up to 82% over the native implementation and by up to 86% compared with the Lustre file system. In addition, we also see an improvement in the performance of data transfer operations in most cases and for most scenarios. Finally, the caching hierarchy helps us achieve better scalability of file system operations.
cluster computing and the grid | 2004
Ranjit Noronha; Dhabaleswar K. Panda
Software DSM systems do not perform well because of the combined effects of increased communication, slow networks and the large overhead associated with processing the coherence protocol. Modern interconnects like Myrinet, Quadrics and InfiniBand offer reliable, low-latency (around 5.0 μs point-to-point) and high-bandwidth (10.0 Gbps for 4X InfiniBand) communication. These networks also support efficient memory-based communication primitives like RDMA Read and RDMA Write, which allow remote reading and writing of data, respectively, without receiver intervention. This support can be leveraged to reduce overhead in a software DSM system. In this paper, we propose a new scheme, NGDSM, with two components, ARDMAR and DRAW, for page fetching and diffing, respectively. These components employ RDMA and atomic operations to efficiently implement the coherence scheme. Evaluated on an 8-node InfiniBand cluster using SPLASH-2 and TreadMarks applications, NGDSM shows better parallel speedup and scalability.
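One plausible way InfiniBand atomics can serve a coherence protocol is one-sided acquisition of a remote lock word. The sketch below illustrates that idea with a remote compare-and-swap; it is an assumption-laden illustration, not the ARDMAR or DRAW design itself, and all names are hypothetical.

```c
/* Hypothetical sketch: try to acquire an 8-byte remote lock word by
 * compare-and-swapping it from 0 (free) to the caller's id, without
 * involving the remote CPU. Assumes a connected QP, a registered 8-byte
 * local buffer for the returned value, and the lock word's address/rkey
 * known from connection setup. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int try_remote_lock(struct ibv_qp *qp, struct ibv_mr *old_val_mr,
                           uint64_t *old_val, uint64_t lock_addr,
                           uint32_t lock_rkey, uint64_t my_id)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)old_val,   /* previous remote value lands here */
        .length = sizeof(uint64_t),
        .lkey   = old_val_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = lock_addr;
    wr.wr.atomic.rkey        = lock_rkey;
    wr.wr.atomic.compare_add = 0;       /* expect the lock to be free */
    wr.wr.atomic.swap        = my_id;   /* install our id if it was free */

    /* After polling the completion, *old_val == 0 means the lock was won. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```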
international conference on parallel processing | 2008
Sundeep Narravula; Hari Subramoni; Ping Lai; Ranjit Noronha; Dhabaleswar K. Panda
High performance interconnects such as InfiniBand (IB) have enabled large scale deployments of High Performance Computing (HPC) systems. High performance communication and IO middleware such as MPI and NFS over RDMA have also been redesigned to leverage the performance of these modern interconnects. With the advent of long-haul InfiniBand (IB WAN), IB applications now have inter-cluster reach. While this technology is intended to enable high performance network connectivity across WAN links, it is important to study and characterize the actual performance that existing IB middleware achieves in these emerging IB WAN scenarios. In this paper, we study and analyze the performance characteristics of three HPC middleware layers: (i) IPoIB (IP traffic over IB), (ii) MPI and (iii) NFS over RDMA. We utilize Obsidian IB WAN routers for inter-cluster connectivity. Our results show that many of the applications absorb smaller network delays fairly well. However, most approaches are severely impacted in high-delay scenarios. Further, communication protocols need to be optimized for higher-delay scenarios to improve performance. We propose several such optimizations to improve communication performance. Our experimental results show that techniques such as WAN-aware protocols, transferring data using large messages (message coalescing) and using parallel data streams can improve communication performance (by up to 50%) in high-delay scenarios. Overall, these results demonstrate that IB WAN technologies can make a cluster-of-clusters architecture a feasible platform for HPC systems.
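Of the proposed optimizations, message coalescing is the simplest to illustrate: small messages are staged locally and shipped as one large transfer, so the WAN delay is paid per batch rather than per message. The sketch below is a hypothetical socket-based illustration; the 64 KB threshold and the structure name are not from the paper.

```c
/* Hypothetical message coalescer: small messages accumulate in a staging
 * buffer and are flushed as one large send over the high-delay link. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define COALESCE_BYTES (64 * 1024)

struct coalescer {
    int    fd;                      /* connected socket to the remote cluster */
    size_t used;
    char   buf[COALESCE_BYTES];
};

static int coalescer_flush(struct coalescer *c)
{
    size_t off = 0;
    while (off < c->used) {
        ssize_t n = send(c->fd, c->buf + off, c->used - off, 0);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    c->used = 0;
    return 0;
}

static int coalescer_send(struct coalescer *c, const void *msg, size_t len)
{
    if (len > sizeof(c->buf)) {                 /* oversized messages bypass the buffer */
        if (coalescer_flush(c) < 0)
            return -1;
        return send(c->fd, msg, len, 0) == (ssize_t)len ? 0 : -1;
    }
    if (c->used + len > sizeof(c->buf) && coalescer_flush(c) < 0)
        return -1;
    memcpy(c->buf + c->used, msg, len);
    c->used += len;
    return 0;
}
```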
petascale data storage workshop | 2007
Lei Chai; Ranjit Noronha; Dhabaleswar K. Panda
The computing power of clusters has been rapidly growing toward petascale capability, which requires petascale I/O systems to provide data in a sustained, high-throughput manner. The Network File System (NFS), a ubiquitous standard used in most existing clusters, has shown a performance bottleneck associated with its single-server model. pNFS, a parallel version of NFS, has been proposed in this context to eliminate the performance bottleneck while maintaining the ease of management and interoperability of NFS. With InfiniBand being one of the most popular high speed networks for clusters, whether pNFS can exploit the advantages of InfiniBand is an interesting and important question. It is also important to quantify and understand the potential benefits of using pNFS compared with single-server NFS, and the possible overhead associated with pNFS. However, since pNFS is relatively new, few such studies have been carried out in an InfiniBand cluster environment. In this paper we design and carry out a set of experiments to study the performance and scalability of pNFS, using PVFS2 as the backend file system. The aim is to understand the characteristics of pNFS and its feasibility as the parallel file system solution for clusters. From our experimental results we observe that pNFS can take advantage of high speed networks such as InfiniBand, achieving up to 480% improvement in throughput compared with using GigE as the transport. pNFS can eliminate the single-server bottleneck associated with NFS: pNFS/PVFS2 shows significantly higher throughput and better scalability compared with NFS/PVFS2. pNFS/PVFS2 achieves a peak write throughput of about 490 MB/s and a read throughput of about 2,250 MB/s with 4 I/O servers; with 8 I/O servers, the numbers are 754 MB/s and 3,100 MB/s. Further, we find that pNFS adds little overhead and achieves almost the same throughput as the backend PVFS2 file system. Our results indicate that pNFS is promising as the parallel file system solution for clusters.
ieee international conference on high performance computing, data, and analytics | 2008
Ranjit Noronha; Dhabaleswar K. Panda
Large scale scientific and commercial applications consume and produce petabytes of data. This data needs to be safely stored, cataloged and reproduced with high performance. The current generation of single-headed NAS (Network Attached Storage) based systems such as NFS is not able to provide an acceptable level of performance to these types of demanding applications. Clustered NAS systems have evolved to meet the storage demands of these demanding applications. However, the performance of these clustered NAS solutions is limited by the communication protocol being used, usually TCP/IP. In this paper, we propose, design and evaluate a clustered NAS: pNFS over RDMA on InfiniBand. Our results show that for a sequential workload on 8 data servers, the pNFS over RDMA design can achieve a peak aggregate read throughput of up to 5,029 MB/s, a maximum improvement of 188% over the TCP/IP transport, and a write throughput of 1,872 MB/s, a maximum improvement of 150% over the corresponding TCP/IP transport throughput. Evaluations with other types of workloads and traces show an improvement in performance of up to 27%. Finally, our design of pNFS over RDMA improves the performance of BTIO relative to the Lustre file system.