Publications


Featured research published by Jithin Jose.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

High performance RDMA-based design of HDFS over InfiniBand

Nusrat Sharmin Islam; Md. Wasi-ur Rahman; Jithin Jose; Raghunath Rajachandrasekar; Hao Wang; Hari Subramoni; Chet Murthy; Dhabaleswar K. Panda

Hadoop Distributed File System (HDFS) acts as the primary storage of Hadoop and has been adopted by well-known organizations (Facebook, Yahoo!, etc.) due to its portability and fault tolerance. The existing implementation of HDFS uses the Java socket interface for communication, which delivers suboptimal performance in terms of latency and throughput. For data-intensive applications, network performance becomes a key component as the amount of data being stored and replicated to HDFS increases. In this paper, we present a novel design of HDFS using Remote Direct Memory Access (RDMA) over InfiniBand via JNI interfaces. Experimental results show that, for 5 GB HDFS file writes, the new design reduces the communication time by 87% and 30% over 1 Gigabit Ethernet (1GigE) and IP-over-InfiniBand (IPoIB), respectively, on a QDR platform (32 Gbps). For HBase, the Put operation performance is improved by 26% with our design. To the best of our knowledge, this is the first design of HDFS over InfiniBand networks.
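
The abstract attributes the gains to plugging RDMA into HDFS's Java code path through JNI. As a rough illustration of that bridging pattern (not the actual HDFS-over-InfiniBand code; the class and method names below are invented), here is a minimal C sketch of a JNI native method that accepts a direct ByteBuffer from Java and obtains a raw pointer that a native RDMA engine could register and transmit:

```c
/* Hypothetical C side of a JNI bridge, loosely in the spirit of the
 * "RDMA via JNI" approach described above. Class and method names are
 * invented for illustration. Build as a shared library, e.g.:
 *   gcc -shared -fPIC -I$JAVA_HOME/include -I$JAVA_HOME/include/linux \
 *       bridge.c -o libbridge.so
 */
#include <jni.h>
#include <stdio.h>

/* Matching (hypothetical) Java declaration:
 *   package org.example.rdma;
 *   class NativeTransport { native long sendBlock(java.nio.ByteBuffer buf, int len); }
 * The buffer must be a direct ByteBuffer so native code can see its
 * memory without an extra copy. */
JNIEXPORT jlong JNICALL
Java_org_example_rdma_NativeTransport_sendBlock(JNIEnv *env, jobject self,
                                                jobject direct_buf, jint len)
{
    (void)self;
    void *addr = (*env)->GetDirectBufferAddress(env, direct_buf);
    jlong  cap = (*env)->GetDirectBufferCapacity(env, direct_buf);

    if (addr == NULL || cap < len)
        return -1;   /* not a direct buffer, or length exceeds capacity */

    /* A real design would register 'addr' with the HCA (ibv_reg_mr) and
     * post an RDMA operation here; this sketch only shows the zero-copy
     * handoff from the JVM to native code. */
    printf("native side received %d bytes at %p\n", (int)len, addr);
    return (jlong)len;
}
```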


International Conference on Parallel Processing | 2011

Memcached Design on High Performance RDMA Capable Interconnects

Jithin Jose; Hari Subramoni; Miao Luo; Minjia Zhang; Jian Huang; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hao Wang; Sayantan Sur; Dhabaleswar K. Panda

Memcached is a key-value distributed memory object caching system. It is used widely in data-center environments for caching the results of database calls, API calls, or any other data. Using Memcached, spare memory in data-center servers can be aggregated to speed up lookups of frequently accessed information. The performance of Memcached is directly related to the underlying networking technology, as workloads are often latency sensitive. The existing Memcached implementation is built upon the BSD Sockets interface. Sockets offers byte-stream oriented semantics. Therefore, using Sockets, there is a conversion between Memcached's memory-object semantics and Sockets' byte-stream semantics, imposing an overhead. This is in addition to any extra memory copies in the Sockets implementation within the OS. Over the past decade, high performance interconnects have employed Remote Direct Memory Access (RDMA) technology to provide excellent performance for the scientific computation domain. In addition to its high raw performance, the memory-based semantics of RDMA fit very well with Memcached's memory-object model. While the Sockets interface can be ported to use RDMA, it is not very efficient when compared with low-level RDMA APIs. In this paper, we describe a novel design of Memcached for RDMA-capable networks. Our design extends the existing open-source Memcached software and makes it RDMA capable. We provide a detailed performance comparison of our Memcached design against unmodified Memcached using Sockets over RDMA and a 10 Gigabit Ethernet network with hardware-accelerated TCP/IP. Our performance evaluation reveals that the latency of a 4 KB Memcached Get can be brought down to 12 µs using ConnectX InfiniBand QDR adapters. Latency of the same operation using older-generation DDR adapters is about 20 µs. These numbers are about a factor of four better than the performance obtained by using 10GigE with TCP Offload. In addition, these latencies of Get requests over a range of message sizes are better by a factor of five to ten compared to IP over InfiniBand and Sockets Direct Protocol over InfiniBand. Further, the throughput of small Get operations can be improved by a factor of six when compared to Sockets over a 10 Gigabit Ethernet network. A similar factor-of-six improvement in throughput is observed over Sockets Direct Protocol using ConnectX QDR adapters. To the best of our knowledge, this is the first such Memcached design on high performance RDMA-capable interconnects.
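
The contrast the abstract draws between byte-stream sockets and RDMA's memory semantics rests on one primitive: a registered memory region whose keys a peer can use for direct access. The sketch below is a minimal use of the standard libibverbs API (assuming libibverbs and an InfiniBand HCA are present; it is not the paper's RDMA-Memcached code) showing just that registration step:

```c
/* Minimal libibverbs sketch: open the first HCA, allocate a protection
 * domain, and register a buffer for RDMA access. This is the generic
 * setup step behind any RDMA-based key-value transport.
 * Compile with: gcc reg.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!ctx || !pd) { fprintf(stderr, "device setup failed\n"); return 1; }

    size_t len = 4096;
    void *buf = malloc(len);
    memset(buf, 0, len);

    /* Pin the buffer and allow remote reads/writes into it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* A peer that learns (addr, rkey) can RDMA-read/write this region
     * directly, which is the "memory-object" fit the abstract describes. */
    printf("registered %zu bytes: addr=%p lkey=0x%x rkey=0x%x\n",
           len, buf, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```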


International Conference on Parallel Processing | 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand

Xiaoyi Lu; Nusrat Sharmin Islam; Md. Wasi-ur-Rahman; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

Hadoop RPC is the basic communication mechanism in the Hadoop ecosystem. It is used with other Hadoop components like MapReduce, HDFS, and HBase in real-world data-centers, e.g., Facebook and Yahoo!. However, the current Hadoop RPC design is built on the Java sockets interface, which limits its potential performance. The High Performance Computing community has exploited high throughput and low latency networks such as InfiniBand for many years. In this paper, we first analyze the performance of the current Hadoop RPC design by unearthing buffer management and communication bottlenecks that are not apparent on slower-speed networks. Then we propose a novel design (RPCoIB) of Hadoop RPC with RDMA over InfiniBand networks. RPCoIB provides a JVM-bypassed buffer management scheme and utilizes message size locality to avoid multiple memory allocations and copies in data serialization and deserialization. Our performance evaluations reveal that the basic ping-pong latencies for varied data sizes are reduced by 42%-49% and 46%-50% compared with 10GigE and IPoIB QDR (32 Gbps), respectively, while the RPCoIB design also improves the peak throughput by 82% and 64% compared with 10GigE and IPoIB. Compared to default Hadoop over IPoIB QDR, our RPCoIB design improves the performance of the Sort benchmark on 64 compute nodes by 15% and the performance of the CloudBurst application by 10%. We also present thorough, integrated evaluations of our RPCoIB design with other research directions that optimize HDFS and HBase using RDMA over InfiniBand. Compared with their best performance, we observe 10% improvement for HDFS-IB and 24% improvement for HBase-IB. To the best of our knowledge, this is the first such design of the Hadoop RPC system over high performance networks such as InfiniBand.
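
The JVM-bypassed buffer manager and the use of message size locality are the parts of RPCoIB that avoid repeated allocation and copying. A toy C sketch of the general idea, not the actual RPCoIB scheme, rounds each request up to a power-of-two size class and recycles freed buffers of that class:

```c
/* Sketch of a size-class buffer pool exploiting message-size locality:
 * requests are rounded up to a power of two and freed buffers are kept
 * on a per-class free list for reuse. Illustrative only; requests larger
 * than the biggest class are not handled. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_SHIFT 6                     /* smallest class: 64 bytes */
#define MAX_SHIFT 20                    /* largest class: 1 MiB     */
#define NCLASSES (MAX_SHIFT - MIN_SHIFT + 1)

struct buf { struct buf *next; size_t cls; char data[]; };

static struct buf *freelist[NCLASSES];

static size_t class_of(size_t n) {
    size_t cls = 0, sz = (size_t)1 << MIN_SHIFT;
    while (sz < n && cls < NCLASSES - 1) { sz <<= 1; cls++; }
    return cls;
}

void *pool_get(size_t n) {
    size_t cls = class_of(n);
    struct buf *b = freelist[cls];
    if (b) {                            /* reuse: the common case under locality */
        freelist[cls] = b->next;
    } else {
        b = malloc(sizeof *b + ((size_t)1 << (MIN_SHIFT + cls)));
        b->cls = cls;
    }
    return b->data;
}

void pool_put(void *p) {
    struct buf *b = (struct buf *)((char *)p - offsetof(struct buf, data));
    b->next = freelist[b->cls];
    freelist[b->cls] = b;
}

int main(void) {
    void *a = pool_get(300);            /* lands in the 512-byte class */
    pool_put(a);
    void *c = pool_get(400);            /* same class: recycled, no new malloc */
    printf("recycled: %s\n", a == c ? "yes" : "no");
    return 0;
}
```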


International Parallel and Distributed Processing Symposium | 2012

High-Performance Design of HBase with RDMA over InfiniBand

Jian Huang; Jithin Jose; Md. Wasi-ur-Rahman; Hao Wang; Miao Luo; Hari Subramoni; Chet Murthy; Dhabaleswar K. Panda

HBase is an open source distributed key/value store based on the idea of BigTable. It is being used in many data-center applications (e.g., Facebook, Twitter) because of its portability and massive scalability. For this kind of system, low latency and high throughput are expected when supporting services for large-scale concurrent accesses. However, the existing HBase implementation is built upon the Java Sockets Interface, which provides sub-optimal performance due to the overhead of cross-platform portability. The byte-stream oriented Java sockets semantics limit the ability to leverage new generations of network technologies. This makes it hard to provide high performance services for data-intensive applications. The High Performance Computing (HPC) domain has exploited high performance and low latency networks such as InfiniBand for many years. These interconnects provide advanced network features, such as Remote Direct Memory Access (RDMA), to achieve high throughput and low latency along with low CPU utilization. RDMA follows memory-block semantics, which can be adopted efficiently to satisfy the object transmission primitives used in HBase. In this paper, we present a novel design of HBase for RDMA-capable networks via the Java Native Interface (JNI). Our design extends the existing open-source HBase software and makes it RDMA capable. Our performance evaluation reveals that the latency of HBase Get operations of 1 KB message size can be reduced to 43.7 μs with the new design on a QDR platform (32 Gbps). This is about a factor of 3.5 improvement over a 10 Gigabit Ethernet (10GigE) network with TCP Offload. Throughput evaluations using four HBase region servers and 64 clients indicate that the new design boosts throughput by 3X over 1GigE and 10GigE networks. To the best of our knowledge, this is the first HBase design utilizing high performance RDMA-capable interconnects.
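
The headline numbers above are per-operation latencies in the tens of microseconds. Purely as an illustration of how such figures are typically obtained (the measured operation is a stub here, not an actual HBase Get), a monotonic-clock timing loop looks like this:

```c
/* Generic microsecond latency measurement loop; the measured operation
 * is a placeholder standing in for a 1 KB Get round trip. */
#include <stdio.h>
#include <time.h>

static void do_operation(void) {
    struct timespec ts = {0, 10000};        /* simulate ~10 us of work */
    nanosleep(&ts, NULL);
}

int main(void) {
    const int iters = 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        do_operation();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average latency: %.2f us over %d operations\n",
           usec / iters, iters);
    return 0;
}
```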


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand

Wasi-ur-Rahman; Nusrat Sharmin Islam; Xiaoyi Lu; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

MapReduce is a very popular programming model used to handle large datasets in enterprise data centers and clouds. Although various implementations of MapReduce exist, Hadoop MapReduce is the most widely used in large data centers like Facebook, Yahoo! and Amazon due to its portability and fault tolerance. Network performance plays a key role in determining the performance of data-intensive applications using Hadoop MapReduce, as the data required by the map and reduce processes can be distributed across the cluster. In this context, data center designers have been looking at high performance interconnects such as InfiniBand to enhance the performance of their Hadoop MapReduce based applications. However, achieving better performance through the use of high performance interconnects like InfiniBand is not a trivial task. It requires a careful redesign of the communication framework inside MapReduce, since several assumptions made for the current socket-based communication do not hold true for high performance interconnects. In this paper, we propose an RDMA-based design of Hadoop MapReduce over InfiniBand with several design elements: data shuffle over InfiniBand, an in-memory merge mechanism for the Reducer, and data pre-fetching for the Mapper. We perform our experiments on native InfiniBand using Remote Direct Memory Access (RDMA) and compare our results with those of Hadoop-A [1] and default Hadoop over different interconnects and protocols. For all these experiments, we perform network-level parameter tuning and use optimum values for each Hadoop design. Our performance results show that, for a 100 GB TeraSort running on an eight-node cluster, we achieve a performance improvement of 32% over IP-over-InfiniBand (IPoIB) and 21% over Hadoop-A. With multiple disks per node, this benefit rises to 39% over IPoIB and 31% over Hadoop-A.
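
Of the three design elements, the in-memory merge for the Reducer is the easiest to sketch in isolation. The toy C program below performs the core k-way merge of sorted in-memory segments (plain integer arrays here; the real system merges serialized key/value runs fetched over RDMA):

```c
/* Minimal k-way in-memory merge of sorted segments, the core operation
 * behind a reducer-side in-memory merge. Illustrative only. */
#include <stdio.h>

#define NSEG 3

int main(void) {
    int seg0[] = {1, 4, 9};
    int seg1[] = {2, 3, 10};
    int seg2[] = {5, 6, 7, 8};
    const int *seg[NSEG] = {seg0, seg1, seg2};
    int len[NSEG] = {3, 3, 4};
    int pos[NSEG] = {0, 0, 0};

    int remaining = len[0] + len[1] + len[2];
    printf("merged:");
    while (remaining-- > 0) {
        int best = -1;
        for (int s = 0; s < NSEG; s++) {        /* pick the smallest head */
            if (pos[s] < len[s] &&
                (best < 0 || seg[s][pos[s]] < seg[best][pos[best]]))
                best = s;
        }
        printf(" %d", seg[best][pos[best]++]);
    }
    printf("\n");
    return 0;
}
```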


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

Jithin Jose; Mingzhe Li; Xiaoyi Lu; Krishna Chaitanya Kandalla; Mark Daniel Arnold; Dhabaleswar K. Panda

High Performance Computing (HPC) systems are becoming increasingly complex and are also associated with very high operational costs. The cloud computing paradigm, coupled with modern Virtual Machine (VM) technology, offers attractive techniques to easily manage large scale systems, while significantly bringing down the cost of computation, memory and storage. However, running HPC applications on cloud systems still remains a major challenge. One of the biggest hurdles in realizing this objective is the performance offered by virtualized computing environments, more specifically, virtualized I/O devices. Since HPC applications and communication middlewares rely heavily on advanced features offered by modern high performance interconnects such as InfiniBand, the performance of virtualized InfiniBand interfaces is crucial. Emerging hardware-based solutions, such as Single Root I/O Virtualization (SR-IOV), offer an attractive alternative to existing software-based solutions. The benefits of SR-IOV have been widely studied for GigE and 10GigE networks. However, with InfiniBand networks being increasingly adopted in the cloud computing domain, it is critical to fully understand the performance benefits of SR-IOV in InfiniBand networks, especially for exploring the performance characteristics and trade-offs of HPC communication middlewares (such as Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models) and applications. To the best of our knowledge, this is the first paper that offers an in-depth analysis of SR-IOV with InfiniBand. Our experimental evaluations show that the performance of MPI and PGAS point-to-point communication benchmarks over SR-IOV with InfiniBand is comparable to that of native InfiniBand hardware for most message lengths. However, we observe that the performance of MPI collective operations over SR-IOV with InfiniBand is inferior to the native (non-virtualized) mode. We also evaluate the trade-offs of various VM-to-CPU mapping policies on modern multi-core architectures and present our experiences.
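
With SR-IOV, the guest is assigned a virtual function that appears to it as an ordinary InfiniBand device. A small sketch using the standard libibverbs enumeration call (assuming libibverbs is installed in the guest) shows how one can confirm which HCAs, physical or virtual, the verbs layer actually exposes:

```c
/* Enumerate the InfiniBand devices visible to this (virtual) machine via
 * the standard verbs API; under SR-IOV a guest typically sees its assigned
 * virtual function here. Compile with: gcc list.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    printf("%d RDMA device(s) visible:\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s (node GUID 0x%016llx, network byte order)\n",
               ibv_get_device_name(devs[i]),
               (unsigned long long)ibv_get_device_guid(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}
```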


Cluster Computing and the Grid | 2012

Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports

Jithin Jose; Hari Subramoni; Krishna Chaitanya Kandalla; Md. Wasi-ur-Rahman; Hao Wang; Sundeep Narravula; Dhabaleswar K. Panda

Memcached is a general-purpose key-value based distributed memory object caching system. It is widely used in the data-center domain for caching the results of database calls, API calls or page rendering. An efficient Memcached design is critical to achieve high transaction throughput and scalability. Previous research in the field has shown that the use of high performance interconnects like InfiniBand can dramatically improve the performance of Memcached. Reliable Connection (RC) is the most commonly used transport model in InfiniBand implementations. However, it has been shown that the RC transport imposes scalability issues due to high memory consumption per connection. Such a characteristic is not favorable for middlewares like Memcached, where the server is required to serve thousands of clients. The Unreliable Datagram (UD) transport offers higher scalability, but has several other limitations which need to be handled efficiently. In this context, we introduce a hybrid transport model which takes advantage of the best features of RC and UD to deliver scalability and performance higher than that of a single transport. To the best of our knowledge, this is the first effort aimed at studying the impact of using a hybrid of multiple transport protocols on Memcached performance. We present comprehensive performance analysis using micro-benchmarks, application benchmarks and realistic industry workloads. Our performance evaluations reveal that our hybrid transport delivers performance comparable to that of RC, while maintaining a steady memory footprint. Memcached Get latency for 4-byte data is 4.28 μs and 4.86 μs for the RC and hybrid transports, respectively. This represents a factor of twelve improvement over the performance of SDP. In evaluations using the Apache Olio benchmark with 1,024 clients, Memcached execution times using the RC, UD and hybrid transports are 1.61, 1.96 and 1.70 seconds, respectively. Further, our scalability analysis with 4,096 client connections reveals that our proposed hybrid transport achieves good memory scalability.
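
The essence of the hybrid model is to keep a bounded number of memory-hungry RC connections while serving the remaining clients over the more scalable UD transport. The toy selection policy below is an invented illustration of that idea; the budget and the policy are not the paper's actual algorithm:

```c
/* Toy sketch of a hybrid transport selection policy: hold at most a fixed
 * budget of per-connection RC resources and serve the rest over UD. The
 * numbers and the policy are illustrative inventions. */
#include <stdio.h>

enum transport { TRANSPORT_RC, TRANSPORT_UD };

#define RC_BUDGET 128              /* max RC connections we are willing to hold */

static int rc_in_use = 0;

static enum transport pick_transport(void)
{
    if (rc_in_use < RC_BUDGET) {
        rc_in_use++;
        return TRANSPORT_RC;       /* low latency, but per-connection memory */
    }
    return TRANSPORT_UD;           /* scalable: one QP can serve many peers */
}

int main(void)
{
    int rc = 0, ud = 0;
    for (int client = 0; client < 4096; client++) {
        if (pick_transport() == TRANSPORT_RC) rc++;
        else ud++;
    }
    printf("4096 clients -> %d over RC, %d over UD\n", rc, ud);
    return 0;
}
```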


International Supercomputing Conference | 2013

Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models

Jithin Jose; Sreeram Potluri; Karen Tomko; Dhabaleswar K. Panda

MPI has been the de-facto programming model for scientific parallel applications. However, it is hard to extract the maximum performance for irregular, data-driven applications using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. The lower overhead of one-sided communication and the global view of data in PGAS models have the potential to increase performance at scale. In this study, we take up the 'Concurrent Search' kernel of Graph500, a highly data-driven, irregular benchmark, and redesign it using both MPI and OpenSHMEM constructs. We also implement load balancing in Graph500. Our performance evaluations using MVAPICH2-X (a unified MPI+PGAS communication runtime over InfiniBand) indicate a 59% reduction in execution time for the hybrid design, compared to the best-performing MPI-based design, at 8,192 cores.
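
A minimal hybrid MPI+OpenSHMEM program, assuming a unified runtime such as MVAPICH2-X that permits both models in one executable, gives a flavor of the constructs involved: a one-sided shmem put into a neighbor's symmetric memory followed by an MPI collective. This is only a sketch, not the redesigned Graph500 kernel:

```c
/* Minimal hybrid MPI+OpenSHMEM sketch (assumes a unified runtime such as
 * MVAPICH2-X). Each PE puts a value one-sidedly into its right neighbor's
 * symmetric memory, then an MPI collective sums the delivered values. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same object exists on every PE. */
    long *inbox = (long *)shmem_malloc(sizeof(long));
    *inbox = -1;
    shmem_barrier_all();

    /* One-sided put into the right neighbor; no matching receive needed. */
    long payload = 100 + me;
    shmem_long_put(inbox, &payload, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* MPI collective over the same set of processes on the unified runtime. */
    long local = *inbox, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("sum of delivered payloads across %d PEs = %ld\n", npes, global);

    shmem_free(inbox);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```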


International Conference on Parallel Processing | 2012

Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation

Jithin Jose; Krishna Chaitanya Kandalla; Miao Luo; Dhabaleswar K. Panda

Message Passing Interface (MPI) has been the de-facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. OpenSHMEM is a library-based implementation of the PGAS model, and it aims to standardize the SHMEM model to achieve performance, programmability and portability. However, OpenSHMEM is an emerging standard and it is unlikely that an entire application will be re-written with it. Instead, it is more likely that applications will continue to be written with MPI as the primary model, but parts of them will be re-designed with newer models. This requires the underlying communication libraries to be designed with support for multiple programming models. In this paper, we propose a high performance, scalable unified communication library that supports both MPI and OpenSHMEM for InfiniBand clusters. To the best of our knowledge, this is the first effort in unifying MPI and OpenSHMEM communication libraries. Our proposed designs take advantage of InfiniBand's advanced features to significantly improve the communication performance of various atomic and collective operations defined in the OpenSHMEM specification. Hybrid (MPI+OpenSHMEM) parallel applications can benefit from our proposed library to achieve better efficiency and scalability. Our studies show that our proposed designs can improve the performance of OpenSHMEM's atomic and collective operations by up to 41%. We observe that our designs improve the performance of the 2D-Heat Modeling benchmark (pure OpenSHMEM) by up to 45%. We also observe that our unified communication library can improve the performance of the hybrid (MPI+OpenSHMEM) version of the Graph500 benchmark by up to 35%. Moreover, our studies also indicate that our proposed designs lead to lower memory consumption due to efficient utilization of network resources.
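
Among the operations the unified library accelerates are OpenSHMEM atomics. The short sketch below shows the kind of call involved: every PE performs a fetch-and-add on a counter in PE 0's symmetric memory, using the classic pre-1.4 name shmem_long_fadd. It illustrates the interface only, not the proposed implementation:

```c
/* OpenSHMEM atomic sketch: every PE atomically increments a counter that
 * lives in PE 0's symmetric memory. This is the class of operation
 * (atomics/collectives) whose implementation the paper accelerates. */
#include <shmem.h>
#include <stdio.h>

long counter = 0;   /* global variable => symmetric: exists on every PE */

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    shmem_barrier_all();

    /* Atomically add 1 to the copy of 'counter' on PE 0 and fetch the
     * previous value; a runtime may map this to an InfiniBand hardware
     * atomic, which is the feature the paper exploits. */
    long before = shmem_long_fadd(&counter, 1, 0);
    (void)before;

    shmem_barrier_all();
    if (me == 0)
        printf("counter on PE 0 after %d atomic increments: %ld\n",
               npes, counter);

    shmem_finalize();
    return 0;
}
```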


International Conference on Parallel Processing | 2012

SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks

Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Jithin Jose; Miao Luo; Hao Wang; Dhabaleswar K. Panda

Many applications cache huge amounts of data in RAM to achieve high performance. A good example is Memcached, a distributed-memory object-caching software. Memcached performance directly depends on the aggregated memory pool size. Given the constraints of hardware cost, power/thermal concerns and floor-plan limits, it is difficult to further scale the memory pool by packing more RAM into individual servers, or by expanding the server array horizontally. In this paper, we propose an SSD-Assisted Hybrid Memory that expands RAM with SSD to make a large amount of memory available. Hybrid memory works as an object cache and manages resource allocation at object granularity, which is more efficient than allocation at page granularity. It leverages the fast random-read property of SSDs to achieve low-latency object access, and it organizes the SSD into a log-structured sequence of blocks to overcome SSD write anomalies. Compared to alternatives that use the SSD as a virtual memory swap device, hybrid memory reduces the random access latency by 68% and 72% for read and write operations, respectively, and improves operation throughput by 15.3 times. Additionally, it reduces write traffic to the SSD by 81%, which implies a 5.3-times improvement in SSD lifetime. We have integrated our hybrid memory design into Memcached. Our experiments indicate a 3.7X reduction in Memcached Get operation latency and up to a 5.3X improvement in operation throughput. To the best of our knowledge, this paper is the first work that integrates cutting-edge SSDs and InfiniBand verbs into Memcached to accelerate its performance.
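
The two ingredients the abstract names, object-granularity management and a log-structured SSD layout, can be illustrated with a very small sketch: evicted objects are appended sequentially to a log file (sequential writes suit flash), while an in-memory index records each object's offset and length so a later lookup is a single random read. This is a simplification, not the paper's design:

```c
/* Simplified sketch of the log-structured idea: objects are appended
 * sequentially to an SSD-backed log file, and a tiny in-memory index maps
 * each key to (offset, length). Real designs add hashing, block alignment,
 * and log cleaning. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OBJS 1024

struct index_entry { char key[32]; long offset; size_t len; };

static struct index_entry idx[MAX_OBJS];
static int nobjs = 0;

static void log_append(FILE *log, const char *key, const void *val, size_t len)
{
    fseek(log, 0, SEEK_END);                 /* always write at the tail */
    long off = ftell(log);
    fwrite(val, 1, len, log);

    snprintf(idx[nobjs].key, sizeof idx[nobjs].key, "%s", key);
    idx[nobjs].offset = off;
    idx[nobjs].len = len;
    nobjs++;
}

static size_t log_read(FILE *log, const char *key, void *out, size_t cap)
{
    for (int i = nobjs - 1; i >= 0; i--) {   /* newest version wins */
        if (strcmp(idx[i].key, key) == 0 && idx[i].len <= cap) {
            fseek(log, idx[i].offset, SEEK_SET);
            return fread(out, 1, idx[i].len, log);
        }
    }
    return 0;
}

int main(void)
{
    FILE *log = tmpfile();                   /* stands in for the SSD log */
    log_append(log, "user:42", "hello", 5);
    log_append(log, "user:43", "world", 5);

    char buf[16] = {0};
    size_t n = log_read(log, "user:42", buf, sizeof buf - 1);
    printf("read %zu bytes: %s\n", n, buf);
    fclose(log);
    return 0;
}
```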

Collaboration


Dive into Jithin Jose's collaborations.

Top Co-Authors

Xiaoyi Lu (Ohio State University)
Miao Luo (Ohio State University)
Jie Zhang (Ohio State University)