
Publication


Featured research published by Raghunath Rajachandrasekar.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

High performance RDMA-based design of HDFS over InfiniBand

Nusrat Sharmin Islam; Md. Wasi-ur Rahman; Jithin Jose; Raghunath Rajachandrasekar; Hao Wang; Hari Subramoni; Chet Murthy; Dhabaleswar K. Panda

Hadoop Distributed File System (HDFS) acts as the primary storage of Hadoop and has been adopted by reputed organizations (Facebook, Yahoo!, etc.) due to its portability and fault tolerance. The existing implementation of HDFS uses the Java socket interface for communication, which delivers suboptimal performance in terms of latency and throughput. For data-intensive applications, network performance becomes a key component as the amount of data being stored and replicated to HDFS increases. In this paper, we present a novel design of HDFS using Remote Direct Memory Access (RDMA) over InfiniBand via JNI interfaces. Experimental results show that, for 5 GB HDFS file writes, the new design reduces the communication time by 87% and 30% over 1 Gigabit Ethernet (1GigE) and IP-over-InfiniBand (IPoIB), respectively, on a QDR platform (32 Gbps). For HBase, the performance of the Put operation is improved by 26% with our design. To the best of our knowledge, this is the first design of HDFS over InfiniBand networks.
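
The design replaces HDFS's Java socket path with RDMA reached through JNI. As a rough illustration of that boundary only, the C stub below shows the shape of a JNI entry point that a Java-side block writer could call to hand a buffer to a native transport. The package, class, and method names are hypothetical, and the RDMA post itself is stubbed out; this is not the paper's implementation.

    #include <jni.h>

    /* Hypothetical native entry point: a Java class named
     * org.example.RdmaBlockWriter with a native method
     *   private native int writeBlock(byte[] data, int len);
     * would resolve to this symbol.  The actual RDMA post
     * (e.g. via ibverbs) is elided and stubbed out. */
    static int post_rdma_write(const char *buf, int len)
    {
        /* Placeholder for pinning the buffer and posting an
         * RDMA write toward the remote DataNode. */
        (void)buf;
        return len;
    }

    JNIEXPORT jint JNICALL
    Java_org_example_RdmaBlockWriter_writeBlock(JNIEnv *env, jobject self,
                                                jbyteArray data, jint len)
    {
        (void)self;
        /* Pin or copy the Java heap buffer so the native side can
         * hand it to the transport without further JVM involvement. */
        jbyte *buf = (*env)->GetByteArrayElements(env, data, NULL);
        if (buf == NULL)
            return -1;                    /* OutOfMemoryError pending */

        int sent = post_rdma_write((const char *)buf, (int)len);

        /* JNI_ABORT: the buffer was only read, nothing to copy back. */
        (*env)->ReleaseByteArrayElements(env, data, buf, JNI_ABORT);
        return (jint)sent;
    }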


High Performance Distributed Computing | 2013

A 1 PB/s file system to checkpoint three million MPI tasks

Raghunath Rajachandrasekar; Adam Moody; Kathryn Mohror; Dhabaleswar K. Panda

With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like the Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to the compute nodes. However, most large scale systems do not provide file storage on compute nodes, preventing the use of SCR. We have implemented a novel user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed. This technique extends the reach of libraries like SCR to systems where they otherwise could not be used. Furthermore, we expose file contents for Remote Direct Memory Access, allowing external tools to copy checkpoints to the parallel file system in the background with reduced CPU interruption. Our file system scales linearly with node count and delivers 1 PB/s of throughput at three million MPI processes, which is 20x faster than the system RAM disk and 1000x faster than the parallel file system.
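
The core idea is a user-space file system that keeps checkpoint data in main memory and transparently spills to slower storage once the memory budget is exhausted. The sketch below illustrates only that tiering decision, with an assumed fixed RAM budget and a local spill file standing in for flash or the parallel file system; it is not the paper's file system.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative two-tier store: checkpoint bytes land in a RAM
     * arena first and spill to a file once the arena is full. */
    #define RAM_BUDGET (1 << 20)          /* assumed 1 MiB budget */

    static char   ram[RAM_BUDGET];
    static size_t ram_used;

    /* Write one checkpoint chunk, preferring memory over the spill path. */
    static int ckpt_write(const char *chunk, size_t len, FILE *spill)
    {
        if (ram_used + len <= RAM_BUDGET) {
            memcpy(ram + ram_used, chunk, len);   /* fast path: RAM */
            ram_used += len;
            return 0;
        }
        /* Slow path: transparently spill to flash / parallel FS. */
        return fwrite(chunk, 1, len, spill) == len ? 0 : -1;
    }

    int main(void)
    {
        FILE *spill = fopen("ckpt.spill", "wb");
        if (!spill)
            return 1;

        char chunk[4096];
        memset(chunk, 0xAB, sizeof(chunk));

        /* The first 256 chunks fit the budget; the rest spill over. */
        for (int i = 0; i < 600; i++)
            ckpt_write(chunk, sizeof(chunk), spill);

        printf("in RAM: %zu bytes, spilled: %ld bytes\n",
               ram_used, ftell(spill));
        fclose(spill);
        return 0;
    }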


International Parallel and Distributed Processing Symposium | 2015

High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA

Md. Wasi-ur-Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

Running MapReduce over modern High Performance Computing (HPC) clusters, with high-performance interconnects and parallel file systems, has attracted much attention in recent times because it solves data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g., Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that, due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform the other in different HPC environments, and our dynamic detection mechanism can deliver the best performance based on the characteristics observed during job execution. Through this design, we achieve a 44% performance benefit for shuffle-intensive workloads on leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit the performance characteristics of alternate shuffle strategies for YARN MapReduce with Lustre and RDMA.
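
The design chooses at runtime between two shuffle strategies, RDMA and Lustre Read, based on observed performance. The sketch below shows one plausible shape for such a selector: probe both paths, then dispatch through a function pointer. The probe functions, the comparison rule, and the strategy bodies are placeholders for illustration, not the paper's actual detection logic.

    #include <stdio.h>

    /* Two interchangeable shuffle strategies, selected at runtime. */
    typedef int (*shuffle_fn)(const char *map_output, size_t len);

    static int shuffle_rdma(const char *map_output, size_t len)
    {
        /* Placeholder: fetch intermediate data over RDMA. */
        (void)map_output;
        printf("RDMA shuffle of %zu bytes\n", len);
        return 0;
    }

    static int shuffle_lustre_read(const char *map_output, size_t len)
    {
        /* Placeholder: read the intermediate file from the Lustre mount. */
        (void)map_output;
        printf("Lustre Read shuffle of %zu bytes\n", len);
        return 0;
    }

    /* Hypothetical probes: measured bandwidth of each path in MB/s.
     * A real system would sample these during job execution. */
    static double probe_rdma_bw(void)        { return 3200.0; }
    static double probe_lustre_read_bw(void) { return 4100.0; }

    static shuffle_fn pick_strategy(void)
    {
        return probe_lustre_read_bw() > probe_rdma_bw()
                   ? shuffle_lustre_read
                   : shuffle_rdma;
    }

    int main(void)
    {
        shuffle_fn shuffle = pick_strategy();
        return shuffle("/lustre/job42/map_0.out", 64 << 20);
    }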


International Conference on Parallel Processing | 2012

SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks

Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Jithin Jose; Miao Luo; Hao Wang; Dhabaleswar K. Panda

Many applications cache huge amounts of data in RAM to achieve high performance. A good example is Memcached, a distributed-memory object-caching system. Memcached performance directly depends on the size of the aggregated memory pool. Given the constraints of hardware cost, power/thermal concerns, and floor-plan limits, it is difficult to further scale the memory pool by packing more RAM into individual servers or by expanding the server array horizontally. In this paper, we propose an SSD-Assisted Hybrid Memory that expands RAM with SSD to make a large amount of memory available. Hybrid memory works as an object cache and manages resource allocation at object granularity, which is more efficient than allocation at page granularity. It leverages the SSD's fast random-read property to achieve low-latency object access, and it organizes the SSD into a log-structured sequence of blocks to overcome SSD write anomalies. Compared to alternatives that use the SSD as a virtual memory swap device, hybrid memory reduces random access latency by 68% and 72% for read and write operations, respectively, and improves operation throughput by 15.3 times. Additionally, it reduces write traffic to the SSD by 81%, which implies a 5.3-times improvement in SSD lifetime. We have integrated our hybrid memory design into Memcached. Our experiments indicate a 3.7X reduction in Memcached Get operation latency and up to a 5.3X improvement in operation throughput. To the best of our knowledge, this is the first work that integrates cutting-edge SSDs and InfiniBand verbs into Memcached to accelerate its performance.
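
A central element is laying objects out on the SSD as a log-structured sequence so that writes stay sequential while reads stay random. The sketch below is a minimal version of that idea: objects are appended to a log file and located later through an in-memory offset index. The fixed-size index, the file name, and the key format are assumptions for illustration, not the paper's allocator.

    #include <stdio.h>
    #include <string.h>

    /* Minimal log-structured object store: every put() appends to the
     * end of the log (sequential SSD writes), and an in-memory index
     * remembers where each object lives for fast random reads. */
    #define MAX_OBJS 1024

    struct obj_ref { char key[32]; long off; size_t len; };

    static struct obj_ref index_[MAX_OBJS];
    static int n_objs;

    static int put(FILE *log, const char *key, const void *val, size_t len)
    {
        if (n_objs == MAX_OBJS)
            return -1;
        fseek(log, 0, SEEK_END);                 /* append-only writes */
        struct obj_ref *r = &index_[n_objs++];
        snprintf(r->key, sizeof(r->key), "%s", key);
        r->off = ftell(log);
        r->len = len;
        return fwrite(val, 1, len, log) == len ? 0 : -1;
    }

    static long get(FILE *log, const char *key, void *out, size_t cap)
    {
        for (int i = 0; i < n_objs; i++) {
            if (strcmp(index_[i].key, key) == 0 && index_[i].len <= cap) {
                fseek(log, index_[i].off, SEEK_SET);   /* random read */
                return (long)fread(out, 1, index_[i].len, log);
            }
        }
        return -1;
    }

    int main(void)
    {
        FILE *log = fopen("objects.log", "w+b");
        if (!log)
            return 1;
        put(log, "user:42", "hello", 5);
        put(log, "user:43", "world", 5);

        char buf[16] = {0};
        get(log, "user:42", buf, sizeof(buf));
        printf("user:42 -> %s\n", buf);
        fclose(log);
        return 0;
    }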


International Conference on Cluster Computing | 2010

RDMA-Based Job Migration Framework for MPI over InfiniBand

Sonya Marcarelli; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this kind of approach is unable to provide the scalability required by increasingly large jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs lengthy queuing delays. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, by using a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node and resume them there. RDMA-based process image transmission is designed to take advantage of high-performance communication in InfiniBand. Experimental results show that the job migration scheme can achieve a speedup of 4.49 times over the Checkpoint/Restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.


International Conference on Parallel Processing | 2011

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

Raghunath Rajachandrasekar; Xavier Besseron; Hao Wang; Jian Huang; Dhabaleswar K. Panda

Checkpoint/Restart (C/R) mechanisms have been widely adopted by many MPI libraries [1-3] to achieve fault tolerance. However, a major limitation of such mechanisms is the intensive I/O bottleneck caused by the need to dump the snapshots of all processes to persistent storage. Several studies have been conducted to minimize this overhead [4, 5], but most of the proposed optimizations are performed inside a specific MPI stack, checkpointing library, or application, so they are not portable enough to be applied to other MPI stacks and applications. In this paper, we propose a filesystem-based approach to alleviate this checkpoint I/O bottleneck. We propose a new filesystem, named Checkpoint-Restart File System (CRFS), which is a lightweight user-level filesystem based on FUSE (Filesystem in Userspace). CRFS is designed with Checkpoint/Restart I/O traffic in mind to efficiently handle concurrent write requests. Any software component using standard filesystem interfaces can transparently benefit from CRFS's capabilities. CRFS intercepts checkpoint file write system calls and aggregates them into fewer, bigger chunks, which are asynchronously written to the underlying filesystem for more efficient I/O. CRFS manages a flexible internal I/O thread pool to throttle concurrent I/O, alleviating contention for better I/O performance. CRFS can be mounted over any standard filesystem such as ext3, NFS, and Lustre. We have implemented CRFS and evaluated its performance using three popular C/R-capable MPI stacks: MVAPICH2, MPICH2, and OpenMPI. Experimental results show significant performance gains for all three MPI stacks. CRFS achieves up to 5.5X speedup in checkpoint writing performance to the Lustre filesystem. Similar levels of improvement are also obtained with the ext3 and NFS filesystems. To the best of our knowledge, this is the first such portable and lightweight filesystem designed for generic Checkpoint/Restart data.
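
CRFS's key move is aggregating many small checkpoint writes into fewer, larger chunks before they reach the underlying file system. The sketch below shows only that aggregation step, with a synchronous flush; in CRFS the flushes are issued asynchronously by an internal I/O thread pool, and the FUSE layer that intercepts the write() calls is omitted here. Chunk and write sizes are assumptions.

    #include <stdio.h>
    #include <string.h>

    /* Aggregate small checkpoint writes into one large chunk before
     * handing them to the underlying filesystem (ext3/NFS/Lustre). */
    #define CHUNK_SIZE (4 << 20)            /* assumed 4 MiB chunks */

    static char   chunk[CHUNK_SIZE];
    static size_t filled;

    static void flush_chunk(FILE *backing)
    {
        if (filled) {
            fwrite(chunk, 1, filled, backing);  /* one big write */
            filled = 0;
        }
    }

    /* Intercepted write: buffer the data; flush only when a chunk fills.
     * CRFS performs this flush asynchronously via an I/O thread pool. */
    static void crfs_write(FILE *backing, const char *buf, size_t len)
    {
        while (len > 0) {
            size_t room = CHUNK_SIZE - filled;
            size_t n = len < room ? len : room;
            memcpy(chunk + filled, buf, n);
            filled += n;
            buf += n;
            len -= n;
            if (filled == CHUNK_SIZE)
                flush_chunk(backing);
        }
    }

    int main(void)
    {
        FILE *backing = fopen("checkpoint.img", "wb");
        if (!backing)
            return 1;
        char small[8192];
        memset(small, 0x5A, sizeof(small));
        for (int i = 0; i < 2000; i++)      /* many small writes in ... */
            crfs_write(backing, small, sizeof(small));
        flush_chunk(backing);               /* ... few large writes out */
        fclose(backing);
        return 0;
    }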


European Conference on Parallel Processing | 2014

MapReduce over Lustre: Can RDMA-Based Approach Benefit?

Md. Wasi-ur Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

Recently, MapReduce has been getting deployed on many High Performance Computing (HPC) clusters. Different studies reveal that, by leveraging the benefits of high-performance interconnects such as InfiniBand in these clusters, faster MapReduce job execution can be obtained through additional performance-enhancing features. Although RDMA-enhanced MapReduce has been proven to provide faster solutions over the Hadoop distributed file system, its efficiency over the parallel file systems used in HPC clusters has yet to be explored. In this paper, we present a complete methodology for evaluating MapReduce over the Lustre file system to provide insights into the interactions of different system components in HPC clusters. Our performance evaluation shows that RDMA-enhanced MapReduce can achieve significant benefits in terms of execution time (49% in a 128-node HPC cluster) and resource utilization compared to the default architecture. To the best of our knowledge, this is the first attempt to evaluate RDMA-enhanced MapReduce over the Lustre file system on HPC clusters.


International Parallel and Distributed Processing Symposium | 2012

Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI

Raghunath Rajachandrasekar; Xavier Besseron; Dhabaleswar K. Panda

Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging issues. Several system middleware components, such as job schedulers and MPI implementations, provide support for both reactive and proactive mechanisms to tolerate faults. These techniques rely on external components such as system logs and infrastructure monitors to provide information about hardware/software failures, either through detection or as a prediction. However, these middleware work in isolation, without disseminating the knowledge of faults encountered. In this context, we propose a lightweight multi-threaded service, FTB-IPMI, which provides distributed fault monitoring using the Intelligent Platform Management Interface (IPMI) and coordinated propagation of fault information using the Fault-Tolerance Backplane (FTB). In essence, it serves as a middleman between system hardware and the software stack by translating raw hardware events into structured software events and delivering them to any interested component using a publish-subscribe framework. Fault predictors and other decision-making engines that rely on distributed failure information can benefit from FTB-IPMI to facilitate proactive fault-tolerance mechanisms such as preemptive job migration. We have developed a fault-prediction engine within MVAPICH2, an RDMA-based MPI implementation, to demonstrate this capability. Failure predictions made by this engine are used to trigger migration of processes from failing nodes to healthy spare nodes, thereby providing resilience to the MPI application. Experimental evaluation clearly indicates that a single instance of FTB-IPMI can scale to several hundred nodes with a remarkably low resource-utilization footprint. A deployment of FTB-IPMI servicing a cluster with 128 compute nodes sweeps the entire cluster and collects IPMI sensor information on CPU temperature, system voltages, and fan speeds in about 0.75 seconds. The average CPU utilization of this service running on a single node is 0.35%.
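
FTB-IPMI's job is to turn raw sensor readings into structured events that other components can subscribe to. The loop below sketches that translation for a single CPU-temperature sensor: read_cpu_temp() stands in for the IPMI query, publish() stands in for the FTB publish call, and the 85 degree threshold and node names are assumptions, not values from the paper.

    #include <stdio.h>
    #include <time.h>

    /* A structured event, standing in for what FTB-IPMI would publish
     * on the Fault-Tolerance Backplane. */
    struct hw_event {
        time_t when;
        char   node[32];
        char   sensor[32];
        double value;
        int    severity;        /* 0 = info, 1 = warning, 2 = critical */
    };

    /* Placeholder for an IPMI sensor query against a node's BMC. */
    static double read_cpu_temp(const char *node)
    {
        (void)node;
        return 88.5;            /* simulated reading, degrees Celsius */
    }

    /* Placeholder for the publish side of a publish-subscribe backplane. */
    static void publish(const struct hw_event *ev)
    {
        printf("[%ld] %s %s=%.1f severity=%d\n",
               (long)ev->when, ev->node, ev->sensor, ev->value, ev->severity);
    }

    int main(void)
    {
        const char *nodes[] = { "node001", "node002" };
        const double warn_temp = 85.0;      /* assumed threshold */

        for (int i = 0; i < 2; i++) {       /* one sweep over the cluster */
            double t = read_cpu_temp(nodes[i]);
            if (t >= warn_temp) {
                struct hw_event ev = { time(NULL), "", "cpu_temp", t, 2 };
                snprintf(ev.node, sizeof(ev.node), "%s", nodes[i]);
                publish(&ev);   /* subscribers (e.g. a predictor) react */
            }
        }
        return 0;
    }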


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2011

High Performance Pipelined Process Migration with RDMA

Raghunath Rajachandrasekar; Xavier Besseron; Dhabaleswar K. Panda

Coordinated Checkpoint/Restart (C/R) is a widely deployed strategy to achieve fault tolerance. However, C/R by itself is not capable of meeting the demands of upcoming exascale systems, due to its heavy I/O overhead. Process migration has already been proposed in the literature as a proactive fault-tolerance mechanism to complement C/R. Several popular MPI implementations provide support for process migration, including MVAPICH2 and Open MPI, but these existing solutions do not yield satisfactory performance. In this paper, we conduct extensive profiling of several process migration mechanisms and reveal that inefficient I/O and network transfer are the principal factors responsible for the high overhead. We then propose a new approach, Pipelined Process Migration with RDMA (PPMR), to overcome these overheads. Our new protocol fully pipelines the data write, data transfer, and data read operations during the different phases of a migration cycle. PPMR aggregates data writes on the migration source node and transfers data to the target node via high-throughput RDMA transport. It implements an efficient process restart mechanism at the target node to restart processes from the RDMA data streams. We have implemented this Pipelined Process Migration protocol in MVAPICH2 and studied the performance benefits. Experimental results show that PPMR achieves a 10.7X speedup to complete a process migration over the conventional approach at a moderate (8 MB) memory usage. Process migration overhead on the application is significantly reduced, from 38% to 5%, by PPMR when three migrations are performed in succession.
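
PPMR overlaps the write, transfer, and read stages of a migration instead of running them back to back. The producer/consumer sketch below shows the double-buffering pattern behind that kind of overlap: one thread fills buffers (standing in for checkpointing the process image) while another drains them (standing in for the RDMA transfer). Buffer sizes and chunk counts are illustrative, and the real protocol's third stage (restart from the data stream) is not modeled.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Double buffering: the "writer" fills one buffer while the
     * "sender" drains the other, so the stages overlap in time. */
    #define BUF_SIZE (1 << 20)      /* assumed 1 MiB staging buffers */
    #define N_CHUNKS 16

    static char buffers[2][BUF_SIZE];
    static int  ready[2];                       /* buffer holds data   */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *sender(void *arg)              /* stands in for RDMA  */
    {
        (void)arg;
        for (int i = 0; i < N_CHUNKS; i++) {
            int b = i % 2;
            pthread_mutex_lock(&lock);
            while (!ready[b])
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            /* "Transfer" the staged chunk to the target node. */
            printf("sent chunk %d from buffer %d\n", i, b);

            pthread_mutex_lock(&lock);
            ready[b] = 0;                       /* hand buffer back    */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)                              /* acts as the writer  */
    {
        pthread_t tid;
        pthread_create(&tid, NULL, sender, NULL);

        for (int i = 0; i < N_CHUNKS; i++) {
            int b = i % 2;
            pthread_mutex_lock(&lock);
            while (ready[b])                    /* wait until drained  */
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            /* Stage the next piece of the process image. */
            memset(buffers[b], i, BUF_SIZE);

            pthread_mutex_lock(&lock);
            ready[b] = 1;
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        pthread_join(tid, NULL);
        return 0;
    }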


International Conference on Big Data | 2014

In-memory I/O and replication for HDFS with Memcached: Early experiences

Nusrat Sharmin Islam; Xiaoyi Lu; Md. Wasi-ur-Rahman; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

Hadoop is the de facto standard platform for large-scale data analytics applications. In spite of its high availability and reliability guarantees, the Hadoop Distributed File System (HDFS) suffers from huge I/O bottlenecks in storing the tri-replicated data blocks. The I/O overheads intrinsic to the HDFS architecture degrade application performance. In this paper, we present a novel design (MEM-HDFS) that performs intelligent caching and replication of HDFS data blocks in Memcached, which can significantly improve I/O performance. In this design, we consider different deployment strategies for the Memcached servers (local and remote) and guarantee persistence of the Memcached data to HDFS on cache replacements. Performance evaluations show that MEM-HDFS can increase the read and write throughput of HDFS by up to 3.9x and 3.3x, respectively. Our design can also significantly speed up the data-loading (to HDFS) phase. It reduces the execution times of data-generation benchmarks such as TeraGen, RandomTextWriter, and RandomWriter by up to 50%, 39%, and 48%, respectively. The performance of other benchmarks, such as TeraSort and Grep, is also improved by the proposed design.
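
MEM-HDFS caches HDFS blocks in Memcached and persists cached data back to HDFS when entries are replaced. The tiny cache below illustrates only that persistence-on-replacement rule: the in-process slot array stands in for Memcached, the local backing file stands in for HDFS, and the block names, sizes, and round-robin replacement policy are made up for the example.

    #include <stdio.h>
    #include <string.h>

    /* Tiny stand-in for a Memcached-backed block cache: a handful of
     * slots with round-robin replacement.  Whatever gets evicted is
     * persisted to the backing store first (a local file standing in
     * for HDFS). */
    #define SLOTS      4
    #define BLOCK_SIZE 4096

    struct slot { char name[64]; char data[BLOCK_SIZE]; int used; };

    static struct slot cache[SLOTS];
    static int next_victim;

    static void persist_to_hdfs(FILE *hdfs, const struct slot *s)
    {
        /* Placeholder for writing the block back through HDFS. */
        fprintf(hdfs, "%s\n", s->name);
        fwrite(s->data, 1, BLOCK_SIZE, hdfs);
    }

    static void cache_put(FILE *hdfs, const char *name, const char *data)
    {
        struct slot *s = &cache[next_victim];
        next_victim = (next_victim + 1) % SLOTS;

        if (s->used)
            persist_to_hdfs(hdfs, s);       /* persist on replacement */

        snprintf(s->name, sizeof(s->name), "%s", name);
        memcpy(s->data, data, BLOCK_SIZE);
        s->used = 1;
    }

    int main(void)
    {
        FILE *hdfs = fopen("hdfs_backing.bin", "wb");
        if (!hdfs)
            return 1;

        char block[BLOCK_SIZE];
        for (int i = 0; i < 6; i++) {       /* 6 blocks, 4 slots: 2 evictions */
            char name[64];
            snprintf(name, sizeof(name), "blk_%d", i);
            memset(block, i, sizeof(block));
            cache_put(hdfs, name, block);
        }
        fclose(hdfs);
        return 0;
    }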

Collaboration


Dive into Raghunath Rajachandrasekar's collaborations.

Top Co-Authors

Xiaoyi Lu

Ohio State University
