Rohan Garg | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Rohan Garg is active.

Explore More

Publication

Featured researches published by Rohan Garg.

international conference on cluster computing | 2013

Checkpoint-restart for a network of virtual machines

Rohan Garg; Komal Sodha; Zhengping Jin; Gene Cooperman

The ability to easily deploy parallel computations on the Cloud is becoming ever more important. The first uniform mechanism for checkpointing a network of virtual machines is described. This is important for the parallel versions of common productivity software. Potential examples of parallelism include Simulink for MATLAB, parallel R for the R statistical modelling language, parallel blast.py for the BLAST bioinformatics software, IPython.parallel for Python, and GNU parallel for parallel shells. The checkpoint mechanism is implemented as a plugin in the DMTCP checkpoint-restart package. It operates on KVM/QEMU, and has also been adapted to Lguest and pure user-space QEMU. The plugin is surprisingly compact, comprising just 400 lines of code to checkpoint a single virtual machine, and 200 lines of code for a plugin to support saving and restoring network state. Incremental checkpoints of the associated virtual filesystem are accommodated through the Btrfs filesystem. Experiments demonstrate checkpoint times of a fraction of a second by using forked checkpointing, mmap-based restart, and incremental Btrfs-based snapshots.

international conference on cluster computing | 2012

Towards Fault-Tolerant Energy-Efficient High Performance Computing in the Cloud

Kurt L. Keville; Rohan Garg; David J. Yates; Kapil Arya; Gene Cooperman

In cluster computing, power and cooling represent a significant cost compared to the hardware itself. This is of special concern in the cloud, which provides access to large numbers of computers. We examine the use of ARM-based clusters for low-power, high performance computing. This work examines two likely use-modes: (i) a standard dedicated cluster, and (ii) a cluster of pre-configured virtual machines in the cloud. A 40-node department-level cluster based on an ARM Cortex-A9 is compared against a similar cluster based on an Intel Core2 Duo, in contrast to a recent similar study on just a 4-node cluster. For the NAS benchmarks on 32-node clusters, ARM was found to have a power efficiency ranging from 1.3 to 6.2 times greater than that of Intel. This is despite Intels approximately five times greater performance. The particular efficiency ratio depends primarily on the size of the working set relative to L2 cache. In addition to energy-efficient computing, this study also emphasizes fault tolerance: an important ingredient in high performance computing. It relies on two recent extensions to the DMTCP checkpoint-restart package. DMTCP was extended (i) to support ARM CPUs, and (ii) to support check pointing of the Qemu virtual machine in user-mode. DMTCP is used both to checkpoint native distributed applications, and to checkpoint a network of virtual machines. This latter case demonstrates the ability to deploy pre-configured software in virtual machines hosted in the cloud, and further to migrate cluster computation between hosts in the cloud.

international conference on cluster computing | 2016

Design and Implementation for Checkpointing of Distributed Resources Using Process-Level Virtualization

Kapil Arya; Rohan Garg; Artem Y. Polyakov; Gene Cooperman

System-level checkpoint-restart is a critical technology for long-running jobs in high-performance computing. Yet, only two approaches to checkpointing MPI applications continue to survive in wide use today. One approach is to use the kernel module-based BLCR in combination with an MPI checkpoint-restart service particular to the MPI implementation in use. Unfortunately, this lacks support for some important Linux system services such as SysV IPC (e.g., shared memory objects). A second approach has been to use the original 2009 DMTCP implementation (herein referred to as DMTCP-09) for transparent, system-level checkpointing. Unfortunately, DMTCP-09 lacked support for checkpointing many of the necessary features found by MPI in a modern batch environment. These include: ssh, the InfiniBand network, process migration (restarting an MPI application on different cluster nodes), and modified file path prefixes on restart (typically due to a changing current directory, mount points, library paths, etc.). This work presents DMTCP-PV, a new user-space transparent checkpointing system based on the concept of process virtualization. This approach separately models the state of each local or distributed subsystem while decoupling it from the core checkpointing engine. By separating these concerns, a domain expert can extend checkpointing into a new domain without any knowledge of the core checkpointing engine. This allowed DMTCP-PV to address the deficiencies noted above and many others. It is shown that the runtime overhead of DMTCP-PV is generally less than 1%, and the checkpointing time is dominated by the time to write an image file to stable storage.

international conference on communication systems and network technologies | 2012

Optimal Time-Table Generation by Hybridized Bacterial Foraging and Genetic Algorithms

Om Prakash Verma; Rohan Garg; Vikram Singh Bisht

Timetable scheduling is a highly constrained combinatorial NP-hard problem as has been described in the literature. A lot of constraints need to be accommodated for development of an efficient algorithm. This paper presents a hybrid approach to time table scheduling problem using bacterial foraging and genetic algorithm techniques. In the proposed algorithm, a bacterium represents a point in n-dimensional search space where each point is a potential solution to the timetable problem. The foraging behavior of E. Coli bacteria is simulated to search for an optimal solution. Genetic algorithm is used at the chemo taxis stage to give sense of biased-movement to the bacteria. Simulation results indicate that the proposed algorithm performs better as compared to the algorithms available in literature.

international conference on parallel and distributed systems | 2016

System-Level Scalable Checkpoint-Restart for Petascale Computing

Jiajun Cao; Kapil Arya; Rohan Garg; L. Shawn Matott; Dhabaleswar K. Panda; Hari Subramoni; Jérôme Vienne; Gene Cooperman

Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.

Workshop on OpenSHMEM and Related Technologies | 2016

System-Level Transparent Checkpointing for OpenSHMEM

Rohan Garg; Jérôme Vienne; Gene Cooperman

Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we present the first approach using system-level transparent checkpointing. This complements an existing approach based on application-level checkpointing. Application-level checkpointing has advantages for algorithm-based fault tolerance, while transparent checkpointing can be invoked by the system at an arbitrary time. Unlike the earlier application-level work of Hao et al., this system-level approach creates checkpoint images in stable storage, thus enabling restart at a later time or even process migration. An experimental evaluation is presented using NAS NPB benchmarks for OpenSHMEM. In order to support this work, The design of DMTCP (Distributed MultiThreaded CheckPointing) was extended to support shared memory regions in the absence of virtual memory.

Protein Science | 2018

Functional classification of protein structures by local structure matching in graph representation: Protein Function Prediction with Local Graphs

Caitlyn L. Mills; Rohan Garg; Joslynn S. Lee; Liang Tian; Alexander I. Suciu; Gene Cooperman; Penny J. Beuning; Mary Jo Ondrechen

As a result of high‐throughput protein structure initiatives, over 14,400 protein structures have been solved by Structural Genomics (SG) centers and participating research groups. While the totality of SG data represents a tremendous contribution to genomics and structural biology, reliable functional information for these proteins is generally lacking. Better functional predictions for SG proteins will add substantial value to the structural information already obtained. Our method described herein, Graph Representation of Active Sites for Prediction of Function (GRASP‐Func), predicts quickly and accurately the biochemical function of proteins by representing residues at the predicted local active site as graphs rather than in Cartesian coordinates. We compare the GRASP‐Func method to our previously reported method, Structurally Aligned Local Sites of Activity (SALSA), using the Ribulose Phosphate Binding Barrel (RPBB), 6‐Hairpin Glycosidase (6‐HG), and Concanavalin A‐like Lectins/Glucanase (CAL/G) superfamilies as test cases. In each of the superfamilies, SALSA and the much faster method GRASP‐Func yield similar correct classification of previously characterized proteins, providing a validated benchmark for the new method. In addition, we analyzed SG proteins using our SALSA and GRASP‐Func methods to predict function. Forty‐one SG proteins in the RPBB superfamily, nine SG proteins in the 6‐HG superfamily, and one SG protein in the CAL/G superfamily were successfully classified into one of the functional families in their respective superfamily by both methods. This improved, faster, validated computational method can yield more reliable predictions of function that can be used for a wide variety of applications by the community.

Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale | 2016

Extended Batch Sessions and Three-Phase Debugging: Using DMTCP to Enhance the Batch Environment

Rohan Garg; Jiajun Cao; Kapil Arya; Gene Cooperman; Jérôme Vienne

Batch environments are notoriously unfriendly because its not easy to interactively diagnose the health of a job. A job may be terminated without warning when it reaches the end of an allotted runtime slot, or it may terminate even sooner due to an unsuspected bug that occurs only at large scale. Two strategies are proposed that take advantage of DMTCP (Distributed MultiThreaded CheckPointing) for system-level checkpointing. First, we describe a three-phase debugging strategy that permits one to interactively debug long-running MPI applications that were developed for non-interactive batch environments. Second, we review how to use the SLURM resource manager capability to easily implement extended batch sessions that overcome the typical limitation of 24 hours maximum for a single batch job on large HPC resources. We argue for greater use of this lesser known capability, as a means to remove the necessity for the application-specific checkpointing found in many long-running jobs.

arXiv: Operating Systems | 2012