Dipti Shankar
Ohio State University
Publications
Featured research published by Dipti Shankar.
High-Performance Interconnects | 2014
Xiaoyi Lu; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Dipti Shankar; Dhabaleswar K. Panda
Apache Hadoop MapReduce has been highly successful in processing large-scale, data-intensive batch applications on commodity clusters. However, for low-latency interactive applications and iterative computations, Apache Spark, an emerging in-memory processing framework, has been stealing the limelight. Recent studies have shown that current-generation Big Data frameworks (like Hadoop) cannot efficiently leverage advanced features (e.g., RDMA) on modern clusters with high-performance networks. One of the major bottlenecks is that these middleware are traditionally written over sockets and do not deliver the best performance on modern HPC systems with RDMA-enabled high-performance interconnects. In this paper, we first assess the opportunities of bringing the benefits of RDMA into the Spark framework. We further propose a high-performance RDMA-based design for accelerating data shuffle in the Spark framework on high-performance networks. Performance evaluations show that our proposed design can achieve 79-83% performance improvement for GroupBy, compared with the default Spark running with IP over InfiniBand (IPoIB) FDR on a 128-256 core cluster. We adopt a plug-in-based approach so that our design can be easily integrated with newer Spark releases. To the best of our knowledge, this is the first design for accelerating Spark with RDMA for Big Data processing.
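For orientation, here is a minimal sketch of how such a plug-in could be exercised from application code, assuming the standard spark.shuffle.manager setting is used to select the shuffle implementation; the RDMA shuffle-manager class name is a placeholder rather than the paper's actual module, and the plug-in jar would need to be on the classpath at submission time.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GroupByShuffleBench {
    public static void main(String[] args) {
        // Spark's shuffle implementation is selected through configuration;
        // the class below is a placeholder for an RDMA-based shuffle-manager
        // plug-in, not a real class name from the paper.
        SparkConf conf = new SparkConf()
                .setAppName("GroupByShuffleBench")
                .set("spark.shuffle.manager", "org.example.shuffle.RdmaShuffleManager");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                data.add(i);
            }
            JavaRDD<Integer> rdd = sc.parallelize(data, 128);

            long start = System.nanoTime();
            // GroupBy forces an all-to-all shuffle, the phase the RDMA design targets.
            long groups = rdd.mapToPair(x -> new Tuple2<>(x % 1000, x))
                             .groupByKey()
                             .count();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("groups=" + groups + ", stage time=" + elapsedMs + " ms");
        }
    }
}
```

Because the transport is swapped purely through configuration, the benchmark body stays identical when comparing the IPoIB baseline against an RDMA-enabled shuffle, which is what makes the plug-in approach attractive.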
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015
Nusrat Sharmin Islam; Xiaoyi Lu; Md. Wasi-ur-Rahman; Dipti Shankar; Dhabaleswar K. Panda
HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
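As a rough illustration of the placement idea only (not the paper's actual policies), the sketch below greedily places blocks on the fastest storage tier with spare capacity and spills to the shared Lustre file system when local space runs out; the tier capacities, block size, and spill-down order are invented for the example.

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Illustrative sketch of a hybrid (RAM/SSD/HDD/Lustre) placement decision,
 * loosely modeled on the idea behind heterogeneous data placement in Triple-H.
 * The thresholds, tier capacities, and the greedy spill-down order are
 * assumptions made for illustration; the actual policies may differ.
 */
public class HybridPlacementSketch {

    enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    // Remaining capacity per tier, in bytes (made-up numbers for the sketch).
    private final Map<Tier, Long> freeBytes = new EnumMap<>(Tier.class);

    public HybridPlacementSketch() {
        freeBytes.put(Tier.RAM_DISK, 8L << 30);     //   8 GiB of RAM disk
        freeBytes.put(Tier.SSD,     256L << 30);    // 256 GiB of SSD
        freeBytes.put(Tier.HDD,       2L << 40);    //   2 TiB of HDD
        freeBytes.put(Tier.LUSTRE, Long.MAX_VALUE); // shared parallel file system
    }

    /**
     * Greedy spill-down: place a block on the fastest tier that still has room,
     * falling back to the shared Lustre installation as the last resort.
     */
    public synchronized Tier placeBlock(long blockSize) {
        for (Tier tier : Tier.values()) {
            long free = freeBytes.get(tier);
            if (free >= blockSize) {
                freeBytes.put(tier, free - blockSize);
                return tier;
            }
        }
        return Tier.LUSTRE; // unreachable given LUSTRE's capacity, kept for clarity
    }

    public static void main(String[] args) {
        HybridPlacementSketch placer = new HybridPlacementSketch();
        long block = 128L << 20; // 128 MiB HDFS-style block
        for (int i = 0; i < 100; i++) {
            System.out.println("block " + i + " -> " + placer.placeBlock(block));
        }
    }
}
```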
International Conference on Big Data | 2015
Nusrat Sharmin Islam; Md. Wasi-ur-Rahman; Xiaoyi Lu; Dipti Shankar; Dhabaleswar K. Panda
For data-intensive computing, the low throughput of existing disk-bound storage systems is a major bottleneck. The recent emergence of in-memory file systems with heterogeneous storage support mitigates this problem to a great extent. Parallel programming frameworks, e.g., Hadoop MapReduce and Spark, are increasingly being run on such high-performance file systems. However, no comprehensive study has been done to analyze the impact of in-memory file systems on various Big Data applications. This paper characterizes two file systems from the literature, Tachyon [17] and Triple-H [13], that support in-memory and heterogeneous storage, and discusses the impacts of these two architectures on the performance and fault tolerance of Hadoop MapReduce and Spark applications. We present a complete methodology for evaluating MapReduce and Spark workloads on top of in-memory file systems and provide insights into the interactions of different system components while running these workloads. We also propose advanced acceleration techniques to adapt Triple-H for iterative applications and study the impact of different parameters on the performance of MapReduce and Spark jobs on HPC systems. Our evaluations show that, although Tachyon is 5x faster than HDFS for primitive operations, Triple-H performs 47% and 2.4x better than Tachyon for MapReduce and Spark workloads, respectively. Triple-H also accelerates K-Means by 15% over HDFS and 9% over Tachyon.
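A minimal harness for the "primitive operations" part of such a characterization, assuming both stores are reached through Hadoop's generic FileSystem API (as HDFS and Tachyon/Alluxio's HDFS-compatible client both allow), might time a sequential write followed by a sequential read; the URI, file size, and buffer size below are placeholders, not values from the paper.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Times primitive sequential write/read operations against any FileSystem URI. */
public class PrimitiveOpBench {
    public static void main(String[] args) throws Exception {
        URI uri = URI.create(args.length > 0 ? args[0] : "hdfs://namenode:9000/");
        Path file = new Path("/tmp/primitive-op-bench.dat");
        byte[] buf = new byte[4 * 1024 * 1024];   // 4 MiB I/O buffer
        long totalBytes = 1L << 30;               // write/read 1 GiB

        FileSystem fs = FileSystem.get(uri, new Configuration());

        long t0 = System.nanoTime();
        try (FSDataOutputStream out = fs.create(file, true)) {
            for (long written = 0; written < totalBytes; written += buf.length) {
                out.write(buf);
            }
        }
        long writeMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        try (FSDataInputStream in = fs.open(file)) {
            while (in.read(buf) != -1) { /* drain the stream */ }
        }
        long readMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("write: %d ms, read: %d ms%n", writeMs, readMs);
        fs.delete(file, false);
    }
}
```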
International Conference on Parallel Processing | 2015
Nusrat Sharmin Islam; Dipti Shankar; Xiaoyi Lu; Md. Wasi-ur-Rahman; Dhabaleswar K. Panda
Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well known for its scalability and reliability, the requirement of a large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installations of parallel file systems like Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high-performance key-value store. We design a burst buffer system using RDMA-based Memcached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data locality, and fault tolerance. Our proposed schemes can ensure performance improvement for Big Data applications on HPC clusters. At the same time, they lead to a reduced local storage requirement. Performance evaluations show that our design can improve the write performance of TestDFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.
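Conceptually, one such scheme might stage writes in the key-value buffer layer and persist them to Lustre in the background. The sketch below illustrates this with a plain in-memory map standing in for the RDMA-based Memcached layer; the class names, flush policy, and mount path are assumptions for illustration, not details from the paper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Buffered write path: fast in-memory put, asynchronous flush to the parallel FS. */
public class BurstBufferSketch implements AutoCloseable {
    private final Map<String, byte[]> buffer = new ConcurrentHashMap<>();
    private final ExecutorService flusher = Executors.newFixedThreadPool(4);
    private final Path lustreDir;

    public BurstBufferSketch(String lustreMount) throws IOException {
        this.lustreDir = Files.createDirectories(Paths.get(lustreMount));
    }

    /** Buffered write: low-latency put into the buffer, lazy persist to Lustre. */
    public CompletableFuture<Void> writeBlock(String blockId, byte[] data) {
        buffer.put(blockId, data);                  // fast write path
        return CompletableFuture.runAsync(() -> {   // background persistence
            try {
                Files.write(lustreDir.resolve(blockId), data);
                buffer.remove(blockId);             // evict once durable
            } catch (IOException e) {
                throw new RuntimeException("flush failed for " + blockId, e);
            }
        }, flusher);
    }

    /** Read path: serve from the buffer if present, else fall back to Lustre. */
    public byte[] readBlock(String blockId) throws IOException {
        byte[] cached = buffer.get(blockId);
        return cached != null ? cached : Files.readAllBytes(lustreDir.resolve(blockId));
    }

    @Override
    public void close() {
        flusher.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (BurstBufferSketch bb = new BurstBufferSketch("/tmp/lustre-mock")) {
            bb.writeBlock("blk_0001", "hello burst buffer".getBytes()).join();
            System.out.println(new String(bb.readBlock("blk_0001")));
        }
    }
}
```

The locality- or fault-tolerance-oriented schemes mentioned in the abstract would differ mainly in where the block is buffered and when it is considered durable; this sketch shows only the write-back flavor.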
Architectural Support for Programming Languages and Operating Systems | 2014
Dipti Shankar; Xiaoyi Lu; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Dhabaleswar K. Panda
Hadoop MapReduce is increasingly being used by many data centers (e.g., Facebook, Yahoo!) because of its simplicity, productivity, scalability, and fault tolerance. For MapReduce applications, achieving low job execution time is critical. Since a majority of existing clusters today are equipped with modern, high-speed interconnects such as InfiniBand and 10 GigE, which offer high bandwidth and low communication latency, it is essential to study the impact of network configuration on the communication patterns of MapReduce jobs. However, a standardized benchmark suite that focuses on helping users evaluate the performance of the stand-alone Hadoop MapReduce component is not available in the current Apache Hadoop community. In this paper, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce with different intermediate data distribution patterns, varied key/value sizes, and data types. We also show how this micro-benchmark suite can be used to evaluate the performance of Hadoop MapReduce over different networks/protocols and parameter configurations on modern clusters. The micro-benchmark suite is designed to be compatible with both Hadoop 1.x and Hadoop 2.x.
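To make the idea of tunable workload shape concrete, a generator in this spirit might expose the record count, key/value sizes, and a uniform-versus-skewed key distribution as knobs, as in the hypothetical sketch below; the parameter names and the simple power-law skew are illustrative and are not the suite's actual interface.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

/** Generates synthetic tab-separated key/value records with tunable size and skew. */
public class KVWorkloadGenerator {
    public static void main(String[] args) throws IOException {
        long records     = 100_000;  // number of key/value pairs
        int keyBytes     = 10;       // key size
        int valueBytes   = 100;      // value size
        boolean skewed   = true;     // uniform vs. skewed intermediate keys
        int distinctKeys = 1_000;    // cardinality of the key space

        Random rnd = new Random(42);
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("kv-input.txt"))) {
            for (long i = 0; i < records; i++) {
                int keyId = skewed
                        // crude power-law skew: low key ids occur far more often
                        ? (int) (distinctKeys * Math.pow(rnd.nextDouble(), 3))
                        : rnd.nextInt(distinctKeys);
                String key = pad(Integer.toString(keyId), keyBytes);
                String value = pad(Long.toString(i), valueBytes);
                out.write(key + "\t" + value);
                out.newLine();
            }
        }
    }

    /** Right-pads (or truncates) a string to exactly n characters. */
    private static String pad(String s, int n) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < n) {
            sb.append('x');
        }
        return sb.substring(0, n);
    }
}
```

Skewed intermediate keys stress a few reducers and their network paths, while the uniform setting spreads shuffle traffic evenly, which is exactly the kind of contrast a network-focused MapReduce micro-benchmark needs to expose.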
International Conference on Big Data | 2016
Xiaoyi Lu; Dipti Shankar; Shashank Gugnani; Dhabaleswar K. Panda
The in-memory data processing framework Apache Spark has been stealing the limelight for low-latency interactive applications and iterative and batch computations. Our early-experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) on high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of its shuffle phase. As the Apache Spark ecosystem evolves rapidly, the Spark architecture has been changing significantly. This motivates us to investigate whether the earlier RDMA design can be adapted and further enhanced for the new Apache Spark architecture. We also aim to improve the performance of various Spark workloads (e.g., batch, graph, SQL). In this paper, we present a detailed design for high-performance RDMA-based Apache Spark on high-performance networks. We conduct systematic performance evaluations on three modern clusters (Chameleon, SDSC Comet, and an in-house cluster) with cutting-edge InfiniBand technologies, such as the latest IB EDR (100 Gbps) network and the recently introduced Single Root I/O Virtualization (SR-IOV) technology for IB. The evaluation results show that, compared to the default Spark running with IP over InfiniBand (IPoIB), our proposed design can achieve up to 79% performance improvement for Spark RDD operation benchmarks (e.g., GroupBy, SortBy), up to 38% for batch workloads (e.g., Sort and TeraSort in Intel HiBench), up to 46% for graph processing workloads (e.g., PageRank), and up to 32% for SQL queries (e.g., Aggregation, Join) at varied scales (up to 1,536 cores) of bare-metal IB clusters. Performance evaluations on SR-IOV-enabled IB clusters also show a 37% improvement achieved by our RDMA-based design. Our RDMA-based Spark design is implemented as a pluggable module and does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community. To show this, we further evaluate the performance of a combined 'RDMA-Spark+RDMA-HDFS' version, and the numbers show that the combination can achieve the best performance, with up to 82% improvement for Intel HiBench Sort and TeraSort on the SDSC Comet cluster.
International Conference on Big Data | 2016
Dipti Shankar; Xiaoyi Lu; Dhabaleswar K. Panda
The limitation of local storage space in HPC environments has placed an unprecedented demand on the performance of the underlying shared parallel file systems. This has necessitated a scalable solution for running Big Data middleware (e.g., Hadoop) on HPC clusters. In this paper, we propose Boldio, a hybrid and resilient key-value store-based burst-buffer system over Lustre for accelerating I/O-intensive Big Data workloads, which can leverage RDMA on high-performance interconnects and storage technologies such as PCIe-/NVMe-SSDs. We demonstrate that Boldio can improve the performance of the I/O phase of Hadoop workloads running on HPC clusters by serving as a lightweight, high-performance, and resilient remote I/O staging layer between the application and Lustre. Performance evaluations show that Boldio can improve TestDFSIO write performance over Lustre by up to 3x and TestDFSIO read performance by 7x, while reducing the execution time of the Hadoop Sort benchmark by up to 30%. We also demonstrate that it can significantly improve Hadoop I/O throughput over popular in-memory distributed storage systems such as Alluxio (formerly Tachyon) when high-speed local storage is limited.
IEEE International Conference on Cloud Computing Technology and Science | 2016
Xiaoyi Lu; Dipti Shankar; Shashank Gugnani; Hari Subramoni; Dhabaleswar K. Panda
The performance of Hadoop components can be significantly improved by leveraging advanced features such as Remote Direct Memory Access (RDMA) on modern HPC clusters, where high-performance networks like InfiniBand (IB) and RoCE have been deployed widely. With the emergence of high-performance computing in the cloud (HPC Cloud), high-performance networks have paved their way into the cloud with the recently introduced Single Root I/O Virtualization (SR-IOV) technology. With these advancements in HPC Cloud networking, it is timely to investigate the design opportunities and the impact of networking architectures (different generations of IB, 40 GigE, 40G-RoCE) and protocols (TCP/IP, IPoIB, RC, UD, Hybrid) in accelerating Hadoop components over high-performance networks. In this paper, we propose a network-architecture- and multi-protocol-aware Hadoop RPC design that can take advantage of the RC and UD protocols for IB and RoCE. A hybrid transport design with RC and UD is proposed, which can deliver both memory scalability and performance for Hadoop RPC. We present a comprehensive performance analysis on five bare-metal IB/RoCE clusters and one SR-IOV-enabled cluster in the Chameleon Cloud. Our performance evaluations reveal that the proposed designs can achieve up to 12.5x performance improvement for Hadoop RPC over IPoIB. Further, we integrate our RPC engine into Apache HBase and demonstrate that we can accelerate YCSB workloads by up to 3.6x. Other insightful observations on the performance characteristics of different HPC Cloud networking technologies are also shared in this paper.
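To illustrate the hybrid-transport idea, the sketch below encodes one plausible selection rule: use RC for large messages while an RC queue-pair budget remains, and fall back to UD otherwise to bound per-connection memory. The size threshold and connection cap are invented for the example and are not the paper's actual policy.

```java
/**
 * Conceptual sketch of a hybrid-transport selection rule for an RPC engine:
 * unreliable datagram (UD) queue pairs keep per-connection memory low and suit
 * small messages, while reliable connection (RC) queue pairs favor large
 * transfers. Threshold and cap values are illustrative assumptions.
 */
public class HybridTransportSelector {

    enum Transport { RC, UD }

    private final int rcMessageThresholdBytes; // prefer RC at or above this size
    private final int maxRcConnections;        // cap RC QPs to bound memory use

    public HybridTransportSelector(int rcMessageThresholdBytes, int maxRcConnections) {
        this.rcMessageThresholdBytes = rcMessageThresholdBytes;
        this.maxRcConnections = maxRcConnections;
    }

    public Transport select(int messageBytes, int activeRcConnections) {
        boolean large = messageBytes >= rcMessageThresholdBytes;
        boolean rcBudgetLeft = activeRcConnections < maxRcConnections;
        return (large && rcBudgetLeft) ? Transport.RC : Transport.UD;
    }

    public static void main(String[] args) {
        HybridTransportSelector sel = new HybridTransportSelector(8 * 1024, 256);
        System.out.println("1 KiB RPC             -> " + sel.select(1024, 10));        // UD
        System.out.println("64 KiB RPC            -> " + sel.select(64 * 1024, 10));   // RC
        System.out.println("64 KiB, QPs exhausted -> " + sel.select(64 * 1024, 300));  // UD
    }
}
```

The appeal of such a hybrid is that UD keeps memory flat as the number of clients grows, while RC is still used where its reliability and bandwidth matter most.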
International Symposium on Performance Analysis of Systems and Software | 2015
Dipti Shankar; Xiaoyi Lu; Jithin Jose; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Dhabaleswar K. Panda
With the widespread adoption of social networking services in the Web 2.0/3.0 era, leveraging a distributed and scalable caching layer like Memcached is often invaluable to application server performance. Since a majority of existing clusters today are equipped with modern high-speed interconnects such as InfiniBand, which offer high-bandwidth and low-latency communication, there is potential to improve the response time and throughput of application servers by taking advantage of advanced features like RDMA. We explore the potential of employing RDMA to improve the performance of Online Data Processing (OLDP) workloads on MySQL using Memcached for real-world web applications.
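The usual pattern in this setting is cache-aside: the application checks Memcached first and falls back to MySQL on a miss, then populates the cache. The sketch below shows this read path with the plain spymemcached client standing in for an RDMA-capable Memcached client; host names, credentials, the table/column names, and the 300-second TTL are placeholders.

```java
import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import net.spy.memcached.MemcachedClient;

/** Minimal cache-aside lookup: Memcached first, MySQL on a miss, then cache-fill. */
public class CacheAsideLookup {
    private static final int TTL_SECONDS = 300;

    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://mysql-host:3306/appdb", "appuser", "secret");

        System.out.println(lookupUserName(cache, db, 42));

        cache.shutdown();
        db.close();
    }

    static String lookupUserName(MemcachedClient cache, Connection db, int userId)
            throws Exception {
        String key = "user:" + userId + ":name";

        Object cached = cache.get(key);          // 1. try the caching layer
        if (cached != null) {
            return (String) cached;
        }

        String name = null;                      // 2. miss: query MySQL
        try (PreparedStatement ps =
                     db.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            ps.setInt(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    name = rs.getString("name");
                }
            }
        }

        if (name != null) {                      // 3. populate the cache
            cache.set(key, TTL_SECONDS, name);
        }
        return name;
    }
}
```

On a cache hit the request never touches the database, so the latency of the Memcached round trip, and hence the interconnect and transport underneath it, dominates OLDP response time.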
International Conference on Distributed Computing Systems | 2017
Dipti Shankar; Xiaoyi Lu; Dhabaleswar K. Panda
Distributed key-value store-based caching solutions are being increasingly used to accelerate Big Data applications on modern HPC clusters. This has necessitated incorporating fault-tolerance capabilities into high-performance key-value stores such as Memcached, which are otherwise volatile in nature. In-memory replication is used as the primary mechanism to ensure resilient data operations, but it incurs increased network I/O and high remote memory requirements. On the other hand, Erasure Coding is being extensively explored for enabling data resilience while achieving better storage efficiency. In this paper, we first perform an in-depth modeling-based analysis of the performance trade-offs of In-Memory Replication and Erasure Coding schemes for key-value stores, and explore the possibilities of employing Online Erasure Coding to enable resilience in high-performance key-value stores for HPC clusters. We then design a non-blocking API-based engine that performs efficient Set/Get operations by overlapping the encoding/decoding involved in Erasure Coding-based resilience with the request/response phases, leveraging RDMA on high-performance interconnects. Performance evaluations show that the proposed designs can outperform synchronous RDMA-based replication by about 2.8x, and can improve YCSB throughput and average read/write latencies by about 1.34x-2.6x over asynchronous replication for larger key-value pair sizes (>16 KB). We also demonstrate its benefits by incorporating it into a hybrid and resilient key-value store-based burst-buffer system over Lustre for accelerating Big Data I/O on HPC clusters.
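As a rough sketch of the overlap idea, the code below splits a value into data shards, computes a single XOR parity shard as a stand-in for a real erasure code (so it tolerates only one lost shard), and dispatches each shard asynchronously as soon as it is ready, so encoding overlaps with the requests already in flight; the in-memory shard "servers", the shard count, and the thread pool are illustrative assumptions rather than details of the proposed engine.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Non-blocking erasure-coded Set sketch: encoding overlapped with shard dispatch. */
public class NonBlockingEcSetSketch {
    private static final int K = 4; // data shards; one XOR parity shard is added

    private final List<Map<String, byte[]>> servers = new ArrayList<>(); // K+1 stand-in servers
    private final ExecutorService pool = Executors.newFixedThreadPool(K + 1);

    public NonBlockingEcSetSketch() {
        for (int i = 0; i <= K; i++) {
            servers.add(new ConcurrentHashMap<>());
        }
    }

    /** Returns a future that completes once all K+1 shards have been stored. */
    public CompletableFuture<Void> setAsync(String key, byte[] value) {
        int shardLen = (value.length + K - 1) / K;
        byte[] parity = new byte[shardLen];
        CompletableFuture<?>[] pending = new CompletableFuture<?>[K + 1];

        for (int i = 0; i < K; i++) {
            int from = Math.min(i * shardLen, value.length);
            int to = Math.min((i + 1) * shardLen, value.length);
            byte[] shard = Arrays.copyOf(Arrays.copyOfRange(value, from, to), shardLen);
            for (int b = 0; b < shardLen; b++) {
                parity[b] ^= shard[b];               // accumulate XOR parity
            }
            final int server = i;
            // Ship this data shard immediately; later shards keep encoding
            // while earlier ones are already "on the wire".
            pending[i] = CompletableFuture.runAsync(
                    () -> servers.get(server).put(key, shard), pool);
        }
        pending[K] = CompletableFuture.runAsync(
                () -> servers.get(K).put(key, parity), pool);
        return CompletableFuture.allOf(pending);
    }

    public static void main(String[] args) {
        NonBlockingEcSetSketch store = new NonBlockingEcSetSketch();
        byte[] value = "resilient key-value pair".getBytes(StandardCharsets.UTF_8);
        store.setAsync("k1", value).join();          // the caller blocks only here
        System.out.println("stored " + (K + 1) + " shards for key k1");
        store.pool.shutdown();
    }
}
```

A Get would mirror this shape: issue all shard fetches at once and start decoding as soon as enough shards have arrived, which is where the overlap between decoding and the response phase comes from.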