Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nusrat Sharmin Islam is active.

Publication


Featured research published by Nusrat Sharmin Islam.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

High performance RDMA-based design of HDFS over InfiniBand

Nusrat Sharmin Islam; Md. Wasi-ur Rahman; Jithin Jose; Raghunath Rajachandrasekar; Hao Wang; Hari Subramoni; Chet Murthy; Dhabaleswar K. Panda

Hadoop Distributed File System (HDFS) acts as the primary storage of Hadoop and has been adopted by reputed organizations (Facebook, Yahoo!, etc.) due to its portability and fault tolerance. The existing implementation of HDFS uses the Java socket interface for communication, which delivers suboptimal performance in terms of latency and throughput. For data-intensive applications, network performance becomes a key component as the amount of data being stored and replicated to HDFS increases. In this paper, we present a novel design of HDFS using Remote Direct Memory Access (RDMA) over InfiniBand via JNI interfaces. Experimental results show that, for 5 GB HDFS file writes, the new design reduces the communication time by 87% and 30% over 1 Gigabit Ethernet (1GigE) and IP-over-InfiniBand (IPoIB), respectively, on a QDR platform (32 Gbps). For HBase, the Put operation performance is improved by 26% with our design. To the best of our knowledge, this is the first design of HDFS over InfiniBand networks.
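
A minimal sketch of the kind of JNI hand-off the abstract describes: HDFS-side Java code passing a packet to a native RDMA library instead of writing it to a Java socket. The class, method, and library names (RdmaBridge, rdmaConnect, rdmaWrite, hdfsrdma) are hypothetical illustrations, not the authors' implementation.

```java
// Hypothetical JNI bridge: Java-side declarations backed by a native RDMA library.
import java.nio.ByteBuffer;

public class RdmaBridge {
    static {
        System.loadLibrary("hdfsrdma"); // hypothetical native library name
    }

    // Native entry points implemented over RDMA verbs in C (not shown here).
    public static native long rdmaConnect(String host, int port);
    public static native int rdmaWrite(long connHandle, ByteBuffer packet, int length);
    public static native void rdmaClose(long connHandle);

    // Example: send one HDFS packet over RDMA instead of a Java socket.
    public static void sendPacket(String dataNodeHost, int port, byte[] packet) {
        long conn = rdmaConnect(dataNodeHost, port);
        try {
            // A direct buffer lets the native side read the payload without an
            // extra copy across the JNI boundary.
            ByteBuffer buf = ByteBuffer.allocateDirect(packet.length);
            buf.put(packet).flip();
            rdmaWrite(conn, buf, packet.length);
        } finally {
            rdmaClose(conn);
        }
    }
}
```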


International Conference on Parallel Processing | 2011

Memcached Design on High Performance RDMA Capable Interconnects

Jithin Jose; Hari Subramoni; Miao Luo; Minjia Zhang; Jian Huang; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hao Wang; Sayantan Sur; Dhabaleswar K. Panda

Memcached is a key-value distributed memory object caching system. It is used widely in the data-center environment for caching results of database calls, API calls, or any other data. Using Memcached, spare memory in data-center servers can be aggregated to speed up lookups of frequently accessed information. The performance of Memcached is directly related to the underlying networking technology, as workloads are often latency sensitive. The existing Memcached implementation is built upon the BSD Sockets interface. Sockets offer byte-stream-oriented semantics; therefore, using Sockets requires a conversion between Memcached's memory-object semantics and Sockets' byte-stream semantics, imposing an overhead. This is in addition to any extra memory copies in the Sockets implementation within the OS. Over the past decade, high performance interconnects have employed Remote Direct Memory Access (RDMA) technology to provide excellent performance for the scientific computation domain. In addition to its high raw performance, the memory-based semantics of RDMA fits very well with Memcached's memory-object model. While the Sockets interface can be ported to use RDMA, it is not very efficient when compared with low-level RDMA APIs. In this paper, we describe a novel design of Memcached for RDMA-capable networks. Our design extends the existing open-source Memcached software and makes it RDMA capable. We provide a detailed performance comparison of our Memcached design against unmodified Memcached using Sockets over RDMA and a 10 Gigabit Ethernet network with hardware-accelerated TCP/IP. Our performance evaluation reveals that the latency of a 4 KB Memcached Get can be brought down to 12 µs using ConnectX InfiniBand QDR adapters. Latency of the same operation using older-generation DDR adapters is about 20 µs. These numbers are about a factor of four better than the performance obtained by using 10GigE with TCP offload. In addition, these latencies of Get requests over a range of message sizes are better by a factor of five to ten compared to IP over InfiniBand and the Sockets Direct Protocol over InfiniBand. Further, throughput of small Get operations can be improved by a factor of six when compared to Sockets over a 10 Gigabit Ethernet network. A similar factor-of-six improvement in throughput is observed over the Sockets Direct Protocol using ConnectX QDR adapters. To the best of our knowledge, this is the first such Memcached design on high performance RDMA-capable interconnects.
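
For contrast with the RDMA path, the sketch below shows the byte-stream semantics the abstract attributes to the Sockets interface: a Memcached Get issued over the plain ASCII text protocol, where the request is flattened into bytes and the reply is parsed back out of a byte stream. The host, port, key, and the line-based reply parsing (valid only for small text values) are simplifying assumptions for illustration.

```java
// A Memcached "get" over plain sockets using the ASCII text protocol.
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SocketsGetExample {
    public static void main(String[] args) throws IOException {
        try (Socket sock = new Socket("localhost", 11211)) {
            OutputStream out = sock.getOutputStream();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), StandardCharsets.US_ASCII));

            // The request object is flattened into a byte stream...
            out.write("get mykey\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // ...and the reply byte stream is parsed back into a value.
            String header = in.readLine();            // "VALUE mykey <flags> <bytes>" or "END"
            if (header != null && header.startsWith("VALUE")) {
                String value = in.readLine();         // payload (assumed to be a small text value)
                in.readLine();                        // trailing "END"
                System.out.println("Got: " + value);
            } else {
                System.out.println("Cache miss");
            }
        }
    }
}
```

Every request and response on this path crosses the kernel socket stack and is copied between user and kernel buffers, which is exactly the overhead an RDMA-based design avoids.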


International Conference on Parallel Processing | 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand

Xiaoyi Lu; Nusrat Sharmin Islam; Md. Wasi-ur-Rahman; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

Hadoop RPC is the basic communication mechanism in the Hadoop ecosystem. It is used with other Hadoop components like MapReduce, HDFS, and HBase in real-world data centers, e.g., at Facebook and Yahoo!. However, the current Hadoop RPC design is built on the Java sockets interface, which limits its potential performance. The High Performance Computing community has exploited high-throughput and low-latency networks such as InfiniBand for many years. In this paper, we first analyze the performance of the current Hadoop RPC design by unearthing buffer management and communication bottlenecks that are not apparent on slower-speed networks. Then we propose a novel design (RPCoIB) of Hadoop RPC with RDMA over InfiniBand networks. RPCoIB provides a JVM-bypassed buffer management scheme and utilizes message size locality to avoid multiple memory allocations and copies in data serialization and deserialization. Our performance evaluations reveal that the basic ping-pong latencies for varied data sizes are reduced by 42%-49% and 46%-50% compared with 10GigE and IPoIB QDR (32 Gbps), respectively, while the RPCoIB design also improves the peak throughput by 82% and 64% compared with 10GigE and IPoIB. Compared to default Hadoop over IPoIB QDR, our RPCoIB design improves the performance of the Sort benchmark on 64 compute nodes by 15%, and it improves the performance of the CloudBurst application by 10%. We also present thorough, integrated evaluations of our RPCoIB design with other research directions that optimize HDFS and HBase using RDMA over InfiniBand. Compared with their best performance, we observe a 10% improvement for HDFS-IB and a 24% improvement for HBase-IB. To the best of our knowledge, this is the first such design of the Hadoop RPC system over high performance networks such as InfiniBand.
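
A minimal sketch of the idea behind reusing buffers based on message size locality: because successive RPC messages tend to have similar sizes, buffers can be pooled by size class and handed back out instead of being re-allocated and copied for every call. The pooling scheme below (power-of-two size classes, per-class free lists) is an illustrative approximation, not the authors' actual RPCoIB buffer manager.

```java
// Size-class buffer pool that reuses direct buffers across RPC messages.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;

public class SizeClassBufferPool {
    private final ConcurrentHashMap<Integer, Deque<ByteBuffer>> pool = new ConcurrentHashMap<>();

    // Round the requested size up to the next power of two so that messages of
    // similar size map to the same class and hit the same free list.
    private static int sizeClass(int size) {
        return Integer.highestOneBit(Math.max(size - 1, 1)) << 1;
    }

    public ByteBuffer acquire(int size) {
        int cls = sizeClass(size);
        Deque<ByteBuffer> free = pool.computeIfAbsent(cls, k -> new ArrayDeque<>());
        ByteBuffer buf;
        synchronized (free) {
            buf = free.pollFirst();
        }
        // Direct buffers can be registered with the NIC once and then reused,
        // avoiding per-message allocation and copy costs.
        return (buf != null) ? buf : ByteBuffer.allocateDirect(cls);
    }

    public void release(ByteBuffer buf) {
        buf.clear();
        Deque<ByteBuffer> free = pool.computeIfAbsent(buf.capacity(), k -> new ArrayDeque<>());
        synchronized (free) {
            free.addFirst(buf);
        }
    }
}
```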


High Performance Interconnects | 2014

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Xiaoyi Lu; Md. Wasi-ur Rahman; Nusrat Sharmin Islam; Dipti Shankar; Dhabaleswar K. Panda

Apache Hadoop MapReduce has been highly successful in processing large-scale, data-intensive batch applications on commodity clusters. However, for low-latency interactive applications and iterative computations, Apache Spark, an emerging in-memory processing framework, has been stealing the limelight. Recent studies have shown that current-generation Big Data frameworks (like Hadoop) cannot efficiently leverage advanced features (e.g., RDMA) on modern clusters with high-performance networks. One of the major bottlenecks is that these middleware are traditionally written with sockets and do not deliver the best performance on modern HPC systems with RDMA-enabled high-performance interconnects. In this paper, we first assess the opportunities of bringing the benefits of RDMA into the Spark framework. We further propose a high-performance RDMA-based design for accelerating data shuffle in the Spark framework on high-performance networks. Performance evaluations show that our proposed design can achieve 79-83% performance improvement for GroupBy, compared with default Spark running with IP over InfiniBand (IPoIB) FDR on a 128-256 core cluster. We adopt a plug-in-based approach that allows our design to be easily integrated with newer Spark releases. To the best of our knowledge, this is the first design for accelerating Spark with RDMA for Big Data processing.
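
The abstract mentions a plug-in-based approach for integrating with newer Spark releases. Spark exposes its shuffle implementation as a configurable component, so an alternative shuffle can be selected through configuration, as sketched below. The plugin class name is a hypothetical placeholder, not the authors' actual module, and the corresponding jar would need to be on the application classpath.

```java
// Selecting a pluggable shuffle manager through Spark configuration.
import org.apache.spark.SparkConf;

public class RdmaShuffleConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("GroupByWithRdmaShuffle")
                .setMaster("local[*]")
                // Hypothetical RDMA shuffle plugin; Spark's default is its sort-based shuffle.
                .set("spark.shuffle.manager", "org.example.spark.shuffle.RdmaShuffleManager");
        // Any SparkContext created from this configuration would route shuffle
        // traffic through the configured manager.
        System.out.println("spark.shuffle.manager = " + conf.get("spark.shuffle.manager"));
    }
}
```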


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand

Wasi-ur-Rahman; Nusrat Sharmin Islam; Xiaoyi Lu; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

MapReduce is a very popular programming model used to handle large datasets in enterprise data centers and clouds. Although various implementations of MapReduce exist, Hadoop MapReduce is the most widely used in large data centers like Facebook, Yahoo!, and Amazon due to its portability and fault tolerance. Network performance plays a key role in determining the performance of data-intensive applications using Hadoop MapReduce, as data required by the map and reduce processes can be distributed across the cluster. In this context, data center designers have been looking at high performance interconnects such as InfiniBand to enhance the performance of their Hadoop MapReduce based applications. However, achieving better performance through the use of high performance interconnects like InfiniBand is a non-trivial task. It requires a careful redesign of the communication framework inside MapReduce, since several assumptions made for socket-based communication in the current framework do not hold true for high performance interconnects. In this paper, we propose the design of an RDMA-based Hadoop MapReduce over InfiniBand with several design elements: data shuffle over InfiniBand, an in-memory merge mechanism for the Reducer, and data pre-fetching for the Mapper. We perform our experiments on native InfiniBand using Remote Direct Memory Access (RDMA) and compare our results with those of Hadoop-A [1] and default Hadoop over different interconnects and protocols. For all these experiments, we perform network-level parameter tuning and use optimum values for each Hadoop design. Our performance results show that, for a 100 GB TeraSort running on an eight-node cluster, we achieve a performance improvement of 32% over IP-over-InfiniBand (IPoIB) and 21% over Hadoop-A. With multiple disks per node, this benefit rises to 39% over IPoIB and 31% over Hadoop-A.
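
A minimal sketch of the "in-memory merge" concept mentioned for the Reducer: already-sorted map-output segments fetched over the network are merged directly in memory with a k-way merge, rather than being spilled to disk and merged from files. This is an illustrative simplification of the idea, not the paper's implementation, and it uses plain strings in place of Hadoop's key/value types.

```java
// K-way merge of sorted (key, value) segments held in memory.
import java.util.*;

public class InMemoryMerge {
    public static List<Map.Entry<String, String>> merge(List<List<Map.Entry<String, String>>> segments) {
        // Heap entry: {segment index, position within that segment}, ordered by current key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> segments.get(e[0]).get(e[1]).getKey()));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<Map.Entry<String, String>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<Map.Entry<String, String>> seg = segments.get(top[0]);
            out.add(seg.get(top[1]));
            if (top[1] + 1 < seg.size()) heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> a = List.of(Map.entry("apple", "1"), Map.entry("cat", "2"));
        List<Map.Entry<String, String>> b = List.of(Map.entry("bat", "3"), Map.entry("dog", "4"));
        merge(List.of(a, b)).forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```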


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture

Nusrat Sharmin Islam; Xiaoyi Lu; Md. Wasi-ur-Rahman; Dipti Shankar; Dhabaleswar K. Panda

HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
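
A minimal sketch of what a heterogeneous data placement policy can look like in the spirit of Triple-H: an incoming block is steered to RAM disk, SSD, HDD, or the parallel file system based on how hot the data is and how much space each tier has left. The hotness thresholds and the fall-through order are assumptions for illustration only, not the paper's actual policies.

```java
// Tier selection for a block, given a hotness score and per-tier free space.
public class PlacementPolicy {
    public enum Tier { RAM_DISK, SSD, HDD, LUSTRE }

    public static Tier place(double hotness, long blockBytes,
                             long ramFree, long ssdFree, long hddFree) {
        if (hotness > 0.8 && ramFree >= blockBytes) return Tier.RAM_DISK; // hottest data in memory
        if (hotness > 0.4 && ssdFree >= blockBytes) return Tier.SSD;      // warm data on flash
        if (hddFree >= blockBytes) return Tier.HDD;                        // cold data on local disk
        return Tier.LUSTRE;  // overflow to the parallel file system, which also provides fault tolerance
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(place(0.9, 128 * mb, 512 * mb, 4096 * mb, 100_000 * mb)); // RAM_DISK
        System.out.println(place(0.5, 128 * mb, 64 * mb, 4096 * mb, 100_000 * mb));  // SSD
        System.out.println(place(0.1, 128 * mb, 64 * mb, 16 * mb, 100_000 * mb));    // LUSTRE
    }
}
```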


International Conference on Supercomputing | 2014

HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects

Wasi-ur Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Dhabaleswar K. Panda

Hadoop MapReduce is the most popular open-source parallel programming model, extensively used in Big Data analytics. Although fault tolerance and platform independence make Hadoop MapReduce the most popular choice for many users, it still has huge potential for performance improvement. Recently, RDMA-based designs of Hadoop MapReduce have alleviated major performance bottlenecks through novel design features such as in-memory merge, prefetching and caching of map outputs, and overlapping of the merge and reduce phases. Although these features reduce the overall execution time of MapReduce jobs compared to the default framework, further improvement is possible if the shuffle and merge phases can also be overlapped with the map phase during job execution. In this paper, we propose HOMR (a Hybrid approach to exploit maximum Overlapping in MapReduce), which not only incorporates the features implemented in the RDMA-based design but also exploits the maximum possible overlapping among all the different phases compared to current best approaches. Our solution introduces two key concepts: a Greedy Shuffle Algorithm and On-demand Shuffle Adjustment, both of which are essential to achieving significant performance benefits over the default MapReduce framework. The architecture of HOMR is general enough to provide performance efficiency both over different Sockets interfaces and over previous RDMA-based designs over InfiniBand. Performance evaluations show that HOMR with RDMA over InfiniBand can achieve performance benefits of 54% and 56% compared to default Hadoop over IPoIB (IP over InfiniBand) and 10GigE, respectively. Compared to the previous best RDMA-based design, this benefit is 29%. HOMR over Sockets also achieves a maximum of 38-40% benefit compared to default Hadoop over the Sockets interface. We also evaluate our design with real-world workloads like SWIM and PUMA, and observe benefits of up to 16% and 18%, respectively, over the previous best RDMA-based design. To the best of our knowledge, this is the first approach to achieve maximum possible overlapping for the MapReduce framework.
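
The abstract names a Greedy Shuffle Algorithm but does not spell it out, so the sketch below only illustrates the general flavor of greedy shuffling: as map outputs become available, the reducer always fetches the largest ready output first so that shuffle work overlaps with the still-running map phase. This is purely an illustrative stand-in, not HOMR's algorithm.

```java
// Greedy, largest-first scheduling of map-output fetches.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class GreedyShuffle {
    record MapOutput(String mapTaskId, long sizeBytes) {}

    private final PriorityQueue<MapOutput> ready =
            new PriorityQueue<>(Comparator.comparingLong(MapOutput::sizeBytes).reversed());

    // Called whenever a map task finishes and its output becomes fetchable.
    public void onMapOutputReady(MapOutput out) {
        ready.add(out);
    }

    // Called whenever a fetch slot frees up; returns the next output to shuffle, or null.
    public MapOutput nextToFetch() {
        return ready.poll();
    }

    public static void main(String[] args) {
        GreedyShuffle s = new GreedyShuffle();
        s.onMapOutputReady(new MapOutput("map_0001", 64L << 20));
        s.onMapOutputReady(new MapOutput("map_0002", 256L << 20));
        s.onMapOutputReady(new MapOutput("map_0003", 8L << 20));
        List<MapOutput> order = new ArrayList<>();
        for (MapOutput m; (m = s.nextToFetch()) != null; ) order.add(m);
        order.forEach(m -> System.out.println(m.mapTaskId() + " " + m.sizeBytes()));
    }
}
```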


High Performance Interconnects | 2012

Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems

Jérôme Vienne; Jitong Chen; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hari Subramoni; Dhabaleswar K. Panda

Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever increasing communication demands placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). The PCIe interface can now deliver speeds of up to 128 Gbps (Gen3), and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect by utilizing newer generations of these interconnects over older ones, or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE). In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications.


International Parallel and Distributed Processing Symposium | 2015

High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA

Md. Wasi-ur-Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

The viability and benefits of running MapReduce over modern High Performance Computing (HPC) clusters, with high performance interconnects and parallel file systems, have attracted much attention in recent times because of the unique ability to solve data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g., Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that, due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform the other in different HPC environments, and our dynamic detection mechanism can deliver the best performance based on the performance characteristics observed during job execution. Through this design, we can achieve a 44% performance benefit for shuffle-intensive workloads on leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit the performance characteristics of alternate shuffle strategies for YARN MapReduce with Lustre and RDMA.
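
A minimal sketch of the idea of dynamically choosing between the two shuffle strategies the paper identifies (RDMA fetch versus direct Lustre read): time both paths on a few probe transfers during the run and switch to whichever is faster. The sampling scheme, probe interface, and decision rule below are assumptions for illustration, not the paper's detection modules.

```java
// Runtime selection between two shuffle strategies based on probe timings.
import java.util.function.LongSupplier;

public class ShuffleStrategySelector {
    public enum Strategy { RDMA, LUSTRE_READ }

    // Each supplier returns the observed time (in microseconds) to move a probe
    // chunk of intermediate data via that path.
    public static Strategy choose(LongSupplier rdmaProbeMicros, LongSupplier lustreProbeMicros, int samples) {
        long rdmaTotal = 0, lustreTotal = 0;
        for (int i = 0; i < samples; i++) {
            rdmaTotal += rdmaProbeMicros.getAsLong();
            lustreTotal += lustreProbeMicros.getAsLong();
        }
        return (rdmaTotal <= lustreTotal) ? Strategy.RDMA : Strategy.LUSTRE_READ;
    }

    public static void main(String[] args) {
        // Stand-in probes; a real deployment would time actual transfers.
        Strategy s = choose(() -> 120, () -> 180, 5);
        System.out.println("Selected shuffle strategy: " + s);
    }
}
```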


International Conference on Parallel Processing | 2012

SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks

Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Jithin Jose; Miao Luo; Hao Wang; Dhabaleswar K. Panda

Many applications cache huge amounts of data in RAM to achieve high performance. A good example is Memcached, a distributed-memory object-caching system. Memcached performance directly depends on the aggregated memory pool size. Given the constraints of hardware cost, power/thermal concerns, and floor plan limits, it is difficult to further scale the memory pool by packing more RAM into individual servers or by expanding the server array horizontally. In this paper, we propose an SSD-Assisted Hybrid Memory that expands RAM with SSD to make a large amount of memory available. Hybrid memory works as an object cache and manages resource allocation at object granularity, which is more efficient than allocation at page granularity. It leverages the fast random-read property of SSDs to achieve low-latency object access, and it organizes the SSD into a log-structured sequence of blocks to overcome SSD writing anomalies. Compared to alternatives that use the SSD as a virtual memory swap device, hybrid memory reduces random access latency by 68% and 72% for read and write operations, respectively, and improves operation throughput by 15.3 times. Additionally, it reduces write traffic to the SSD by 81%, which implies a 5.3 times improvement in SSD lifetime. We have integrated our hybrid memory design into Memcached. Our experiments indicate a 3.7X reduction in Memcached Get operation latency and up to a 5.3X improvement in operation throughput. To the best of our knowledge, this paper is the first work that integrates cutting-edge SSDs and InfiniBand verbs into Memcached to accelerate its performance.
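
A minimal sketch of the log-structured object store idea: objects evicted from RAM are appended sequentially to a log file on the SSD (sequential writes sidestep SSD write anomalies), while an in-memory index of (offset, length) lets each object be fetched later with a single random read. The file name, layout, and absence of compaction are simplifying assumptions; this is not the paper's implementation.

```java
// Append-only object log with an in-memory index.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class SsdObjectLog implements AutoCloseable {
    private record Location(long offset, int length) {}

    private final RandomAccessFile log;
    private final Map<String, Location> index = new HashMap<>();
    private long tail = 0;

    public SsdObjectLog(String path) throws IOException {
        this.log = new RandomAccessFile(path, "rw");
    }

    // Append-only write at object granularity; only sequential writes hit the SSD.
    public synchronized void put(String key, byte[] value) throws IOException {
        log.seek(tail);
        log.write(value);
        index.put(key, new Location(tail, value.length));
        tail += value.length;
    }

    // One random read per object lookup.
    public synchronized byte[] get(String key) throws IOException {
        Location loc = index.get(key);
        if (loc == null) return null;
        byte[] buf = new byte[loc.length()];
        log.seek(loc.offset());
        log.readFully(buf);
        return buf;
    }

    @Override
    public void close() throws IOException {
        log.close();
    }

    public static void main(String[] args) throws IOException {
        try (SsdObjectLog store = new SsdObjectLog("hybrid-memory.log")) {
            store.put("user:42", "cached-object-bytes".getBytes());
            System.out.println(new String(store.get("user:42")));
        }
    }
}
```

Overwritten keys leave stale records behind in the log, which is why real log-structured stores also need a compaction or garbage-collection pass.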

Collaboration


Dive into Nusrat Sharmin Islam's collaborations.

Top Co-Authors

Xiaoyi Lu

Ohio State University

Md. Wasi-ur-Rahman

Bangladesh University of Engineering and Technology

Jian Huang

Georgia Institute of Technology
