Publication


Featured research published by Xiaoyi Lu.


international conference on parallel processing | 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand

Xiaoyi Lu; Nusrat Sharmin Islam; Md. Wasi-ur-Rahman; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

Hadoop RPC is the basic communication mechanism in the Hadoop ecosystem. It is used with other Hadoop components like MapReduce, HDFS, and HBase in real-world data centers, e.g. at Facebook and Yahoo!. However, the current Hadoop RPC design is built on the Java sockets interface, which limits its potential performance. The High Performance Computing community has exploited high-throughput, low-latency networks such as InfiniBand for many years. In this paper, we first analyze the performance of the current Hadoop RPC design by unearthing buffer management and communication bottlenecks that are not apparent on slower networks. Then we propose a novel design (RPCoIB) of Hadoop RPC with RDMA over InfiniBand networks. RPCoIB provides a JVM-bypassed buffer management scheme and utilizes message size locality to avoid multiple memory allocations and copies in data serialization and deserialization. Our performance evaluations reveal that the basic ping-pong latencies for varied data sizes are reduced by 42%-49% and 46%-50% compared with 10GigE and IPoIB QDR (32 Gbps), respectively, while the RPCoIB design also improves the peak throughput by 82% and 64% compared with 10GigE and IPoIB. Compared to default Hadoop over IPoIB QDR, our RPCoIB design improves the performance of the Sort benchmark on 64 compute nodes by 15% and the performance of the CloudBurst application by 10%. We also present thorough, integrated evaluations of our RPCoIB design with other research directions that optimize HDFS and HBase using RDMA over InfiniBand. Compared with their best performance, we observe a 10% improvement for HDFS-IB and a 24% improvement for HBase-IB. To the best of our knowledge, this is the first such design of the Hadoop RPC system over high-performance networks such as InfiniBand.
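To illustrate the message-size-locality idea from the abstract, the sketch below shows a generic size-class buffer pool that rounds each request up to a power-of-two class and reuses off-heap buffers instead of reallocating them; the class and its API are hypothetical and are not the RPCoIB implementation.

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedDeque;

    // Hypothetical size-class buffer pool: not the RPCoIB code, just the general idea of
    // reusing off-heap buffers for messages of similar size instead of reallocating.
    public class SizeClassBufferPool {

        // One free list of direct buffers per power-of-two size class.
        private final Map<Integer, ConcurrentLinkedDeque<ByteBuffer>> freeLists = new ConcurrentHashMap<>();

        // Round the requested size up to the next power of two (minimum 64 bytes).
        private static int sizeClass(int size) {
            int n = Math.max(size, 64);
            return 1 << (32 - Integer.numberOfLeadingZeros(n - 1));
        }

        // Borrow a direct buffer large enough for `size` bytes, reusing a pooled one if available.
        public ByteBuffer acquire(int size) {
            ByteBuffer buf = freeLists
                    .computeIfAbsent(sizeClass(size), k -> new ConcurrentLinkedDeque<>())
                    .pollFirst();
            if (buf == null) {
                buf = ByteBuffer.allocateDirect(sizeClass(size)); // off-heap, friendly to native transports
            }
            buf.clear();
            return buf;
        }

        // Return the buffer to its size class so later messages of similar size can reuse it.
        public void release(ByteBuffer buf) {
            freeLists.computeIfAbsent(buf.capacity(), k -> new ConcurrentLinkedDeque<>()).addFirst(buf);
        }

        public static void main(String[] args) {
            SizeClassBufferPool pool = new SizeClassBufferPool();
            ByteBuffer first = pool.acquire(1500);   // rounded up to the 2048-byte class
            pool.release(first);
            ByteBuffer second = pool.acquire(1800);  // same class, so the pooled buffer is reused
            System.out.println("reused pooled buffer: " + (first == second));
        }
    }

Because consecutive RPCs tend to carry similarly sized payloads, such a pool turns most allocations into constant-time reuse, which is the effect the abstract attributes to message size locality.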


high performance interconnects | 2014

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Xiaoyi Lu; Md. Wasi-ur Rahman; Nusrat Sharmin Islam; Dipti Shankar; Dhabaleswar K. Panda

Apache Hadoop MapReduce has been highly successful in processing large-scale, data-intensive batch applications on commodity clusters. However, for low-latency interactive applications and iterative computations, Apache Spark, an emerging in-memory processing framework, has been stealing the limelight. Recent studies have shown that current-generation Big Data frameworks (like Hadoop) cannot efficiently leverage advanced features (e.g. RDMA) on modern clusters with high-performance networks. One of the major bottlenecks is that these middlewares are traditionally written with sockets and do not deliver the best performance on modern HPC systems with RDMA-enabled high-performance interconnects. In this paper, we first assess the opportunities of bringing the benefits of RDMA into the Spark framework. We further propose a high-performance RDMA-based design for accelerating data shuffle in the Spark framework on high-performance networks. Performance evaluations show that our proposed design can achieve 79-83% performance improvement for GroupBy, compared with the default Spark running with IP over InfiniBand (IPoIB) FDR on a 128-256 core cluster. We adopt a plug-in-based approach that allows our design to be easily integrated with newer Spark releases. To the best of our knowledge, this is the first design for accelerating Spark with RDMA for Big Data processing.
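The plug-in-based integration mentioned above can be sketched purely as configuration: Spark selects its shuffle implementation via the spark.shuffle.manager property. The RdmaShuffleManager class name below is hypothetical and only stands in for such an external plug-in.

    import org.apache.spark.SparkConf;

    // Sketch of the configuration-driven, plug-in style of integration described above.
    // The ShuffleManager class name is hypothetical and stands in for an external
    // RDMA-capable shuffle plug-in shipped as a separate jar; it is not part of Spark.
    public class RdmaShuffleConfigSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("groupby-over-pluggable-shuffle")
                    // Spark resolves whatever class this property names at startup,
                    // so an alternative shuffle path can be enabled without code changes.
                    .set("spark.shuffle.manager", "com.example.shuffle.RdmaShuffleManager");

            // Print the resolved settings; a real run would pass this conf to spark-submit
            // with the plug-in jar on the driver and executor classpaths.
            for (scala.Tuple2<String, String> kv : conf.getAll()) {
                System.out.println(kv._1() + " = " + kv._2());
            }
        }
    }

Keeping the fast path behind a configuration switch is what lets the same application fall back to the stock sockets-based shuffle when an RDMA-capable interconnect is unavailable.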


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand

Wasi-ur-Rahman; Nusrat Sharmin Islam; Xiaoyi Lu; Jithin Jose; Hari Subramoni; Hao Wang; Dhabaleswar K. Panda

MapReduce is a very popular programming model used to handle large datasets in enterprise data centers and clouds. Although various implementations of MapReduce exist, Hadoop MapReduce is the most widely used in large data centers like Facebook, Yahoo! and Amazon due to its portability and fault tolerance. Network performance plays a key role in determining the performance of data-intensive applications using Hadoop MapReduce, as data required by the map and reduce processes can be distributed across the cluster. In this context, data center designers have been looking at high-performance interconnects such as InfiniBand to enhance the performance of their Hadoop MapReduce based applications. However, achieving better performance through the use of high-performance interconnects like InfiniBand is a significant undertaking: it requires a careful redesign of the communication framework inside MapReduce, since several assumptions made for socket-based communication in the current framework do not hold true for high-performance interconnects. In this paper, we propose an RDMA-based design of Hadoop MapReduce over InfiniBand with several design elements: data shuffle over InfiniBand, an in-memory merge mechanism for the Reducer, and data prefetching for the Mapper. We perform our experiments on native InfiniBand using Remote Direct Memory Access (RDMA) and compare our results with those of Hadoop-A [1] and default Hadoop over different interconnects and protocols. For all these experiments, we perform network-level parameter tuning and use optimum values for each Hadoop design. Our performance results show that, for a 100 GB TeraSort running on an eight-node cluster, we achieve a performance improvement of 32% over IP-over-InfiniBand (IPoIB) and 21% over Hadoop-A. With multiple disks per node, this benefit rises to 39% over IPoIB and 31% over Hadoop-A.
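The in-memory merge mechanism for the Reducer can be illustrated with a generic k-way merge of key-sorted segments held in memory; this is a minimal sketch of the technique under that assumption, not the paper's implementation.

    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Generic sketch of an in-memory k-way merge of key-sorted map-output segments,
    // the kind of merge a reducer can do without spilling to disk. Not the paper's code.
    public class InMemoryMerge {

        // Merge several segments (each sorted by key) into one key-sorted stream of [key, value] pairs.
        public static Iterator<String[]> merge(List<List<String[]>> segments) {
            // Heap entry: { current pair, iterator it came from }, ordered by the pair's key.
            PriorityQueue<Object[]> heap = new PriorityQueue<>(
                    (a, b) -> ((String[]) a[0])[0].compareTo(((String[]) b[0])[0]));
            for (List<String[]> segment : segments) {
                Iterator<String[]> it = segment.iterator();
                if (it.hasNext()) heap.add(new Object[] { it.next(), it });
            }
            return new Iterator<String[]>() {
                public boolean hasNext() { return !heap.isEmpty(); }
                @SuppressWarnings("unchecked")
                public String[] next() {
                    Object[] top = heap.poll();
                    Iterator<String[]> source = (Iterator<String[]>) top[1];
                    if (source.hasNext()) heap.add(new Object[] { source.next(), source });
                    return (String[]) top[0];
                }
            };
        }

        public static void main(String[] args) {
            List<List<String[]>> segments = List.of(
                    List.of(new String[] { "apple", "1" }, new String[] { "mango", "2" }),
                    List.of(new String[] { "banana", "3" }, new String[] { "plum", "4" }));
            merge(segments).forEachRemaining(kv -> System.out.println(kv[0] + " -> " + kv[1]));
        }
    }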


international parallel and distributed processing symposium | 2014

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Xiaoyi Lu; Fan Liang; Bing Wang; Li Zha; Zhiwei Xu

MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time-consuming techniques such as HTTP/RPC. This paper takes a step toward bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, where processing and communication of a large number of key-value pair instances are needed through distributed computation models such as MapReduce, Iteration, and Streaming. We abstract the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI: Dichotomic, Dynamic, Data-centric, and Diversified features. Utilizing this model, we propose the specification of a minimalistic extension to MPI. An open-source communication library, DataMPI, is developed to implement this specification. Performance experiments show that DataMPI has significant advantages in performance and flexibility, while maintaining the high productivity, scalability, and fault tolerance of Hadoop.
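The bipartite key-value model can be illustrated with a toy, single-process simulation in which one disjoint set of tasks only sends key/value pairs and the other only receives them, with routing by key hash; the classes below are illustrative and are not the DataMPI API.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Toy illustration of the bipartite ("dichotomic") key-value communication model:
    // one set of tasks only sends key/value pairs, a disjoint set only receives them,
    // and routing is data-centric (by key hash). Single-process simulation, not DataMPI.
    public class BipartiteKeyValueDemo {
        record KV(String key, String value) {}
        private static final KV EOS = new KV(null, null);   // end-of-stream marker

        public static void main(String[] args) throws InterruptedException {
            int numReceivers = 2;
            List<BlockingQueue<KV>> channels = List.of(
                    new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>());

            // "O" side: a sender emits key/value pairs; the target is chosen by key hash.
            Thread sender = new Thread(() -> {
                List<KV> data = List.of(new KV("apple", "1"), new KV("banana", "2"), new KV("apple", "3"));
                for (KV kv : data) {
                    int target = Math.floorMod(kv.key().hashCode(), numReceivers);
                    channels.get(target).add(kv);
                }
                channels.forEach(q -> q.add(EOS));
            });

            // "A" side: receivers consume pairs for their partition until end-of-stream.
            Thread[] receivers = new Thread[numReceivers];
            for (int r = 0; r < numReceivers; r++) {
                final int id = r;
                receivers[r] = new Thread(() -> {
                    try {
                        for (KV kv = channels.get(id).take(); kv != EOS; kv = channels.get(id).take()) {
                            System.out.println("receiver " + id + " got " + kv.key() + "=" + kv.value());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            sender.start();
            for (Thread t : receivers) t.start();
            sender.join();
            for (Thread t : receivers) t.join();
        }
    }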


ieee/acm international symposium cluster, cloud and grid computing | 2015

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture

Nusrat Sharmin Islam; Xiaoyi Lu; Md. Wasi-ur-Rahman; Dipti Shankar; Dhabaleswar K. Panda

HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to limited local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g. RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g. Lustre), can lead to significant storage space savings and guarantees fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
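The idea of placement policies over heterogeneous storage can be sketched as a simple tier-selection routine that prefers the fastest device with remaining capacity; the tier names and capacities below are illustrative assumptions, not the Triple-H code.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of a tiered placement policy over heterogeneous storage
    // (RAM disk, then SSD, then HDD), in the spirit of the hybrid design above.
    public class TieredPlacementPolicy {
        // Remaining capacity per tier, in bytes, ordered fastest-first.
        private final Map<String, Long> freeBytes = new LinkedHashMap<>();

        public TieredPlacementPolicy(long ramBytes, long ssdBytes, long hddBytes) {
            freeBytes.put("RAM_DISK", ramBytes);
            freeBytes.put("SSD", ssdBytes);
            freeBytes.put("HDD", hddBytes);
        }

        // Pick the fastest tier that can still hold the block; fall back to slower tiers.
        public String placeBlock(long blockSize) {
            for (Map.Entry<String, Long> tier : freeBytes.entrySet()) {
                if (tier.getValue() >= blockSize) {
                    tier.setValue(tier.getValue() - blockSize);
                    return tier.getKey();
                }
            }
            throw new IllegalStateException("no local tier has room; would spill to a parallel FS (e.g. Lustre)");
        }

        public static void main(String[] args) {
            TieredPlacementPolicy policy = new TieredPlacementPolicy(256L << 20, 1L << 30, 4L << 30);
            long block = 128L << 20;                       // 128 MB HDFS-style block
            for (int i = 0; i < 5; i++) {
                System.out.println("block " + i + " -> " + policy.placeBlock(block));
            }
        }
    }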


international conference on supercomputing | 2014

HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects

Wasi-ur Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Dhabaleswar K. Panda

Hadoop MapReduce is the most popular open-source parallel programming model extensively used in Big Data analytics. Although fault tolerance and platform independence make Hadoop MapReduce the most popular choice for many users, it still has substantial room for performance improvement. Recently, an RDMA-based design of Hadoop MapReduce has alleviated major performance bottlenecks with the implementation of many novel design features such as in-memory merge, prefetching and caching of map outputs, and overlapping of the merge and reduce phases. Although these features reduce the overall execution time for MapReduce jobs compared to the default framework, further improvement is possible if the shuffle and merge phases can also be overlapped with the map phase during job execution. In this paper, we propose HOMR (a Hybrid approach to exploit maximum Overlapping in MapReduce), which not only incorporates the features implemented in the RDMA-based design but also exploits the maximum possible overlapping among all the different phases compared to current best approaches. Our solution introduces two key concepts: a Greedy Shuffle Algorithm and On-demand Shuffle Adjustment, both of which are essential to achieving significant performance benefits over the default MapReduce framework. The architecture of HOMR is general enough to provide performance efficiency over both different Sockets interfaces and previous RDMA-based designs over InfiniBand. Performance evaluations show that HOMR with RDMA over InfiniBand can achieve performance benefits of 54% and 56% compared to default Hadoop over IPoIB (IP over InfiniBand) and 10GigE, respectively. Compared to the previous best RDMA-based designs, this benefit is 29%. HOMR over Sockets also achieves a maximum of 38-40% benefit compared to default Hadoop over the Sockets interface. We also evaluate our design with real-world workloads like SWIM and PUMA, and observe benefits of up to 16% and 18%, respectively, over the previous best RDMA-based design. To the best of our knowledge, this is the first approach to achieve maximum possible overlapping for the MapReduce framework.
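The benefit of overlapping the shuffle with the map phase can be illustrated with a toy greedy scheduler that starts fetching a map task's output as soon as that task completes; the timings and names below are artificial, and this is not HOMR's algorithm.

    import java.util.concurrent.*;

    // Toy simulation of greedy shuffle scheduling: each map output is fetched as soon
    // as its map task finishes, overlapping shuffle with the still-running map phase,
    // instead of waiting for all maps to complete. A sketch of the idea only.
    public class GreedyShuffleDemo {
        public static void main(String[] args) throws Exception {
            int numMaps = 4;
            ExecutorService mappers = Executors.newFixedThreadPool(2);
            ExecutorService shufflers = Executors.newFixedThreadPool(2);
            CompletionService<Integer> mapDone = new ExecutorCompletionService<>(mappers);

            // Launch map tasks with different (simulated) run times.
            for (int m = 0; m < numMaps; m++) {
                final int id = m;
                mapDone.submit(() -> {
                    Thread.sleep(100L * (id + 1));          // pretend to compute map output
                    System.out.println("map " + id + " finished");
                    return id;
                });
            }

            // Greedy shuffle: as each map completes, immediately start fetching its output.
            for (int done = 0; done < numMaps; done++) {
                int mapId = mapDone.take().get();
                shufflers.submit(() ->
                        System.out.println("fetching output of map " + mapId + " while other maps still run"));
            }

            mappers.shutdown();
            shufflers.shutdown();
            shufflers.awaitTermination(5, TimeUnit.SECONDS);
        }
    }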


ieee/acm international symposium cluster, cloud and grid computing | 2013

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

Jithin Jose; Mingzhe Li; Xiaoyi Lu; Krishna Chaitanya Kandalla; Mark Daniel Arnold; Dhabaleswar K. Panda

High Performance Computing (HPC) systems are becoming increasingly complex and are also associated with very high operational costs. The cloud computing paradigm, coupled with modern Virtual Machine (VM) technology, offers attractive techniques to easily manage large-scale systems, while significantly bringing down the cost of computation, memory, and storage. However, running HPC applications on cloud systems still remains a major challenge. One of the biggest hurdles in realizing this objective is the performance offered by virtualized computing environments, more specifically, virtualized I/O devices. Since HPC applications and communication middlewares rely heavily on advanced features offered by modern high-performance interconnects such as InfiniBand, the performance of virtualized InfiniBand interfaces is crucial. Emerging hardware-based solutions, such as Single Root I/O Virtualization (SR-IOV), offer an attractive alternative to existing software-based solutions. The benefits of SR-IOV have been widely studied for GigE and 10GigE networks. However, with InfiniBand networks being increasingly adopted in the cloud computing domain, it is critical to fully understand the performance benefits of SR-IOV in InfiniBand networks, especially for exploring the performance characteristics and trade-offs of HPC communication middlewares (such as Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models) and applications. To the best of our knowledge, this is the first paper that offers an in-depth analysis of SR-IOV with InfiniBand. Our experimental evaluations show that the performance of MPI and PGAS point-to-point communication benchmarks over SR-IOV with InfiniBand is comparable to that of native InfiniBand hardware for most message lengths. However, we observe that the performance of MPI collective operations over SR-IOV with InfiniBand is inferior to native (non-virtualized) mode. We also evaluate the trade-offs of various VM-to-CPU mapping policies on modern multi-core architectures and present our experiences.


international parallel and distributed processing symposium | 2015

High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA

Md. Wasi-ur-Rahman; Xiaoyi Lu; Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Dhabaleswar K. Panda

The viability and benefits of running MapReduce over modern High Performance Computing (HPC) clusters, with high-performance interconnects and parallel file systems, have attracted much attention in recent times, due to the unique opportunity of solving data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g. Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that, due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform another in different HPC environments, and our dynamic detection mechanism can deliver the best performance based on the characteristics observed during job execution. Through this design, we can achieve 44% performance benefit for shuffle-intensive workloads on leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit the performance characteristics of alternate shuffle strategies for YARN MapReduce with Lustre and RDMA.
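The dynamic strategy detection described above can be sketched as a runtime probe: time a small sample of each shuffle path and keep the faster one. The probes below are placeholders under that assumption; a real probe would move a sample of shuffle data over RDMA or read it back from the parallel file system.

    import java.util.Map;

    // Sketch of runtime strategy selection between two shuffle paths (e.g. "RDMA" vs
    // "Lustre Read"): time a small probe of each and keep using whichever was faster.
    public class ShuffleStrategySelector {

        // Run each candidate probe once, measure it, and return the name of the fastest.
        public static String pickFastest(Map<String, Runnable> candidates) {
            String best = null;
            long bestNanos = Long.MAX_VALUE;
            for (Map.Entry<String, Runnable> c : candidates.entrySet()) {
                long start = System.nanoTime();
                c.getValue().run();
                long elapsed = System.nanoTime() - start;
                System.out.printf("probe %-12s took %d us%n", c.getKey(), elapsed / 1_000);
                if (elapsed < bestNanos) {
                    bestNanos = elapsed;
                    best = c.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Placeholder probes; in practice these would shuffle a small data sample
            // over the network (RDMA) or read it back from the parallel file system.
            Map<String, Runnable> probes = Map.of(
                    "rdma-shuffle", () -> busyWork(200_000),
                    "lustre-read", () -> busyWork(400_000));
            System.out.println("selected strategy: " + pickFastest(probes));
        }

        private static void busyWork(int iterations) {
            long sink = 0;
            for (int i = 0; i < iterations; i++) sink += i;
            if (sink == 42) System.out.println();           // keep the loop from being optimized away
        }
    }

Which probe wins can change from one environment to another, which mirrors the paper's observation that no single shuffle strategy is best on every Lustre setup.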


ieee international conference on high performance computing, data, and analytics | 2014

High performance MPI library over SR-IOV enabled InfiniBand clusters

Jie Zhang; Xiaoyi Lu; Jithin Jose; Mingzhe Li; Rong Shi; Dhabaleswar K. Panda

Virtualization plays a central role in HPC clouds due to easy management and the low cost of computation and communication. Recently, Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as InfiniBand and can attain near-native performance for inter-node communication. However, the SR-IOV scheme lacks locality-aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. To address this issue, this paper first proposes a high-performance design of an MPI library over SR-IOV enabled InfiniBand clusters by dynamically detecting VM locality and coordinating data movements between the SR-IOV and Inter-VM shared memory (IVShmem) channels. Through our proposed design, MPI applications running in virtualized mode can achieve efficient locality-aware communication on SR-IOV enabled InfiniBand clusters. In addition, we optimize communications in the IVShmem and SR-IOV channels by analyzing the performance impact of core mechanisms and parameters inside the MPI library to deliver better performance in virtual machines. Finally, we conduct comprehensive performance studies using point-to-point and collective benchmarks, and HPC applications. Experimental evaluations show that our proposed MPI library design can significantly improve the performance of point-to-point operations, collective operations, and MPI applications with different InfiniBand transport protocols (RC and UD), by up to 158%, 76%, and 43%, respectively, compared with SR-IOV. To the best of our knowledge, this is the first study to offer a high-performance MPI library that supports efficient locality-aware MPI communication over SR-IOV enabled InfiniBand clusters.
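Locality-aware channel selection can be sketched as a simple co-residence check: if the peer VM runs on the same physical host, use a shared-memory channel, otherwise fall back to the SR-IOV network path. The mapping and channel names below are illustrative assumptions, not the actual MPI library internals.

    import java.util.Map;

    // Sketch of locality-aware channel selection: if the peer runs on the same physical
    // host, prefer a shared-memory path; otherwise fall back to the network (SR-IOV VF).
    public class LocalityAwareChannel {
        enum Channel { IVSHMEM, SRIOV_NETWORK }

        // rank -> physical host the VM is pinned to (illustrative static mapping).
        private final Map<Integer, String> rankToHost;
        private final int myRank;

        LocalityAwareChannel(int myRank, Map<Integer, String> rankToHost) {
            this.myRank = myRank;
            this.rankToHost = rankToHost;
        }

        // Choose the channel for a message to peerRank based on co-residence.
        Channel channelFor(int peerRank) {
            boolean sameHost = rankToHost.get(myRank).equals(rankToHost.get(peerRank));
            return sameHost ? Channel.IVSHMEM : Channel.SRIOV_NETWORK;
        }

        public static void main(String[] args) {
            Map<Integer, String> placement = Map.of(0, "hostA", 1, "hostA", 2, "hostB");
            LocalityAwareChannel rank0 = new LocalityAwareChannel(0, placement);
            System.out.println("rank 0 -> rank 1: " + rank0.channelFor(1));   // IVSHMEM (same host)
            System.out.println("rank 0 -> rank 2: " + rank0.channelFor(2));   // SRIOV_NETWORK
        }
    }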


high performance distributed computing | 2014

SOR-HDFS: a SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS

Nusrat Sharmin Islam; Xiaoyi Lu; Md. Wasi-ur Rahman; Dhabaleswar K. Panda

In this paper, we propose SOR-HDFS, a SEDA (Staged Event-Driven Architecture)-based approach to improve the performance of the HDFS Write operation. This design not only incorporates RDMA-based communication over InfiniBand but also maximizes overlapping among the different stages of data transfer and I/O. Performance evaluations show that the new design improves the aggregated write throughput of the Enhanced DFSIO benchmark in Intel HiBench by up to 64% and reduces the job execution time by 37% compared to IPoIB (IP over InfiniBand). Compared to the previous best RDMA-enhanced design [4], the improvements in throughput and execution time are 30% and 20%, respectively. Our design can also improve the performance of the HBase Put operation by up to 53% over IPoIB and 29% compared to the previous best RDMA-enhanced HDFS. To the best of our knowledge, this is the first design of a SEDA-based HDFS in the literature.
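A SEDA-style pipeline can be sketched as a chain of stages, each with its own queue and worker thread, so that receiving, processing, and writing of successive packets overlap in time; the stage names and work items below are placeholders, not the SOR-HDFS stages.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal SEDA-style pipeline: each stage has its own queue and worker thread, so
    // receiving, processing, and writing of successive packets overlap in time.
    public class SedaPipelineDemo {

        // A stage pulls items from its input queue, processes them, and pushes downstream.
        static Thread stage(String name, BlockingQueue<String> in, BlockingQueue<String> out) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String item = in.take();
                        if (item.equals("EOF")) {            // propagate shutdown marker downstream
                            if (out != null) out.put("EOF");
                            return;
                        }
                        String result = name + "(" + item + ")";
                        System.out.println(result);
                        if (out != null) out.put(result);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            return t;
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> recvQ = new ArrayBlockingQueue<>(8);
            BlockingQueue<String> procQ = new ArrayBlockingQueue<>(8);
            BlockingQueue<String> ioQ = new ArrayBlockingQueue<>(8);

            Thread recv = stage("receive", recvQ, procQ);   // e.g. take packets off the wire
            Thread proc = stage("process", procQ, ioQ);     // e.g. checksum / replicate
            Thread io   = stage("write",   ioQ,   null);    // e.g. flush to disk

            for (int i = 0; i < 4; i++) recvQ.put("pkt" + i);
            recvQ.put("EOF");

            recv.join();
            proc.join();
            io.join();
        }
    }

Because packet i can be written while packet i+1 is processed and packet i+2 is received, the stages run concurrently rather than strictly one after another, which is the overlapping effect the design above maximizes.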

Collaboration


Dive into Xiaoyi Lu's collaborations.

Top Co-Authors

Md. Wasi-ur-Rahman
University of Southern California

Jie Zhang
Ohio State University

Jian Lin
Chinese Academy of Sciences