Miao Luo
Ohio State University
Publications
Featured research published by Miao Luo.
international conference on parallel processing | 2011
Jithin Jose; Hari Subramoni; Miao Luo; Minjia Zhang; Jian Huang; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hao Wang; Sayantan Sur; Dhabaleswar K. Panda
Memcached is a key-value distributed memory object caching system. It is widely used in data-center environments for caching the results of database calls, API calls or any other data. Using Memcached, spare memory in data-center servers can be aggregated to speed up lookups of frequently accessed information. The performance of Memcached is directly related to the underlying networking technology, as workloads are often latency sensitive. The existing Memcached implementation is built upon the BSD Sockets interface. Sockets offers byte-stream oriented semantics. Therefore, using Sockets, there is a conversion between Memcached's memory-object semantics and Sockets' byte-stream semantics, imposing an overhead. This is in addition to any extra memory copies in the Sockets implementation within the OS. Over the past decade, high performance interconnects have employed Remote Direct Memory Access (RDMA) technology to provide excellent performance for the scientific computation domain. In addition to its high raw performance, the memory-based semantics of RDMA fit very well with Memcached's memory-object model. While the Sockets interface can be ported to use RDMA, it is not very efficient when compared with low-level RDMA APIs. In this paper, we describe a novel design of Memcached for RDMA capable networks. Our design extends the existing open-source Memcached software and makes it RDMA capable. We provide a detailed performance comparison of our Memcached design against unmodified Memcached using Sockets over RDMA and a 10 Gigabit Ethernet network with hardware-accelerated TCP/IP. Our performance evaluation reveals that the latency of a 4 KB Memcached Get can be brought down to 12 µs using ConnectX InfiniBand QDR adapters. The latency of the same operation using older generation DDR adapters is about 20 µs. These numbers are about a factor of four better than the performance obtained by using 10GigE with TCP Offload. In addition, the latencies of Get requests over a range of message sizes are better by a factor of five to ten compared to IP over InfiniBand and Sockets Direct Protocol over InfiniBand. Further, the throughput of small Get operations can be improved by a factor of six when compared to Sockets over a 10 Gigabit Ethernet network. A similar factor of six improvement in throughput is observed over Sockets Direct Protocol using ConnectX QDR adapters. To the best of our knowledge, this is the first such Memcached design on high performance RDMA capable interconnects.
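As a minimal client-side sketch, the following uses the standard libmemcached API to show the Set/Get object semantics that the abstract argues map naturally onto RDMA. This is not the paper's RDMA code; the server address, port, and key/value contents are placeholders.

```c
/* Minimal libmemcached client sketch (illustrative; not the paper's RDMA code).
 * Server address, port, and key/value contents are placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_return_t rc;

    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    const char *key = "profile:1001";
    const char *val = "cached database row";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len;
    uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("got %zu bytes: %.*s\n", len, (int)len, out);
        free(out);                /* memcached_get returns malloc'ed memory */
    }

    memcached_free(memc);
    return 0;
}
```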
Computer Science - Research and Development | 2011
Hao Wang; Sreeram Potluri; Miao Luo; Ashish Kumar Singh; Sayantan Sur; Dhabaleswar K. Panda
Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Applications executing on a cluster with GPUs have to manage data movement using CUDA in addition to MPI, the de-facto parallel programming standard. Currently, data movement with CUDA and MPI libraries is not integrated and is not as efficient as possible. In addition, MPI-2 one-sided communication does not work for windows in GPU memory, as there is no way to remotely get or put data from GPU memory in a one-sided manner. In this paper, we propose a novel MPI design that integrates CUDA data movement transparently with MPI. The programmer is presented with one MPI interface that can communicate to and from GPUs. Data movement between the GPU and the network can now be overlapped. The proposed design is incorporated into the MVAPICH2 library. To the best of our knowledge, this is the first work of its kind to enable advanced MPI features and optimized pipelining in a widely used MPI library. We observe up to 45% improvement in one-way latency. In addition, we show that collective communication performance can be improved significantly: 32%, 37% and 30% improvement for the Scatter, Gather and Alltoall collective operations, respectively. Further, we enable MPI-2 one-sided communication with GPUs. We observe up to 45% improvement for Put and Get operations.
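The usage model described above can be sketched as below: a device pointer is passed directly to MPI send/receive calls, assuming a CUDA-aware MPI build such as MVAPICH2. This is illustrative rather than the paper's code; buffer size and device selection are arbitrary.

```c
/* Sketch: MPI point-to-point issued directly on GPU device memory, assuming a
 * CUDA-aware MPI (e.g., MVAPICH2). Run with two or more processes. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 1 << 20;               /* 1M floats */
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    cudaSetDevice(0);                    /* one GPU per process assumed */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        /* A device pointer is passed straight to MPI; the runtime pipelines
         * device-host copies with network transfers internally. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```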
international parallel and distributed processing symposium | 2012
Jian Huang; Jithin Jose; Md. Wasi-ur-Rahman; Hao Wang; Miao Luo; Hari Subramoni; Chet Murthy; Dhabaleswar K. Panda
HBase is an open-source distributed Key/Value store based on the idea of BigTable. It is being used in many data-center applications (e.g., Facebook, Twitter) because of its portability and massive scalability. For this kind of system, low latency and high throughput are expected when supporting services for large-scale concurrent accesses. However, the existing HBase implementation is built upon the Java Sockets Interface, which provides sub-optimal performance due to the overhead of providing cross-platform portability. The byte-stream oriented Java sockets semantics limit the possibility of leveraging new generations of network technologies. This makes it hard to provide high performance services for data-intensive applications. The High Performance Computing (HPC) domain has exploited high performance and low latency networks such as InfiniBand for many years. These interconnects provide advanced network features, such as Remote Direct Memory Access (RDMA), to achieve high throughput and low latency along with low CPU utilization. RDMA follows memory-block semantics, which can be adopted efficiently to satisfy the object transmission primitives used in HBase. In this paper, we present a novel design of HBase for RDMA capable networks via the Java Native Interface (JNI). Our design extends the existing open-source HBase software and makes it RDMA capable. Our performance evaluation reveals that the latency of HBase Get operations of 1 KB message size can be reduced to 43.7 μs with the new design on a QDR platform (32 Gbps). This is about a factor of 3.5 improvement over a 10 Gigabit Ethernet (10 GigE) network with TCP Offload. Throughput evaluations using four HBase region servers and 64 clients indicate that the new design boosts throughput by 3X over 1 GigE and 10 GigE networks. To the best of our knowledge, this is the first HBase design utilizing high performance RDMA capable interconnects.
international conference on cluster computing | 2009
Hari Subramoni; Ping Lai; Miao Luo; Dhabaleswar K. Panda
Though convergence has been a buzzword in the networking industry for some time now, no vendor has successfully brought out a solution which combines the ubiquitous nature of Ethernet with the low latency and high performance capabilities that InfiniBand offers. Most of the overlay protocols introduced in the past have had to bear some form of performance trade-off or overhead. Recent advances in InfiniBand interconnect technology have allowed vendors to come out with a new model for network convergence: RDMA over Ethernet (RDMAoE). In this model, the IB packets are encapsulated into Ethernet frames, thereby allowing us to transmit them seamlessly over an Ethernet network. The job of translating InfiniBand addresses to Ethernet addresses and back is taken care of by the InfiniBand HCA. This model allows end users access to large computational clusters through the use of ubiquitous Ethernet interconnect technology while retaining the high performance, low latency guarantees that InfiniBand provides. In this paper, we present a detailed evaluation and analysis of the new RDMAoE protocol as opposed to the earlier overlay protocols as well as native-IB and socket based implementations. Through these evaluations, we also look at whether RDMAoE brings us closer to the eventual goal of network convergence. The experimental results obtained with verbs, MPI, application and data center level evaluations show that RDMAoE is capable of providing performance comparable to native-IB based applications on a standard 10GigE network.
international conference on parallel processing | 2012
Jithin Jose; Krishna Chaitanya Kandalla; Miao Luo; Dhabaleswar K. Panda
Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. OpenSHMEM is a library-based implementation of the PGAS model and it aims to standardize the SHMEM model to achieve performance, programmability and portability. However, OpenSHMEM is an emerging standard and it is unlikely that an entire application will be re-written with it. Instead, it is more likely that applications will continue to be written with MPI as the primary model, but parts of them will be re-designed with newer models. This requires the underlying communication libraries to be designed with support for multiple programming models. In this paper, we propose a high performance, scalable unified communication library that supports both MPI and OpenSHMEM for InfiniBand clusters. To the best of our knowledge, this is the first effort in unifying MPI and OpenSHMEM communication libraries. Our proposed designs take advantage of InfiniBand's advanced features to significantly improve the communication performance of various atomic and collective operations defined in the OpenSHMEM specification. Hybrid (MPI+OpenSHMEM) parallel applications can benefit from our proposed library to achieve better efficiency and scalability. Our studies show that our proposed designs can improve the performance of OpenSHMEM's atomic operations and collective operations by up to 41%. We observe that our designs improve the performance of the 2D-Heat Modeling benchmark (pure OpenSHMEM) by up to 45%. We also observe that our unified communication library can improve the performance of the hybrid (MPI+OpenSHMEM) version of the Graph500 benchmark by up to 35%. Moreover, our studies also indicate that our proposed designs lead to lower memory consumption due to efficient utilization of the network resources.
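A minimal sketch of the hybrid MPI+OpenSHMEM pattern the abstract targets is shown below, assuming a unified runtime (such as MVAPICH2-X) that supports both models in one program. The variable names and phases are illustrative, not taken from the paper.

```c
/* Hybrid MPI + OpenSHMEM sketch over an assumed unified runtime. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

static int counter = 0;   /* symmetric data object, accessible from all PEs */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();          /* initialization order may depend on the runtime */

    int me = shmem_my_pe();

    /* Irregular phase: one-sided OpenSHMEM atomic targeting PE 0
     * (shmem_int_atomic_inc in newer versions of the specification). */
    shmem_int_inc(&counter, 0);
    shmem_barrier_all();

    /* Regular, bulk-synchronous phase: an MPI collective. */
    int one = 1, sum = 0;
    MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("counter on PE 0: %d, MPI sum: %d\n", counter, sum);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```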
international conference on parallel processing | 2012
Nusrat Sharmin Islam; Raghunath Rajachandrasekar; Jithin Jose; Miao Luo; Hao Wang; Dhabaleswar K. Panda
Many applications cache huge amounts of data in RAM to achieve high performance. A good example is Memcached, a distributed-memory object-caching software. Memcached performance directly depends on the aggregated memory pool size. Given the constraints of hardware cost, power/thermal concerns and floor plan limits, it is difficult to further scale the memory pool by packing more RAM into individual servers, or by expanding the server array horizontally. In this paper, we propose an SSD-Assisted Hybrid Memory that expands RAM with SSD to make available a large amount of memory. Hybrid memory works as an object cache and it manages resource allocation at object granularity, which is more efficient than allocation at page granularity. It leverages the SSD's fast random read property to achieve low latency object access. It organizes the SSD into a log-structured sequence of blocks to overcome SSD writing anomalies. Compared to alternatives that use the SSD as a virtual memory swap device, hybrid memory reduces the random access latency by 68% and 72% for read and write operations, respectively, and improves operation throughput by 15.3 times. Additionally, it reduces write traffic to the SSD by 81%, which implies a 5.3 times improvement in SSD lifetime. We have integrated our hybrid memory design into Memcached. Our experiments indicate a 3.7X reduction in Memcached Get operation latency and up to 5.3X improvement in operation throughput. To the best of our knowledge, this paper is the first work that integrates cutting-edge SSDs and InfiniBand verbs into Memcached to accelerate its performance.
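The log-structured idea described above can be illustrated with the purely hypothetical sketch below: a small in-RAM index maps object keys to offsets in an append-only log file on the SSD, so writes stay sequential while reads use the SSD's fast random-read path at object granularity. All names, sizes, and the flat index are illustrative assumptions, not the paper's implementation.

```c
/* Hypothetical sketch of an SSD-backed, log-structured object store. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_OBJS 1024

struct obj_entry {                 /* in-RAM index entry */
    char     key[32];
    off_t    log_offset;           /* where the object lives in the SSD log */
    uint32_t length;
};

static struct obj_entry index_tab[MAX_OBJS];
static int    n_objs;
static int    log_fd;              /* append-only log file on the SSD */
static off_t  log_tail;            /* next append position */

/* Objects spilled from RAM are appended at the log tail (sequential write). */
static int ssd_put(const char *key, const void *val, uint32_t len)
{
    if (n_objs >= MAX_OBJS) return -1;
    if (pwrite(log_fd, val, len, log_tail) != (ssize_t)len) return -1;
    strncpy(index_tab[n_objs].key, key, sizeof(index_tab[n_objs].key) - 1);
    index_tab[n_objs].log_offset = log_tail;
    index_tab[n_objs].length = len;
    n_objs++;
    log_tail += len;
    return 0;
}

/* Reads need only one random read per object, at object granularity. */
static ssize_t ssd_get(const char *key, void *buf, uint32_t buflen)
{
    for (int i = 0; i < n_objs; i++)
        if (strcmp(index_tab[i].key, key) == 0 && index_tab[i].length <= buflen)
            return pread(log_fd, buf, index_tab[i].length,
                         index_tab[i].log_offset);
    return -1;
}

int main(void)
{
    char out[16];
    log_fd = open("hybridmem.log", O_RDWR | O_CREAT, 0600);
    ssd_put("user:42", "hello", 5);
    ssd_get("user:42", out, sizeof(out));
    close(log_fd);
    return 0;
}
```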
international conference on supercomputing | 2012
Miao Luo; Dhabaleswar K. Panda; Khaled Z. Ibrahim; Costin Iancu
Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo, which employs a reactive approach, e.g., congestion control mechanisms are activated only when resources have been exhausted. We present a core-stateless optimization approach based on open-loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and the Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvements, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI, respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.
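The end-point throttling idea can be sketched at the application level as follows: cap the number of outstanding nonblocking sends per process and drain the window before injecting more messages. This is a simplified illustration with an assumed window size; the paper implements throttling inside the UPC runtimes rather than in user code.

```c
/* Sketch of message-in-flight throttling with an assumed window size.
 * Run with two or more processes. */
#include <mpi.h>
#include <stdlib.h>

#define MAX_INFLIGHT 8      /* throttle window: assumed value, tuned per system */
#define NMSGS        1000
#define MSG_SIZE     4096

int main(int argc, char **argv)
{
    int rank, size;
    char *buf = calloc(MSG_SIZE, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    if (rank == 0) {
        MPI_Request reqs[MAX_INFLIGHT];
        int inflight = 0;
        for (int i = 0; i < NMSGS; i++) {
            if (inflight == MAX_INFLIGHT) {    /* window full: drain one slot */
                int idx;
                MPI_Waitany(inflight, reqs, &idx, MPI_STATUS_IGNORE);
                reqs[idx] = reqs[--inflight];  /* compact the window */
            }
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                      &reqs[inflight++]);
        }
        MPI_Waitall(inflight, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int i = 0; i < NMSGS; i++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```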
ieee international conference on high performance computing, data, and analytics | 2011
Miao Luo; Jithin Jose; Sayantan Sur; Dhabaleswar K. Panda
Multi-core architectures are becoming more and more popular in the HEC (High End Computing) era. Recent trends of high-productivity computing, in conjunction with advanced multi-core and network architectures, have increased interest in Partitioned Global Address Space (PGAS) languages, due to their high productivity and broad applicability. Unified Parallel C (UPC) is an emerging PGAS language. In this paper, we compare different design alternatives for a high-performance and scalable UPC runtime on multi-core nodes from several aspects: performance, portability, interoperability and support for irregular parallelism. Based on our analysis, we present a novel design of a multi-threaded UPC runtime that supports multiple endpoints. Our runtime is able to dramatically decrease network access contention, resulting in 80% lower latency for fine-grained memget/memput operations and almost double the bandwidth for medium size messages, compared to the multi-threaded Berkeley UPC Runtime. Furthermore, the multi-endpoint design opens up new doors for runtime optimizations, such as support for irregular parallelism. We utilize true network helper threads and load balancing via work stealing in the runtime. Our evaluation with novel benchmarks shows that our runtime can achieve 90% of the peak efficiency, a factor of 1.3 better than the existing Berkeley UPC Runtime. To the best of our knowledge, this is the first work in which a multi-network-endpoint-capable UPC runtime design is proposed for modern multi-core systems.
international conference on cluster computing | 2009
Matthew J. Koop; Miao Luo; Dhabaleswar K. Panda
Multi-core systems are now extremely common in modern clusters. In the past, commodity systems may have had up to two or four CPUs per compute node. In modern clusters, these systems still have the same number of CPUs; however, these CPUs have moved from single-core to quad-core, and further advances are imminent. To obtain the best performance, compute nodes in a cluster are connected with high-performance interconnects. On nearly all clusters, current multi-core systems have the same number of network interfaces as in the past, when there were fewer cores per node. Although these networks have increased bandwidth with the shift to multi-core, severe network contention still exists for some application patterns. In this work we propose mixed workload (non-exclusive) scheduling of jobs to increase network efficiency and reduce contention. As a case study we use Message Passing Interface (MPI) programs on the InfiniBand interconnect. We show through detailed profiling of the network that the network and CPU access patterns of some applications are complementary to each other and lead to increased network efficiency and overall application performance improvement. We show improvements of 20% and more for some of the NAS Parallel Benchmarks on quad-socket, quad-core AMD systems.
acm sigplan symposium on principles and practice of parallel programming | 2014
Miao Luo; Xiaoyi Lu; Khaled Hamidouche; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda
State-of-the-art MPI libraries rely on locks to guarantee thread safety. This discourages application developers from using multiple threads to perform MPI operations. In this paper, we propose a high performance, lock-free multi-endpoint MPI runtime, which can achieve up to 40% improvement for point-to-point operations and one representative collective operation with minimal or no modifications to existing applications.
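A minimal sketch of the usage pattern such a runtime targets is shown below: several OpenMP threads per process issue MPI point-to-point calls concurrently under MPI_THREAD_MULTIPLE. The lock-free, multi-endpoint path described in the paper sits below the MPI interface and requires no change to code like this; thread count and message size here are arbitrary.

```c
/* Multithreaded MPI point-to-point sketch under MPI_THREAD_MULTIPLE.
 * Run with two or more processes; compile with OpenMP support. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define COUNT    1024

int main(int argc, char **argv)
{
    int provided, rank, size;
    int buf[NTHREADS][COUNT];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (provided < MPI_THREAD_MULTIPLE || size < 2) {
        MPI_Finalize();
        return 1;
    }

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();   /* thread id doubles as message tag */
        if (rank == 0)
            MPI_Send(buf[tid], COUNT, MPI_INT, 1, tid, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf[tid], COUNT, MPI_INT, 0, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    if (rank == 1)
        printf("received %d messages from %d concurrent threads\n",
               NTHREADS, NTHREADS);

    MPI_Finalize();
    return 0;
}
```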