Ping Lai | Researchain

Publication

Featured research published by Ping Lai.

high performance computational finance | 2008

Design and evaluation of benchmarks for financial applications using Advanced Message Queuing Protocol (AMQP) over InfiniBand

Hari Subramoni; Gregory Marsh; Sundeep Narravula; Ping Lai; Dhabaleswar K. Panda

Message oriented middleware (MOM) is a key technology in financial market data delivery. In this context we study the Advanced Message Queuing Protocol (AMQP), an emerging open standard for MOM communication. We design a basic suite of benchmarks for AMQP's Direct, Fanout, and Topic Exchange types. We then evaluate these benchmarks with Apache Qpid, an open source implementation of AMQP. In order to observe how AMQP performs in a real-life scenario, we also perform evaluations with a simulated stock exchange application. All our evaluations are performed over InfiniBand as well as 1 Gigabit Ethernet networks. Our results indicate that in order to achieve the high scalability requirements demanded by high performance computational finance applications, we need to use modern communication protocols, like RDMA, which place less processing load on the host. We also find that the centralized architecture of AMQP presents a considerable bottleneck as far as scalability is concerned.
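For readers unfamiliar with the three exchange types named in this abstract, the following is a minimal illustrative sketch of how each one selects destination queues. It is not taken from the paper or from Apache Qpid; the function names `topic_matches` and `route`, the queue names, and the routing keys are all hypothetical. It assumes the usual AMQP topic wildcards, where `*` stands for exactly one dot-separated word and `#` for zero or more words.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a dot-separated routing key matches a topic binding pattern."""

    def match(p, k):
        if not p:
            # Pattern exhausted: match only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            # '*' absorbs exactly one word; a literal must match exactly.
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


def route(exchange_type: str, bindings, routing_key: str):
    """Return the sorted queue names that receive a message.

    bindings is a list of (queue_name, binding_key) pairs (hypothetical layout).
    """
    if exchange_type == "fanout":
        # Fanout ignores the routing key: every bound queue gets a copy.
        matched = [queue for queue, _ in bindings]
    elif exchange_type == "direct":
        # Direct requires an exact match between binding key and routing key.
        matched = [queue for queue, key in bindings if key == routing_key]
    elif exchange_type == "topic":
        # Topic applies wildcard matching to the binding key.
        matched = [queue for queue, key in bindings if topic_matches(key, routing_key)]
    else:
        raise ValueError(f"unknown exchange type: {exchange_type}")
    return sorted(set(matched))


# Hypothetical bindings loosely modelled on a stock-quote feed.
BINDINGS = [
    ("nyse", "stock.nyse.*"),
    ("all_stock", "stock.#"),
    ("ibm", "stock.nyse.ibm"),
]
```

In this sketch, a message published with key `stock.nyse.ibm` reaches only the `ibm` queue through a direct exchange, but all three queues through a topic exchange. In every case the message passes through the single routing point, which is the kind of centralization the abstract identifies as a scalability concern.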

international conference on cluster computing | 2009

RDMA over Ethernet — A preliminary study

Hari Subramoni; Ping Lai; Miao Luo; Dhabaleswar K. Panda

Though convergence has been a buzzword in the networking industry for some time now, no vendor has successfully brought out a solution which combines the ubiquitous nature of Ethernet with the low latency and high performance capabilities that InfiniBand offers. Most of the overlay protocols introduced in the past have had to bear with some form of performance trade-off or overhead. Recent advances in InfiniBand interconnect technology have allowed vendors to come out with a new model for network convergence: RDMA over Ethernet (RDMAoE). In this model, the IB packets are encapsulated into Ethernet frames, thereby allowing us to transmit them seamlessly over an Ethernet network. The job of translating InfiniBand addresses to Ethernet addresses and back is taken care of by the InfiniBand HCA. This model allows end users access to large computational clusters through the use of ubiquitous Ethernet interconnect technology while retaining the high performance, low latency guarantees that InfiniBand provides. In this paper, we present a detailed evaluation and analysis of the new RDMAoE protocol as opposed to the earlier overlay protocols as well as native-IB and socket based implementations. Through these evaluations, we also look at whether RDMAoE brings us closer to the eventual goal of network convergence. The experimental results obtained with verbs, MPI, application and data center level evaluations show that RDMAoE is capable of providing performance comparable to native-IB based applications on a standard 10GigE network.

international conference on parallel processing | 2008

Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems

Lei Chai; Ping Lai; Hyun-Wook Jin; Dhabaleswar K. Panda

The emergence of multi-core processors has made MPI intra-node communication a critical component in high performance computing. In this paper, we use a three-step methodology to design an efficient MPI intra-node communication scheme from two popular approaches: shared memory and OS kernel-assisted direct copy. We use an Intel quad-core cluster for our study. We first run micro-benchmarks to analyze the advantages and limitations of these two approaches, including the impacts of processor topology, communication buffer reuse, process skew effects, and L2 cache utilization. Based on the results and the analysis, we propose topology-aware and skew-aware thresholds to build an optimized hybrid approach. Finally, we evaluate the impact of the hybrid approach on MPI collective operations and applications using IMB, NAS, PSTSWM, and HPL benchmarks. We observe that the optimized hybrid approach can improve the performance of MPI collective operations by up to 60%, and applications by up to 17%.
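The idea of a topology-aware and skew-aware threshold can be sketched as a simple selection rule: small messages go through shared memory, large ones through the kernel-assisted direct copy, and the switch-over point depends on where the two processes sit and on whether skew is observed. The sketch below is illustrative only. The threshold values, the topology labels, the `SKEW_MULTIPLIER`, and the direction in which skew shifts the threshold are all assumptions made for this example, not figures or design decisions from the paper.

```python
# Hypothetical switch-over points in bytes, keyed by how the two
# communicating cores are related. Values are invented for illustration.
BASE_THRESHOLD_BYTES = {
    "shared_l2": 32 * 1024,     # cores sharing an L2 cache
    "intra_socket": 16 * 1024,  # same socket, no shared L2
    "inter_socket": 8 * 1024,   # different sockets
}

# Assumption: under process skew the threshold is raised so that more
# messages take the shared-memory path. The factor is hypothetical.
SKEW_MULTIPLIER = 2


def choose_path(message_bytes: int, topology: str, skew_detected: bool = False) -> str:
    """Pick a transfer mechanism for one intra-node message."""
    threshold = BASE_THRESHOLD_BYTES[topology]
    if skew_detected:
        threshold *= SKEW_MULTIPLIER
    if message_bytes < threshold:
        return "shared_memory"
    return "kernel_direct_copy"
```

With these made-up numbers, a 12 KiB message between sockets would use the kernel-assisted copy by default and fall back to shared memory when skew is flagged. In practice such thresholds would be derived from micro-benchmark measurements on the target machine, as the abstract describes.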

grid computing | 2010

High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand

Hari Subramoni; Ping Lai; Rajkumar Kettimuthu; Dhabaleswar K. Panda

GridFTP, designed using the Globus XIO framework, is one of the most popular methods in use to perform data transfers in the grid environment. But the performance of GridFTP in WAN is limited by the relatively low communication bandwidth offered by the existing network protocols. On the other hand, modern interconnects such as InfiniBand, with many advanced communication features like zero-copy protocol and RDMA operations, can greatly improve communication efficiency. In this paper, we take on the challenge of combining the ease of use of the Globus XIO framework and the high performance achieved through InfiniBand communication, thereby natively supporting GridFTP over InfiniBand based networks. The Advanced Data Transfer Service (ADTS), designed in our previous work, provides the low level InfiniBand support to the Globus XIO layer. We introduce the concepts of I/O staging in the Globus XIO ADTS driver to achieve efficient disk based data transfers. We evaluate our designs in both LAN and WAN environments using micro-benchmarks as well as communication traces from several real world applications. We also provide insights into the communication performance with some in-depth analysis. Our experimental evaluation shows a performance improvement of up to 100% for ADTS based data transfers as opposed to TCP or UDP based ones in LAN and high delay WAN scenarios.

international conference on supercomputing | 2010

Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application

Sreeram Potluri; Ping Lai; Karen Tomko; Sayantan Sur; Yifeng Cui; Mahidhar Tatineni; Karl W. Schulz; William L. Barth; Amitava Majumdar; Dhabaleswar K. Panda

AWM-Olsen is a widely used ground motion simulation code based on a parallel finite difference solution of the 3-D velocity-stress wave equation. This application runs on tens of thousands of cores and consumes several million CPU hours on the TeraGrid clusters every year. A significant portion of its run-time (37% in a 4,096 process run) is spent in MPI communication routines. Hence, it demands an optimized communication design coupled with a low-latency, high-bandwidth network and an efficient communication subsystem for good performance. In this paper, we analyze the performance bottlenecks of the application with regard to the time spent in MPI communication calls. We find that much of this time can be overlapped with computation using MPI non-blocking calls. We use both two-sided and MPI-2 one-sided communication semantics to re-design the communication in AWM-Olsen. We find that with our new design, using MPI-2 one-sided communication semantics, the entire application can be sped up by 12% at 4K processes and by 10% at 8K processes on a state-of-the-art InfiniBand cluster, Ranger at the Texas Advanced Computing Center (TACC).
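The figures in this abstract can be related with a small back-of-the-envelope model. The sketch below is an illustration, not the paper's analysis. It assumes that total run-time is normalized to 1, that computation time is unchanged by the redesign, and that "sped up by 12%" means a 12% reduction in run-time; the abstract does not state which definition of speedup is used. The function names are hypothetical.

```python
def runtime_after_overlap(comm_fraction: float, hidden_fraction: float) -> float:
    """Normalized run-time once part of the communication is hidden.

    comm_fraction   -- share of the original run-time spent in communication
    hidden_fraction -- share of that communication overlapped with computation
    """
    return 1.0 - comm_fraction * hidden_fraction


def hidden_fraction_for_reduction(comm_fraction: float, runtime_reduction: float) -> float:
    """Share of communication that must be hidden to cut run-time by runtime_reduction."""
    return runtime_reduction / comm_fraction


# Best case under this model: all of the 37% communication time is hidden.
best_case_runtime = runtime_after_overlap(0.37, 1.0)

# Share of communication time implied by a 12% run-time reduction.
implied_hidden = hidden_fraction_for_reduction(0.37, 0.12)
```

Under these assumptions, fully hiding communication would leave about 63% of the original run-time, and a 12% reduction corresponds to hiding roughly a third of the communication time.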

international conference on parallel processing | 2009

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand

Ping Lai; Hari Subramoni; Sundeep Narravula; Amith R. Mamidala; Dhabaleswar K. Panda

InfiniBand, 10 Gigabit Ethernet/iWARP and IB WAN extensions are rapidly gaining momentum for designing high end computing clusters and data centers. For typical applications such as data staging, content replication and remote site backup, FTP has been the most popular method to transfer data within and across these clusters. Although the existing sockets based FTP approaches can be transparently used in these systems through protocols like IPoIB or SDP, their performance and scalability are limited due to the additional interaction overhead and unoptimized protocol processing. This raises the challenge of how to design more efficient FTP mechanisms by leveraging the advanced features of modern interconnects. In this paper we design a new Advanced Data Transfer Service (ADTS) with capabilities such as zero-copy data transfer, memory registration cache, persistent data sessions and pipelined data transfer to enable efficient zero-copy data transfers over IB and iWARP equipped LAN and WAN. We then utilize ADTS to design a high performance FTP library (FTP-ADTS). From our experimental results, we observe that our design outperforms existing sockets based approaches by more than 95% in transferring large volumes of data over LAN. It also provides significantly better performance at much lower (by up to a factor of 6) CPU utilization in various IB WAN scenarios. These results show a promising future for designing high performance communication protocols to power the efficiency and scalability of next-generation parallel and distributed environments.

Computer Science - Research and Development | 2010

Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems

Ping Lai; Sayantan Sur; Dhabaleswar K. Panda

The increasing popularity of multi-core processors has made MPI intra-node communication, including intra-node RMA (Remote Memory Access) communication, a critical component in high performance computing. The MPI-2 RMA model includes one-sided data transfer and synchronization operations. Existing designs in popularly used MPI stacks do not provide truly one-sided intra-node RMA communication. They are built on top of two-sided send-receive operations, and therefore suffer from the overheads of two-sided communication and a dependency on the remote side. In this paper, we enhance existing shared memory mechanisms to design truly one-sided synchronization. In addition, we design truly one-sided intra-node data transfer using two kernel based direct copy alternatives: a basic kernel-assisted approach and an I/OAT-assisted approach. Our new design eliminates the overhead of using two-sided operations and eliminates the involvement of the remote side. We also propose a series of benchmarks to evaluate various performance aspects over multi-core architectures (Intel Clovertown, Intel Nehalem and AMD Barcelona). The results show that the new design obtains up to 39% lower latency for small and medium messages and demonstrates a 29% improvement in large message bandwidth. Moreover, it provides superior performance in terms of better scalability, reduced cache misses, higher resilience to process skew and increased computation and communication overlap. Finally, a performance benefit of up to 10% is demonstrated for a real scientific application, AWM-Olsen.

international conference on parallel processing | 2008

Performance of HPC Middleware over InfiniBand WAN

Sundeep Narravula; Hari Subramoni; Ping Lai; Ranjit Noronha; Dhabaleswar K. Panda

High performance interconnects such as InfiniBand (IB) have enabled large scale deployments of High Performance Computing (HPC) systems. High performance communication and IO middleware such as MPI and NFS over RDMA have also been redesigned to leverage the performance of these modern interconnects. With the advent of long haul InfiniBand (IB WAN), IB applications now have inter-cluster reaches. While this technology is intended to enable high performance network connectivity across WAN links, it is important to study and characterize the actual performance that the existing IB middleware achieve in these emerging IB WAN scenarios. In this paper, we study and analyze the performance characteristics of the following three HPC middleware: (i) IPoIB (IP traffic over IB), (ii) MPI and (iii) NFS over RDMA. We utilize the Obsidian IB WAN routers for inter-cluster connectivity. Our results show that many of the applications absorb smaller network delays fairly well. However, most approaches get severely impacted in high delay scenarios. Further, communication protocols need to be optimized in higher delay scenarios to improve the performance. In this paper, we propose several such optimizations to improve communication performance. Our experimental results show that techniques such as WAN-aware protocols, transferring data using large messages (message coalescing) and using parallel data streams can improve the communication performance (up to 50%) in high delay scenarios. Overall, these results demonstrate that IB WAN technologies can enable cluster-of-clusters architecture as a feasible platform for HPC systems.

international conference on parallel processing | 2010

Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters

Hari Subramoni; Ping Lai; Sayantan Sur; Dhabaleswar K. Panda

Network congestion is an important factor affecting the performance of large scale jobs in supercomputing clusters, especially with the wide deployment of multi-core processors. The blocking nature of current day collectives makes such congestion a critical factor in their performance. On the other hand, modern interconnects like InfiniBand provide us with many novel features such as Virtual Lanes aimed at delivering better performance to end applications. Theoretical research in the field of network congestion indicates Head of Line (HoL) blocking as a common cause of congestion and the use of multiple virtual lanes as one of the ways to alleviate it. In this context, we make use of the multiple virtual lanes provided by the InfiniBand standard as a means to alleviate network congestion and thereby improve the performance of various high performance computing applications on modern multi-core clusters. We integrate our scheme into the MVAPICH2 MPI library. To the best of our knowledge, this is the first such implementation that takes advantage of the use of multiple virtual lanes at the MPI level. We perform various experiments at the native InfiniBand, microbenchmark and application levels. The results of our experimental evaluation show that the use of multiple virtual lanes can improve the predictability of message arrival by up to 10 times in the presence of network congestion. Our microbenchmark level evaluation with multiple communication streams shows that the use of multiple virtual lanes can improve the bandwidth / latency / message rate of medium sized messages by up to 13%. Through the use of multiple virtual lanes, we are also able to improve the performance of the Alltoall collective operation for medium message sizes by up to 20%. Performance improvement of up to 12% is also observed for the Alltoall collective operation through segregation of traffic into multiple virtual lanes when multiple jobs compete for the same network resource. We also see that our scheme can improve the performance of collective operations used inside the CPMD application by 11% and the overall performance of the CPMD application itself by up to 6%.

cluster computing and the grid | 2008

Optimized Distributed Data Sharing Substrate in Multi-core Commodity Clusters: A Comprehensive Study with Applications

Karthikeyan Vaidyanathan; Ping Lai; Sundeep Narravula; Dhabaleswar K. Panda

Distributed applications tend to have a complex design due to issues such as concurrency, synchronization and communication. Researchers in the past have proposed simpler abstractions to hide these complexities. However, many of the proposed techniques use messaging protocols which incur high overhead and are not very scalable. To address these limitations, in our previous work [20], we proposed an efficient Distributed Data Sharing Substrate (DDSS) using the features of high-speed networks. In this paper, we propose several design optimizations for DDSS in multi-core systems, such as the combination of shared memory and message queues for inter-process communication, and a dedicated thread for communication progress and for onloading DDSS operations such as get and put. Our micro-benchmark results not only show a very low latency in DDSS operations but also demonstrate the scalability of DDSS with an increasing number of processes. Application evaluations with R-Tree and B-Tree query processing and distributed STORM show improvements of up to 56%, 45% and 44%, respectively, as compared to traditional implementations. Evaluations with application checkpointing using DDSS demonstrate scalability with an increasing number of checkpointing applications. Further, in our evaluations, we demonstrate the portability of DDSS across multiple modern interconnects including InfiniBand and iWARP-capable 10-Gigabit Ethernet networks (applicable to both LAN and WAN environments).
