Network

Latest external collaborations at the country level.

Hotspot

Research topics in which Jiuxing Liu is active.

Publication


Featured research published by Jiuxing Liu.


international conference on supercomputing | 2006

A case for high performance computing with virtual machines

Wei Huang; Jiuxing Liu; Bulent Abali; Dhabaleswar K. Panda

Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images. In this paper we present a case for HPC with virtual machines by introducing a framework which addresses the performance and management overhead associated with VM-based computing. Two key ideas in our design are: Virtual Machine Monitor (VMM) bypass I/O and scalable VM image management. VMM-bypass I/O achieves high communication performance for VMs by exploiting the OS-bypass feature of modern high speed interconnects such as InfiniBand. Scalable VM image management significantly reduces the overhead of distributing and managing VMs in large scale clusters. Our current implementation is based on the Xen VM environment and InfiniBand. However, many of our ideas are readily applicable to other VM environments and high speed interconnects. We carry out detailed analysis on the performance and management overhead of our VM-based HPC framework. Our evaluation shows that HPC applications can achieve almost the same performance as those running in a native, non-virtualized environment. Therefore, our approach holds promise to bring the benefits of VMs to HPC applications with very little degradation in performance.


international conference on supercomputing | 2003

High performance RDMA-based MPI implementation over InfiniBand

Jiuxing Liu; Jiesheng Wu; Sushmitha P. Kini; Pete Wyckoff; Dhabaleswar K. Panda

Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features that help improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 million bytes (831 megabytes) per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.
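
The one-sided operation that small and control messages are mapped onto can be illustrated with a short libibverbs sketch. This is not the authors' implementation; it assumes a connected queue pair qp, a registered local buffer buf/mr, and the peer's remote address and rkey exchanged out of band.

    /* Minimal sketch (not the paper's code): post one RDMA write with
     * libibverbs, the kind of operation RDMA-based MPI uses for small
     * and control messages. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_rdma_write(struct ibv_qp *qp, void *buf, size_t len,
                               struct ibv_mr *mr, uint64_t remote_addr,
                               uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,        /* registered local buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write */
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* peer's target buffer */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
    }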


international conference on cluster computing | 2007

High performance virtual machine migration with RDMA over modern interconnects

Wei Huang; Qi Gao; Jiuxing Liu; Dhabaleswar K. Panda

One of the most useful features provided by virtual machine (VM) technologies is the ability to migrate running OS instances across distinct physical nodes. As a basis for many administration tools in modern clusters and data-centers, VM migration needs to be extremely efficient to reduce both migration time and performance impact on hosted applications. Currently, most VM environments use the socket interface and the TCP/IP protocol to transfer VM migration traffic. In this paper, we propose a high performance VM migration design by using RDMA (Remote Direct Memory Access). RDMA is a feature provided by many modern high speed interconnects that are currently being widely deployed in data-centers and clusters. By taking advantage of the low software overhead and the one-sided nature of RDMA, our design significantly improves the efficiency of VM migration. We also contribute a set of micro-benchmarks and application-level benchmarks that evaluate important metrics of VM migration. The evaluations using our prototype implementation over Xen and InfiniBand show that RDMA can drastically reduce the migration overhead: up to 80% on total migration time and up to 77% on application-observed downtime.


international parallel and distributed processing symposium | 2004

Design and implementation of MPICH2 over InfiniBand with RDMA support

Jiuxing Liu; Weihang Jiang; Pete Wyckoff; Dhabaleswar K. Panda; David Ashton; Darius Buntinas; William Gropp; Brian R. Toonen

For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences in designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit remote direct memory access (RDMA) in InfiniBand to achieve high performance. We have based our design on the RDMA channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS parallel benchmarks. Our optimized MPICH2 implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.
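
The "very small set of functions" idea can be pictured as a table of function pointers that each port fills in. The sketch below is hypothetical: the names and signatures are illustrative and do not come from MPICH2's actual RDMA channel interface; it only shows how architecture-dependent put/get/poll routines could hide behind one small struct.

    /* Hypothetical sketch of a minimal channel interface; not MPICH2's
     * real RDMA channel API. */
    #include <stddef.h>

    struct channel_ops {
        /* write `len` bytes of `buf` into the peer's receive region */
        int (*rdma_put)(int peer, const void *buf, size_t len);
        /* read `len` bytes from the peer's send region into `buf` */
        int (*rdma_get)(int peer, void *buf, size_t len);
        /* poll for completed transfers; returns number completed */
        int (*poll)(void);
    };

    /* An InfiniBand port would fill this in with verbs-based routines;
     * a shared-memory port could supply memcpy-based ones instead. */
    extern struct channel_ops ib_channel_ops;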


international symposium on microarchitecture | 2004

Microbenchmark performance comparison of high-speed cluster interconnects

Jiuxing Liu; B. Chandrasekaran; Weikuan Yu; Jiesheng Wu; Darius Buntinas; Sushmitha P. Kini; Dhabaleswar K. Panda; Pete Wyckoff

Today's distributed and high-performance applications require high computational power and high communication performance. Recently, the computational power of commodity PCs has doubled about every 18 months. At the same time, network interconnects that provide very low latency and very high bandwidth are also emerging. This is a promising trend in building high-performance computing environments by clustering - combining the computational power of commodity PCs with the communication performance of high-speed network interconnects. There are several network interconnects that provide low latency and high bandwidth. Traditionally, researchers have used simple microbenchmarks, such as latency and bandwidth tests, to characterize a network interconnect's communication performance. Later, they proposed more sophisticated models such as LogP. However, these tests and models focus on general parallel computing systems and do not address many features present in these emerging commercial interconnects. Another way to evaluate different network interconnects is to use real-world applications. However, real applications usually run on top of a middleware layer such as the message passing interface (MPI). Our results show that to gain more insight into the performance characteristics of these interconnects, it is important to go beyond simple tests such as those for latency and bandwidth. In the future, we plan to expand our microbenchmark suite to include more tests and more interconnects.
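
A minimal example of the "simple microbenchmark" the abstract refers to is an MPI ping-pong latency test between two ranks. The sketch below is generic (not from the paper's suite) and assumes any installed MPI implementation; it reports the average one-way latency for one message size.

    /* Minimal MPI ping-pong latency microbenchmark: rank 0 and rank 1
     * bounce a small message and report average one-way latency. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000, size = 4;   /* 4-byte messages by default */
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(size);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg one-way latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }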


cluster computing and the grid | 2004

High performance MPI-2 one-sided communication over InfiniBand

Weihang Jiang; Jiuxing Liu; Hyun-Wook Jin; Dhabaleswar K. Panda; William Gropp; Rajeev Thakur

Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations. Although this approach can achieve good portability, it suffers from high communication overhead and dependency on the remote process for communication progress. To address these problems, we propose a high performance MPI-2 one-sided communication design over the InfiniBand Architecture. In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations. Our design has been implemented based on MPICH2 over InfiniBand. We present detailed design issues for this approach and perform a set of microbenchmarks to characterize different aspects of its performance. Our performance evaluation shows that compared with the design based on MPI send/receive, our design can improve throughput by up to 77%, and reduce latency and synchronization overhead by up to 19% and 13%, respectively. Under certain process skew, the new design can significantly reduce the adverse impact, from 41% to nearly 0%. It can also achieve better overlap of communication and computation.
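
At the MPI level, the operations being mapped onto RDMA look like the small MPI-2 one-sided example below: one rank writes into another rank's window with MPI_Put between two fences. This is standard MPI-2 usage shown only to illustrate the operations the paper maps to InfiniBand RDMA, not code from the paper.

    /* MPI-2 one-sided example: rank 0 puts a value into rank 1's window. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every rank exposes one int; only rank 1's copy is targeted below */
        MPI_Win_create(&local, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                    /* open access epoch */
        if (rank == 0) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, /* target rank */ 1,
                    /* target disp */ 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                    /* complete the epoch */

        if (rank == 1)
            printf("rank 1 received %d via MPI_Put\n", local);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }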


international parallel and distributed processing symposium | 2004

Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support

Jiuxing Liu; Amith R. Mamidala; Dhabaleswar K. Panda

Modern high performance applications require efficient and scalable collective communication operations. Currently, most collective operations are implemented based on point-to-point operations. We propose to use hardware multicast in InfiniBand to design fast and scalable broadcast operations in MPI. InfiniBand supports multicast only with the unreliable datagram (UD) transport service, which makes it hard to use directly from an upper layer such as MPI. To bridge the semantic gap between MPI_Bcast and InfiniBand hardware multicast, we have designed and implemented a substrate on top of InfiniBand which provides functionalities such as reliability, in-order delivery and large message handling. By using a sliding-window based design, we improve MPI_Bcast latency by moving most of the substrate overhead out of the communication critical path. By using optimizations such as a new co-root based scheme and delayed ACKs, we can further balance and reduce the overhead. We have also addressed many detailed design issues such as buffer management, efficient handling of out-of-order and duplicate messages, timeout and retransmission, flow control and RDMA-based ACK communication. Our performance evaluation shows that in an 8-node cluster testbed, hardware multicast based designs can improve MPI broadcast latency by up to 58% and broadcast throughput by up to 112%. The proposed solutions are also much more tolerant to process skew compared with the current point-to-point based implementation. We have also developed analytical models for our multicast based schemes and validated them with experimental numbers. Our analytical models show that with the new designs, one can achieve an MPI broadcast latency of 20.0 μs for small messages and 40.0 μs for a one-MTU-size message (around 1836 bytes of data payload) in a 1024-node cluster.
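
The reliability substrate sketched in the abstract (sequence numbers, a sliding window, retransmission on top of unreliable multicast) can be illustrated generically. The structure below is a hypothetical simplification, not the paper's substrate; send_multicast() stands in for the real UD multicast send, and retransmission timers are omitted.

    /* Hypothetical sliding-window sender over an unreliable multicast
     * service (simplified illustration only). */
    #include <stdint.h>
    #include <string.h>

    #define WINDOW_SIZE 32
    #define PKT_SIZE    2048

    struct slot {
        uint32_t seq;                 /* sequence number of the packet */
        size_t   len;
        char     data[PKT_SIZE];      /* kept for possible retransmission */
        int      in_use;
    };

    struct mcast_sender {
        struct slot window[WINDOW_SIZE];
        uint32_t next_seq;            /* next sequence number to assign */
        uint32_t acked;               /* all seq < acked are acknowledged */
    };

    /* Placeholder for the real unreliable-multicast send (e.g. an
     * InfiniBand UD send to a multicast group); always "succeeds" here. */
    static int send_multicast(const void *pkt, size_t len)
    {
        (void)pkt; (void)len;
        return 0;
    }

    /* Send a packet if the window has room; keep a copy for retransmission. */
    static int window_send(struct mcast_sender *s, const void *data, size_t len)
    {
        if (s->next_seq - s->acked >= WINDOW_SIZE)
            return -1;                        /* window full: caller must wait */
        struct slot *sl = &s->window[s->next_seq % WINDOW_SIZE];
        sl->seq = s->next_seq++;
        sl->len = len;
        sl->in_use = 1;
        memcpy(sl->data, data, len);
        return send_multicast(sl->data, sl->len);
    }

    /* On a (possibly RDMA-delivered) cumulative ACK, release every slot
     * up to and including acked_seq. */
    static void window_ack(struct mcast_sender *s, uint32_t acked_seq)
    {
        while (s->acked <= acked_seq) {
            s->window[s->acked % WINDOW_SIZE].in_use = 0;
            s->acked++;
        }
    }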


conference on high performance computing (supercomputing) | 2004

Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation

Jiuxing Liu; Abhinav Vishnu; Dhabaleswar K. Panda

In the area of cluster computing, InfiniBand is becoming increasingly popular due to its open standard and high performance. However, even with InfiniBand, network bandwidth can still become the performance bottleneck for some of today's most demanding applications. In this paper, we study the problem of how to overcome the bandwidth bottleneck by using multirail networks. We present different ways of setting up multirail networks with InfiniBand and propose a unified MPI design that can support all these approaches. We also discuss various important design issues and different policies for using multirail networks, including an adaptive striping scheme that can dynamically change the striping parameters based on current system conditions. We have implemented our design and evaluated it using both microbenchmarks and applications. Our performance results show that multirail networks can significantly improve MPI communication performance. With a two-rail InfiniBand cluster, we have achieved almost twice the bandwidth and half the latency for large messages compared with the original MPI. At the application level, the multirail MPI can significantly reduce communication time as well as running time depending on the communication pattern. We have also shown that the adaptive striping scheme can achieve excellent performance without a priori knowledge of the bandwidth of each rail.
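
The striping idea can be pictured with a small helper that splits a large message across rails in proportion to each rail's most recently observed bandwidth. This is a generic illustration under assumed inputs (rail count and bandwidth estimates), not the paper's adaptive striping code.

    /* Generic sketch: split `total` bytes into per-rail chunk sizes in
     * proportion to each rail's observed bandwidth (bw[] in MB/s,
     * refreshed elsewhere, e.g. from completion timestamps). */
    #include <stddef.h>

    void stripe_message(size_t total, const double bw[], int nrails,
                        size_t chunk[])
    {
        double sum = 0.0;
        size_t assigned = 0;

        for (int i = 0; i < nrails; i++)
            sum += bw[i];

        for (int i = 0; i < nrails; i++) {
            /* no estimates yet: fall back to an even split */
            double share = (sum > 0.0) ? bw[i] / sum : 1.0 / nrails;
            chunk[i] = (size_t)(total * share);
            assigned += chunk[i];
        }
        /* give any rounding leftover to the last rail */
        chunk[nrails - 1] += total - assigned;
    }

Each chunk would then be posted as a separate RDMA operation on its own rail, and the bandwidth estimates updated from the observed completion times, which is the feedback loop an adaptive scheme relies on.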


international conference on supercomputing | 2009

Virtualization polling engine (VPE): using dedicated CPU cores to accelerate I/O virtualization

Jiuxing Liu; Bulent Abali

Virtual machine (VM) technologies are making rapid progress and VM performance is approaching that of native hardware in many aspects. Achieving high performance for I/O virtualization remains a challenge, however, especially for high speed networking devices such as 10 Gigabit Ethernet (10 GbE) NICs. Traditional software-based approaches to I/O virtualization usually suffer significant performance degradation compared with native hardware. Hardware-based approaches that allow direct device access in VMs can achieve good performance, albeit at the expense of increased hardware cost and increased complexity in achieving tasks such as VM checkpointing, migration, and record/replay. Recently, the trend in microprocessor design has shifted from achieving higher CPU frequencies to putting more cores in a single chip, and thus the cost of each core is rapidly decreasing. In this paper, we propose a new I/O virtualization approach called the Virtualization Polling Engine (VPE). VPE introduces a concept called virtualization onload, which takes advantage of dedicated CPU cores to help with the virtualization of I/O devices by using an event-driven execution model with dedicated polling threads. It can significantly reduce virtualization overhead and achieve performance close to the hardware-based approaches without requiring special hardware support. Using our VPE approach, we developed a prototype called KVM-VPE to provide Ethernet virtualization support for KVM. Our experiments in a 10GbE testbed showed that VPE significantly outperformed the original KVM. In Netperf TCP tests, our prototype achieved over 5 times the bandwidth for transmitting (Tx) and over 3 times the bandwidth for receiving (Rx) compared with the original KVM. KVM-VPE also supports direct user application access to the virtual Ethernet interfaces and achieved 7.4 μs end-to-end latency between two VMs on different machines in our testbed. Overall, our research demonstrated that VPE is a promising approach to high performance I/O virtualization in the coming multicore era.
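
The dedicated-polling-thread idea can be sketched with standard POSIX threads: pin a thread to a spare core and let it spin polling device queues instead of taking interrupts. The poll_virtual_nics() hook is a hypothetical placeholder; this is an illustration of the general approach, not the KVM-VPE code.

    /* Sketch of a dedicated polling core: a pthread pinned to one CPU
     * spins draining virtual NIC queues (illustrative only). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>

    static atomic_int running = 1;

    /* Hypothetical hook: drain tx/rx rings of all virtual NICs once. */
    static void poll_virtual_nics(void) { /* device-specific work */ }

    static void *polling_engine(void *arg)
    {
        int core = *(int *)arg;

        /* pin this thread to its dedicated core */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* event-driven loop: poll instead of waiting for interrupts */
        while (atomic_load(&running))
            poll_virtual_nics();

        return NULL;
    }

    int start_polling_engine(pthread_t *tid, int *core)
    {
        return pthread_create(tid, NULL, polling_engine, core);
    }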


virtual execution environments | 2007

Nomad: migrating OS-bypass networks in virtual machines

Wei Huang; Jiuxing Liu; Matthew J. Koop; Bulent Abali; Dhabaleswar K. Panda

Virtual machine (VM) technology is experiencing a resurgence due to various benefits including ease of management, security and resource consolidation. Live migration of virtual machines allows transparent movement of OS instances and hosted applications across physical machines. It is one of the most useful features of VM technology because it provides a powerful tool for effective administration of modern cluster environments. Migrating network resources is one of the key problems that need to be addressed in the VM migration process. Existing studies of VM migration have focused on traditional I/O interfaces such as Ethernet. However, modern high-speed interconnects with intelligent NICs pose significantly more challenges as they have additional features including hardware level reliable services and direct I/O accesses. In this paper we present Nomad, a design for migrating modern interconnects with the aforementioned features, focusing on cluster environments running VMs. We introduce a thin namespace virtualization layer to efficiently address location dependent resource handles and a handshake protocol which transparently maintains reliable service semantics during migration. We demonstrate our design by implementing a prototype based on the Xen virtual machine monitor and InfiniBand. Our performance analysis shows that Nomad can achieve efficient migration of network resources, even in environments with stringent communication performance requirements.
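
The "thin namespace virtualization layer" can be pictured as a per-VM translation table that maps the location-dependent handles an application holds (for example, queue pair numbers) to whatever physical handles exist on the current host; after migration only the physical column is rewritten. The structure below is a hypothetical illustration, not Nomad's implementation.

    /* Hypothetical per-VM handle translation table: guest-visible virtual
     * QP numbers stay stable across migration while the physical QP
     * numbers are rebound on the destination host. */
    #include <stdint.h>

    #define MAX_HANDLES 1024

    struct handle_map {
        uint32_t virt_qpn[MAX_HANDLES];   /* handle the guest application sees */
        uint32_t phys_qpn[MAX_HANDLES];   /* handle valid on the current host */
        int      count;
    };

    /* Look up the physical handle for a guest handle; -1 if unknown. */
    static int64_t translate(const struct handle_map *m, uint32_t virt)
    {
        for (int i = 0; i < m->count; i++)
            if (m->virt_qpn[i] == virt)
                return m->phys_qpn[i];
        return -1;
    }

    /* After migration, re-bind a guest handle to the resource re-created
     * on the destination node (e.g. during a handshake phase). */
    static void rebind(struct handle_map *m, uint32_t virt, uint32_t new_phys)
    {
        for (int i = 0; i < m->count; i++)
            if (m->virt_qpn[i] == virt)
                m->phys_qpn[i] = new_phys;
    }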

Collaboration


An overview of Jiuxing Liu's collaborations and top co-authors.

Top Co-Authors

Pete Wyckoff, Ohio Supercomputer Center
Darius Buntinas, Argonne National Laboratory
Wei Huang, Ohio State University