Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jiesheng Wu is active.

Publication


Featured research published by Jiesheng Wu.


international conference on supercomputing | 2003

High performance RDMA-based MPI implementation over InfiniBand

Jiuxing Liu; Jiesheng Wu; Sushmitha P. Kini; Pete Wyckoff; Dhabaleswar K. Panda

Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 million bytes (831 megabytes) per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.
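As a concrete illustration of the kind of MPI-level measurement cited above, the following is a minimal ping-pong latency sketch in C; it is not the authors' benchmark, and the message size and iteration count are arbitrary choices.

    /* Minimal MPI ping-pong latency sketch (illustrative only): run with
     * exactly 2 ranks, e.g. "mpirun -np 2 ./pingpong". */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS    1000
    #define MSG_SIZE 4            /* small message, in bytes */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSG_SIZE] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency = half the round-trip time */
            printf("latency: %.2f us\n", (t1 - t0) / ITERS / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }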


international symposium on microarchitecture | 2004

Microbenchmark performance comparison of high-speed cluster interconnects

Jiuxing Liu; B. Chandrasekaran; Weikuan Yu; Jiesheng Wu; Darius Buntinas; Sushmitha P. Kini; Dhabaleswar K. Panda; Pete Wyckoff

Today's distributed and high-performance applications require high computational power and high communication performance. Recently, the computational power of commodity PCs has doubled about every 18 months. At the same time, network interconnects that provide very low latency and very high bandwidth are also emerging. This is a promising trend in building high-performance computing environments by clustering - combining the computational power of commodity PCs with the communication performance of high-speed network interconnects. There are several network interconnects that provide low latency and high bandwidth. Traditionally, researchers have used simple microbenchmarks, such as latency and bandwidth tests, to characterize a network interconnect's communication performance. Later, they proposed more sophisticated models such as LogP. However, these tests and models focus on general parallel computing systems and do not address many features present in these emerging commercial interconnects. Another way to evaluate different network interconnects is to use real-world applications. However, real applications usually run on top of a middleware layer such as the Message Passing Interface (MPI). Our results show that to gain more insight into the performance characteristics of these interconnects, it is important to go beyond simple tests such as those for latency and bandwidth. In the future, we plan to expand our microbenchmark suite to include more tests and more interconnects.
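The simple bandwidth tests referred to above are typically structured as a window of non-blocking sends; below is a minimal MPI-level sketch of such a test (illustrative only, with arbitrary window and message sizes), not the suite used in the study.

    /* Minimal MPI streaming-bandwidth sketch (illustrative only): rank 0
     * posts a window of non-blocking sends, rank 1 posts matching receives.
     * Run with 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WINDOW   64
    #define MSG_SIZE (64 * 1024)
    #define ITERS    100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Request req[WINDOW];
        char *buf = calloc(1, MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int w = 0; w < WINDOW; w++)
            req[w] = MPI_REQUEST_NULL;      /* harmless for extra ranks */
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            for (int w = 0; w < WINDOW; w++) {
                if (rank == 0)
                    MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                              MPI_COMM_WORLD, &req[w]);
                else if (rank == 1)
                    MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                              MPI_COMM_WORLD, &req[w]);
            }
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 0) {
            double bytes = (double)ITERS * WINDOW * MSG_SIZE;
            printf("bandwidth: %.1f MB/s\n", bytes / (t1 - t0) / 1e6);
        }
        MPI_Finalize();
        free(buf);
        return 0;
    }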


international conference on parallel processing | 2003

PVFS over InfiniBand: design and performance evaluation

Jiesheng Wu; Pete Wyckoff; Dhabaleswar K. Panda

I/O is quickly emerging as the main bottleneck limiting performance in modern-day clusters. The need for scalable parallel I/O and file systems is becoming more and more urgent. We examine the feasibility of leveraging InfiniBand technology to improve I/O performance and scalability of cluster file systems. We use the Parallel Virtual File System (PVFS) as a basis for exploring these features. We design and implement a PVFS version on InfiniBand by taking advantage of InfiniBand features and resolving many challenging issues. We design the following: a transport layer customized for PVFS by trading transparency and generality for performance; buffer management for flow control, dynamic and fair buffer sharing, and efficient memory registration and deregistration. Compared to a PVFS implementation over standard TCP/IP on the same InfiniBand network, our implementation offers three times the bandwidth if workloads are not disk-bound and a 40% improvement in bandwidth in the disk-bound case. Client CPU utilization is reduced to 1.5% from 91% on TCP/IP. To the best of our knowledge, this is the first design, implementation and evaluation of PVFS over InfiniBand. The research results demonstrate how to design high performance parallel file systems on next generation clusters with InfiniBand.
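For context, the sketch below shows the kind of parallel I/O workload a cluster file system such as PVFS serves, expressed with standard MPI-IO calls; it is not PVFS's internal transport or buffer-management code, and the file path is hypothetical.

    /* Minimal MPI-IO sketch (illustrative only): each rank writes its own
     * disjoint block of a shared file. The mount point "/pvfs" is a
     * hypothetical example. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK (1 << 20)       /* 1 MB per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf = malloc(BLOCK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'x', BLOCK);

        MPI_File_open(MPI_COMM_WORLD, "/pvfs/testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* each rank writes at its own offset, so writes do not overlap */
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK, MPI_CHAR,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        free(buf);
        return 0;
    }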


international symposium on performance analysis of systems and software | 2004

Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?

Pavan Balaji; Sundeep Narravula; Karthikeyan Vaidyanathan; Savitha Krishnamoorthy; Jiesheng Wu; Dhabaleswar K. Panda

The Sockets Direct Protocol (SDP) was recently proposed to enable sockets-based applications to take advantage of the enhanced features provided by the InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP based on a detailed suite of micro-benchmarks. Next, we evaluate it on two different real application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP is able to provide up to 2.7 times better bandwidth as compared to the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP is able to achieve considerably higher performance (an improvement of up to 2.4 times) as compared to IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers in spite of currently being limited by a high connection setup time. However, this limitation is entirely implementation specific, and as InfiniBand software and hardware products rapidly mature, we expect it to be overcome soon. Based on this, we have shown that the projected performance for SDP, without the connection setup time, can outperform IPoIB for small message transfers as well.
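Since SDP's selling point is that unmodified sockets code can benefit, a plain TCP round-trip client like the sketch below is representative of the applications being measured; the transport underneath (IPoIB or SDP) is usually switched outside the code, e.g. through the system's socket library configuration. The host, port, and the assumption of an echo server on the other end are hypothetical, and error handling is omitted.

    /* Minimal sockets round-trip sketch (illustrative only). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64] = "ping";
        struct sockaddr_in srv;

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5000);                 /* hypothetical port */
        inet_pton(AF_INET, "192.168.0.2", &srv.sin_addr);  /* hypothetical host */

        connect(fd, (struct sockaddr *)&srv, sizeof(srv));
        for (int i = 0; i < 1000; i++) {            /* round trips vs. an echo server */
            send(fd, buf, sizeof(buf), 0);
            recv(fd, buf, sizeof(buf), 0);
        }
        close(fd);
        return 0;
    }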


international parallel and distributed processing symposium | 2004

High performance implementation of MPI derived datatype communication over InfiniBand

Jiesheng Wu; Pete Wyckoff; Dhabaleswar K. Panda

In this paper, a systematic study of the two main types of approaches for MPI datatype communication (pack/unpack-based approaches and copy-reduced approaches) is carried out on the InfiniBand network. We focus on overlapping packing, network communication, and unpacking in the pack/unpack-based approaches. We use RDMA operations to avoid packing and/or unpacking in the copy-reduced approaches. Four schemes (buffer-centric segment pack/unpack, RDMA write gather with unpack, pack with RDMA read scatter, and multiple RDMA writes) have been proposed. Three of them have been implemented and evaluated in one MPI implementation over InfiniBand. Performance results of a vector microbenchmark demonstrate that latency is improved by a factor of up to 3.4 and bandwidth by a factor of up to 3.6 compared to the current datatype communication implementation. Collective operations such as MPI_Alltoall are also demonstrated to benefit: a factor of up to 2.0 improvement has been seen in our measurements of those collective operations on an 8-node system.
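To make the two approach families concrete, here is a minimal sketch (not the paper's implementation) contrasting an explicit pack-and-send with handing the derived datatype to MPI_Send directly and letting the library choose how to move the data.

    /* Illustrative sketch of the two approach families for a strided column.
     * Run with 2 ranks; sizes are arbitrary. */
    #include <mpi.h>
    #include <stdlib.h>

    #define ROWS 128
    #define COLS 128

    int main(int argc, char **argv)
    {
        int rank, pos = 0, packsize;
        static double A[ROWS][COLS];
        MPI_Datatype column;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one column: ROWS blocks of 1 double, stride COLS doubles */
        MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            /* (a) pack/unpack-based approach: explicit copy, then send */
            MPI_Pack_size(1, column, MPI_COMM_WORLD, &packsize);
            char *tmp = malloc(packsize);
            MPI_Pack(&A[0][0], 1, column, tmp, packsize, &pos, MPI_COMM_WORLD);
            MPI_Send(tmp, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

            /* (b) direct datatype send: the MPI library decides how to move
             * the data (internal pack/unpack or copy-reduced schemes) */
            MPI_Send(&A[0][0], 1, column, 1, 1, MPI_COMM_WORLD);
            free(tmp);
        } else if (rank == 1) {
            MPI_Recv(&A[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&A[0][0], 1, column, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }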


Lecture Notes in Computer Science | 2003

Fast and Scalable Barrier Using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters

Sushmitha P. Kini; Jiuxing Liu; Jiesheng Wu; Pete Wyckoff; Dhabaleswar K. Panda

This paper describes a methodology for efficiently implementing the barrier operation on clusters with the emerging InfiniBand Architecture (IBA). IBA provides hardware-level support for the Remote Direct Memory Access (RDMA) message passing model as well as the multicast operation. This paper describes the design, implementation and evaluation of three barrier algorithms that leverage these mechanisms. Performance evaluation studies indicate that considerable benefits can be achieved using these mechanisms compared to the traditional implementation based on the point-to-point message passing model. Our experimental results show a performance benefit of up to 1.29 times for a 16-node barrier and up to 1.71 times for barriers with non-power-of-2 group sizes. Each proposed algorithm performs best for certain ranges of group sizes, and the optimal algorithm can be chosen based on this range. To the best of our knowledge, this is the first attempt to characterize multicast performance in IBA and to demonstrate the benefits achieved by combining it with RDMA operations for efficient implementations of the barrier operation. This framework has significant potential for developing scalable collective communication libraries for IBA-based clusters.
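A barrier microbenchmark of the kind used in such evaluations simply times repeated MPI_Barrier calls; the sketch below is illustrative only and measures whatever barrier algorithm the underlying MPI library provides, with an arbitrary iteration count.

    /* Minimal barrier-latency sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);            /* warm-up / sync */
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg barrier latency: %.2f us\n", (t1 - t0) / ITERS * 1e6);
        MPI_Finalize();
        return 0;
    }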


high performance interconnects | 2003

Micro-benchmark level performance comparison of high-speed cluster interconnects

Jiuxing Liu; B. Chandrasekaran; Weikuan Yu; Jiesheng Wu; Darius Buntinas; Sushmitha P. Kini; Pete Wyckoff; Dhabaleswar K. Panda

In this paper we present a comprehensive performance evaluation of three high-speed cluster interconnects: InfiniBand, Myrinet and Quadrics. We propose a set of micro-benchmarks to characterize different performance aspects of these interconnects. Our micro-benchmark suite includes not only traditional tests and performance parameters, but also those specifically tailored to the interconnects' advanced features, such as user-level access for performing communication and remote direct memory access. In order to explore the full communication capability of the interconnects, we have implemented the micro-benchmark suite at the low-level messaging layer provided by each interconnect. Our performance results show that all three interconnects achieve low latency, high bandwidth and low host overhead. However, they show quite different performance behaviors when handling completion notification, unbalanced communication patterns and different communication buffer reuse patterns.
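The buffer reuse patterns mentioned above can be exercised by cycling through a pool of send/receive buffers instead of reusing one; the sketch below does this at the MPI level rather than at each interconnect's native messaging layer used in the paper, and all sizes and counts are arbitrary.

    /* Illustrative buffer-reuse sketch: a ping-pong that rotates through
     * NBUF distinct buffers. Run with 2 ranks; set NBUF to 1 to recover
     * the fully-reused case. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBUF     16
    #define MSG_SIZE (16 * 1024)
    #define ITERS    1000

    int main(int argc, char **argv)
    {
        int rank;
        char *bufs[NBUF];
        for (int b = 0; b < NBUF; b++)
            bufs[b] = calloc(1, MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            char *buf = bufs[i % NBUF];          /* vary the reuse pattern */
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg round trip: %.2f us\n", (t1 - t0) / ITERS * 1e6);
        MPI_Finalize();
        return 0;
    }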


Lecture Notes in Computer Science | 2004

Zero-Copy MPI Derived Datatype Communication over InfiniBand

Gopalakrishnan Santhanaraman; Jiesheng Wu; Dhabaleswar K. Panda

This paper presents a new scheme, Send Gather Receive Scatter (SGRS), to perform zero-copy datatype communication over InfiniBand. This scheme leverages the gather/scatter feature provided by InfiniBand channel semantics. It takes advantage of the capability of processing non-contiguity on both send and receive sides in the Send Gather and Receive Scatter operations. In this paper, we describe the design, implementation and evaluation of this new scheme. Compared to the existing Multi-W zero-copy datatype scheme, the SGRS scheme can overcome the drawbacks of low network utilization and high startup costs. Our experimental results show significant improvement in both point-to-point and collective datatype communication. The latency of a vector datatype can be reduced by up to 62% and the bandwidth can be increased by up to 400%. The Alltoall collective benchmark shows a performance benefit of up to 23% reduction in latency.
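As a rough POSIX analogy to the gather side of SGRS (not the InfiniBand channel-semantics implementation itself), writev() lets one operation transmit several non-contiguous regions without an intermediate packing copy; the helper below assumes fd is an already-connected socket.

    /* Illustrative gather-send analogy using POSIX writev(). */
    #include <sys/types.h>
    #include <sys/uio.h>

    /* send three non-contiguous regions in a single gather operation */
    ssize_t send_gather(int fd, char *a, size_t alen,
                        char *b, size_t blen,
                        char *c, size_t clen)
    {
        struct iovec iov[3];
        iov[0].iov_base = a; iov[0].iov_len = alen;
        iov[1].iov_base = b; iov[1].iov_len = blen;
        iov[2].iov_base = c; iov[2].iov_len = clen;
        return writev(fd, iov, 3);          /* no intermediate packing copy */
    }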


international conference on parallel processing | 2004

Applying MPI derived datatypes to the NAS benchmarks: A case study

Qingda Lu; Jiesheng Wu; Dhabaleswar K. Panda; P. Sadayappan

MPI derived datatypes are a powerful method to define arbitrary collections of non-contiguous data in memory and to enable non-contiguous data communication in a single MPI function call. In this paper, we employ MPI datatypes in four NAS benchmarks (MG, LU, BT, and SP) to transfer non-contiguous data. A comprehensive performance evaluation was carried out on two clusters: an Itanium-2 Myrinet cluster and a Xeon InfiniBand cluster. Performance results show that using datatypes can achieve performance comparable to manual packing/unpacking in the original benchmarks, even though the MPI implementations studied also perform internal packing and unpacking for non-contiguous datatype communication. In some cases, better performance can be achieved because of the reduced cost of transferring non-contiguous data. This is because some optimizations in the MPI packing/unpacking implementations can easily be overlooked in manual packing and unpacking by users. Our case study demonstrates that MPI datatypes simplify the implementation of non-contiguous communication and lead to application code with portable performance. We expect that with further improvements in datatype processing and datatype communication, such as [10, 24], datatypes can outperform conventional methods of non-contiguous data communication. Our modified NAS benchmarks can be used to evaluate datatype processing and datatype communication in MPI implementations.
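For illustration, the sketch below shows the style of change involved: a strided boundary plane that would otherwise be packed element by element is described once with a derived datatype and sent in a single call. The grid dimensions and peer rank are arbitrary, and this is not code from the modified benchmarks.

    /* Illustrative sketch: send one k-plane of a 3D grid with a subarray
     * datatype instead of a manual pack loop. Run with 2 ranks. */
    #include <mpi.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    int main(int argc, char **argv)
    {
        static double grid[NX][NY][NZ];
        int rank;
        int sizes[3]    = {NX, NY, NZ};
        int subsizes[3] = {NX, NY, 1};      /* one boundary plane, k = 0 */
        int starts[3]   = {0, 0, 0};
        MPI_Datatype plane;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                                 MPI_DOUBLE, &plane);
        MPI_Type_commit(&plane);

        if (rank == 0)       /* send the plane without manual packing */
            MPI_Send(&grid[0][0][0], 1, plane, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&grid[0][0][0], 1, plane, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&plane);
        MPI_Finalize();
        return 0;
    }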


high performance distributed computing | 2003

Impact of high performance sockets on data intensive applications

Pavan Balaji; Jiesheng Wu; Tahsin M. Kurç; Dhabaleswar K. Panda; Joel H. Saltz

The challenging issues in supporting data-intensive applications on clusters include efficient movement of large volumes of data between processor memories, and efficient coordination of data movement and processing by a runtime support layer to achieve high performance. Such applications have several requirements, such as guarantees in performance, scalability with these guarantees, and adaptability to heterogeneous environments. With the advent of user-level protocols like the Virtual Interface Architecture (VIA) and the modern InfiniBand Architecture, the latency and bandwidth experienced by applications have approached those of the physical network on clusters. In order to enable applications written on top of TCP/IP to take advantage of the high performance of these user-level protocols, researchers have come up with a number of techniques, including user-level sockets layers over high performance protocols. In this paper, we study the performance and limitations of one such substrate, referred to here as SocketVIA, using a component framework designed to provide runtime support for data-intensive applications. The experimental results show that by reorganizing certain components of an application (in our case, the partitioning of a dataset into smaller data chunks), we can make significant improvements in application performance. This leads to higher scalability of applications with performance guarantees. It also allows fine-grained load balancing, hence making applications more adaptable to heterogeneity in resource availability. The experimental results also show that the different performance characteristics of SocketVIA allow a more efficient partitioning of data at the source nodes, thus improving the performance of the application by up to an order of magnitude in some cases.
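The repartitioning idea, sending a dataset as smaller chunks so the receiver can balance work at a finer grain, can be sketched as a simple chunked send loop over a connected socket; the helper below is illustrative only, and the chunk size and socket setup are assumed, not taken from the paper's framework.

    /* Illustrative chunked-send helper: transmit len bytes in fixed-size
     * chunks over an already-connected socket fd. Returns 0 on success. */
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    int send_chunked(int fd, const char *data, size_t len, size_t chunk)
    {
        size_t off = 0;
        while (off < len) {
            size_t n = (len - off < chunk) ? len - off : chunk;
            ssize_t sent = send(fd, data + off, n, 0);
            if (sent <= 0)
                return -1;                  /* error or connection closed */
            off += (size_t)sent;
        }
        return 0;
    }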

Collaboration


Dive into Jiesheng Wu's collaborations.

Top Co-Authors

Pete Wyckoff (Ohio Supercomputer Center)

Weikuan Yu (Florida State University)

Darius Buntinas (Argonne National Laboratory)

Pavan Balaji (Argonne National Laboratory)