Pete Wyckoff
Ohio Supercomputer Center
Publications
Featured research published by Pete Wyckoff.
international conference on supercomputing | 2003
Jiuxing Liu; Jiesheng Wu; Sushmitha P. Kini; Pete Wyckoff; Dhabaleswar K. Panda
Although the InfiniBand Architecture is relatively new in the high-performance computing area, it offers many features that help improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 million bytes (831 megabytes) per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.
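For readers unfamiliar with the mechanism, the sketch below shows how a one-sided RDMA write is posted with the InfiniBand verbs API, the kind of operation this paper applies to small and control messages. It assumes the queue pair, memory registration, and the out-of-band exchange of the peer's buffer address and rkey have already happened; it is an illustration of the verbs mechanism, not the authors' MPI implementation.

```c
/* Minimal sketch: posting a one-sided RDMA write with the InfiniBand verbs
 * API, the kind of operation used to deliver small/eager MPI messages into
 * a pre-advertised remote buffer.  The queue pair (qp), registered memory
 * region (mr), and the peer's remote_addr/rkey are assumed to have been
 * set up and exchanged already. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

int post_eager_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *local_buf, size_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write    */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* peer buffer        */
    wr.wr.rdma.rkey        = rkey;                /* peer's rkey        */

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success       */
}
```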
conference on high performance computing (supercomputing) | 2001
Piyush Shivam; Pete Wyckoff; Dhabaleswar K. Panda
Modern interconnects like Myrinet and Gigabit Ethernet offer Gb/s speeds, which has put the onus of reducing the communication latency on messaging software. This has led to the development of OS-bypass protocols, which remove the kernel from the critical path and hence reduce the end-to-end latency. With the advent of programmable NICs, many aspects of protocol processing can be offloaded from user space to the NIC, leaving the host processor to dedicate more cycles to the application. Many host-offload messaging systems exist for Myrinet; however, nothing similar exists for Gigabit Ethernet. In this paper we propose Ethernet Message Passing (EMP), a completely new zero-copy, OS-bypass messaging layer for Gigabit Ethernet on Alteon NICs where the entire protocol processing is done at the NIC. This messaging system delivers very good performance (latency of 23 µs and throughput of 880 Mb/s). To the best of our knowledge, this is the first NIC-level implementation of a zero-copy message passing layer for Gigabit Ethernet.
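A latency figure such as the 23 µs quoted above is conventionally obtained from a ping-pong test, reporting half of the average round-trip time. The sketch below measures that over ordinary UDP sockets, i.e. the kernel path that EMP bypasses, so it serves as the kind of baseline such work compares against; the port number and command-line usage are arbitrary choices for the example, and this is not the benchmark used in the paper.

```c
/* Ping-pong latency sketch over plain UDP sockets (the kernel path EMP
 * bypasses).  One-way latency is reported as half the average round-trip
 * time.  Run one side as "pingpong server" and the other as
 * "pingpong client <server-ip>"; the port number is arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#define PORT  5555
#define ITERS 1000

int main(int argc, char **argv)
{
    char buf[4] = {0};
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(PORT) };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (argc > 2 && strcmp(argv[1], "client") == 0) {
        inet_pton(AF_INET, argv[2], &addr.sin_addr);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            sendto(fd, buf, sizeof(buf), 0,
                   (struct sockaddr *)&addr, sizeof(addr));
            recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("one-way latency: %.1f us\n", us / (2.0 * ITERS));
    } else {                                    /* server: echo forever */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        for (;;) {
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&peer, &plen);
            sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
        }
    }
    return 0;
}
```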
international parallel and distributed processing symposium | 2004
Jiuxing Liu; Weihang Jiang; Pete Wyckoff; Dhabaleswar K. Panda; David Ashton; Darius Buntinas; William Gropp; Brian R. Toonen
For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences in designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit remote direct memory access (RDMA) in InfiniBand to achieve high performance. We have based our design on the RDMA channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 µs latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.
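MPI-level latency numbers like the 7.6 µs cited here are conventionally measured with a ping-pong microbenchmark between two ranks, taking half of the averaged round-trip time for small messages. A minimal sketch using only standard MPI calls (not the authors' benchmark code) is shown below.

```c
/* Minimal MPI ping-pong sketch: ranks 0 and 1 bounce a 4-byte message and
 * the one-way latency is estimated as half the averaged round-trip time.
 * Illustrative of the measurement methodology, not the paper's code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000, len = 4;          /* 4-byte messages */
    char buf[4] = {0};
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```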
international symposium on microarchitecture | 2004
Jiuxing Liu; B. Chandrasekaran; Weikuan Yu; Jiesheng Wu; Darius Buntinas; Sushmitha P. Kini; Dhabaleswar K. Panda; Pete Wyckoff
Today's distributed and high-performance applications require high computational power and high communication performance. Recently, the computational power of commodity PCs has doubled about every 18 months. At the same time, network interconnects that provide very low latency and very high bandwidth are also emerging. This is a promising trend in building high-performance computing environments by clustering: combining the computational power of commodity PCs with the communication performance of high-speed network interconnects. There are several network interconnects that provide low latency and high bandwidth. Traditionally, researchers have used simple microbenchmarks, such as latency and bandwidth tests, to characterize a network interconnect's communication performance. Later, they proposed more sophisticated models such as LogP. However, these tests and models focus on general parallel computing systems and do not address many features present in these emerging commercial interconnects. Another way to evaluate different network interconnects is to use real-world applications. However, real applications usually run on top of a middleware layer such as the Message Passing Interface (MPI). Our results show that to gain more insight into the performance characteristics of these interconnects, it is important to go beyond simple tests such as those for latency and bandwidth. In the future, we plan to expand our microbenchmark suite to include more tests and more interconnects.
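One example of a test that goes beyond plain latency and bandwidth is a computation/communication overlap probe: issue a non-blocking send, do useful work while it is in flight, and see how much of the transfer the host actually hides. The sketch below is an illustrative version of such a probe using standard MPI calls; it is not taken from the authors' microbenchmark suite, and the message and work sizes are arbitrary.

```c
/* Sketch of an overlap microbenchmark: rank 0 posts a non-blocking 1 MB
 * send and performs synthetic computation before waiting for completion.
 * Comparing this time against the non-overlapped case indicates how much
 * communication the interconnect can hide behind computation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN   (1 << 20)               /* 1 MB message        */
#define WORKN 100000                  /* synthetic work size */

static double busy_work(double *a, int n)
{
    double s = 0.0;                   /* synthetic computation */
    for (int i = 0; i < n; i++)
        s += a[i] * 1.000001;
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(LEN);
    double *work = malloc(WORKN * sizeof(double));
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, LEN);
    for (int i = 0; i < WORKN; i++)
        work[i] = i;

    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Isend(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        double s = busy_work(work, WORKN);    /* overlap attempt */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();
        printf("send + compute: %.1f us (checksum %g)\n",
               (t1 - t0) * 1e6, s);
    } else if (rank == 1) {
        MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    free(work);
    return 0;
}
```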
international conference on parallel processing | 2003
Jiesheng Wu; Pete Wyckoff; Dhabaleswar K. Panda
I/O is quickly emerging as the main bottleneck limiting performance in modern-day clusters. The need for scalable parallel I/O and file systems is becoming more and more urgent. We examine the feasibility of leveraging InfiniBand technology to improve the I/O performance and scalability of cluster file systems. We use the Parallel Virtual File System (PVFS) as a basis for exploring these features. We design and implement a PVFS version on InfiniBand by taking advantage of InfiniBand features and resolving many challenging issues. We design the following: a transport layer customized for PVFS by trading transparency and generality for performance; and buffer management for flow control, dynamic and fair buffer sharing, and efficient memory registration and deregistration. Compared to a PVFS implementation over standard TCP/IP on the same InfiniBand network, our implementation offers three times the bandwidth when workloads are not disk-bound and a 40% improvement in bandwidth in the disk-bound case. Client CPU utilization is reduced from 91% on TCP/IP to 1.5%. To the best of our knowledge, this is the first design, implementation and evaluation of PVFS over InfiniBand. The research results demonstrate how to design high-performance parallel file systems on next-generation clusters with InfiniBand.
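The "efficient memory registration and deregistration" mentioned above usually means avoiding repeated pinning of buffers, since ibv_reg_mr() is expensive. The sketch below shows a stripped-down registration ("pin-down") cache over the verbs API; the names get_mr and flush_cache, the fixed slot count, and the lack of eviction, locking, and range handling are simplifications for illustration and are not the paper's actual buffer manager.

```c
/* Tiny registration ("pin-down") cache sketch: memory regions are cached
 * and reused across transfers instead of being registered/deregistered
 * every time.  Eviction, thread safety, and partial-overlap handling are
 * deliberately omitted. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 64

struct reg_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;
};

static struct reg_entry cache[CACHE_SLOTS];

struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    int free_slot = -1;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].mr && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mr;                /* cache hit: reuse MR   */
        if (!cache[i].mr && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return NULL;                           /* cache full            */

    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr) {                                  /* miss: pin and cache   */
        cache[free_slot].addr = addr;
        cache[free_slot].len  = len;
        cache[free_slot].mr   = mr;
    }
    return mr;
}

void flush_cache(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].mr) {
            ibv_dereg_mr(cache[i].mr);         /* unpin on teardown     */
            memset(&cache[i], 0, sizeof(cache[i]));
        }
}
```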
international parallel and distributed processing symposium | 2006
Uday Bondhugula; Ananth Devulapalli; Joseph Fernando; Pete Wyckoff; P. Sadayappan
With rapid advances in VLSI technology, field-programmable gate arrays (FPGAs) are receiving the attention of the parallel and high-performance computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the all-pairs shortest-paths problem in a directed graph. Our work is motivated by a computationally intensive bioinformatics application that employs this algorithm. The design we propose makes efficient and maximal utilization of the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working FPGA implementation on the Cray XD1 show a speedup of 22 over execution on the XD1's processor.
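For reference, the sequential Floyd-Warshall kernel that the FPGA design parallelizes is shown below; the outer k loop carries the data dependence the paper has to work around, while the inner i/j updates for a fixed k are independent and hence parallelizable. The small graph is a made-up example.

```c
/* Reference sequential Floyd-Warshall for all-pairs shortest paths. */
#include <stdio.h>

#define N    4
#define INF  1000000

int main(void)
{
    /* Adjacency matrix of a small directed graph (INF = no edge). */
    int d[N][N] = {
        {0,   3,   INF, 7},
        {8,   0,   2,   INF},
        {5,   INF, 0,   1},
        {2,   INF, INF, 0},
    };

    for (int k = 0; k < N; k++)          /* loop-carried dependence     */
        for (int i = 0; i < N; i++)      /* i/j iterations for a fixed  */
            for (int j = 0; j < N; j++)  /* k are independent           */
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%8d", d[i][j]);
        printf("\n");
    }
    return 0;
}
```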
international conference on cluster computing | 2002
Pavan Balaji; Piyush Shivam; Pete Wyckoff; Dhabaleswar K. Panda
While a number of user-level protocols have been developed to reduce the gap between the performance capabilities of the physical network and the performance actually available, applications that have already been developed on kernel-based protocols such as TCP have largely been ignored. There is a need to make these existing TCP applications take advantage of modern user-level protocols such as EMP or VIA, which feature both low latency and high bandwidth. We have designed, implemented and evaluated a scheme to support such applications written using the sockets API to run over EMP without any changes to the application itself. Using this scheme, we are able to achieve a latency of 28.5 µs for datagram sockets and 37 µs for data streaming sockets, compared to a latency of 120 µs obtained by TCP for 4-byte messages. This scheme attains a peak bandwidth of around 840 Mbps. Both the latency and the throughput numbers are close to those achievable by EMP. The ftp application shows twice as much benefit on our sockets interface, while the Web server application shows up to six times the performance as compared to TCP. To the best of our knowledge, this is the first such design and implementation for Gigabit Ethernet.
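One common way to run unmodified sockets applications over a different transport is symbol interposition with LD_PRELOAD; the abstract does not say which interception mechanism the authors used, so the sketch below is only an illustration of the general technique, with a logging stand-in where a sockets-over-EMP layer would redirect the data.

```c
/* Sketch of transparent sockets interposition via LD_PRELOAD: this shared
 * library overrides send(), logs the call, and forwards it to libc's real
 * implementation.  A sockets-over-EMP layer would instead check whether
 * the fd belongs to a connection it manages and hand the buffer to the
 * user-level transport here, bypassing the kernel.
 *
 * Build:  gcc -shared -fPIC -o libintercept.so intercept.c -ldl
 * Run:    LD_PRELOAD=./libintercept.so ./unmodified_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t send(int fd, const void *buf, size_t len, int flags)
{
    static ssize_t (*real_send)(int, const void *, size_t, int) = NULL;

    if (!real_send)                          /* look up libc's send()   */
        real_send = (ssize_t (*)(int, const void *, size_t, int))
                        dlsym(RTLD_NEXT, "send");

    fprintf(stderr, "send(fd=%d, len=%zu)\n", fd, len);

    return real_send(fd, buf, len, flags);   /* fall through to kernel  */
}
```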
international parallel and distributed processing symposium | 2004
Jiesheng Wu; Pete Wyckoff; Dhabaleswar K. Panda
In this paper, a systematic study of the two main types of approaches for MPI datatype communication (pack/unpack-based approaches and copy-reduced approaches) is carried out on the InfiniBand network. We focus on overlapping packing, network communication, and unpacking in the pack/unpack-based approaches. We use RDMA operations to avoid packing and/or unpacking in the copy-reduced approaches. Four schemes (buffer-centric segment pack/unpack, RDMA write gather with unpack, pack with RDMA read scatter, and multiple RDMA writes) have been proposed. Three of them have been implemented and evaluated based on one MPI implementation over InfiniBand. Performance results of a vector microbenchmark demonstrate that latency is improved by a factor of up to 3.4 and bandwidth by a factor of up to 3.6 compared to the current datatype communication implementation. Collective operations like MPI_Alltoall are demonstrated to benefit as well. A factor of up to 2.0 improvement has been seen in our measurements of those collective operations on an 8-node system.
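The "vector microbenchmark" refers to transfers of strided, noncontiguous data described with an MPI derived datatype. The sketch below shows the standard way such a layout is expressed and sent with MPI_Type_vector, which is what the pack/unpack and RDMA scatter/gather schemes above optimize underneath; it is a generic MPI example, not the paper's benchmark.

```c
/* Sending one column of a row-major matrix as a noncontiguous vector
 * datatype: ROWS blocks of 1 double separated by a stride of COLS.  The
 * MPI library decides how to pack, pipeline, or scatter/gather it. */
#include <mpi.h>
#include <stdio.h>

#define ROWS 8
#define COLS 8

int main(int argc, char **argv)
{
    double a[ROWS][COLS];
    int rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = rank * 100 + i * COLS + j;

    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received column 2, a[3][2] = %g\n", a[3][2]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```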
conference on high performance computing (supercomputing) | 2007
Ananth Devulapalli; Dennis Dalessandro; Pete Wyckoff; Nawab Ali; P. Sadayappan
As storage systems evolve, the block-based design of today's disks is becoming inadequate. As an alternative, object-based storage devices (OSDs) offer a view where the disk manages data layout and keeps track of various attributes about data objects. By moving functionality that is traditionally the responsibility of the host OS to the disk, it is possible to improve overall performance and simplify management of a storage system. The capabilities of OSDs will also permit performance improvements in parallel file systems, such as further decoupling metadata operations and thus reducing metadata server bottlenecks. In this work we present an implementation of the Parallel Virtual File System (PVFS) integrated with a software emulator of an OSD and describe an infrastructure for client access. Even with the overhead of emulation, performance is comparable to a traditional server-fronted implementation, demonstrating that serverless parallel file systems using OSDs are an achievable goal.
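To make the object-based interface concrete, here is a toy in-memory sketch: clients address data as <object id, offset, length> while the "device" manages layout and a size attribute itself. The osd_create/osd_write/osd_read names and the structure are purely illustrative; this is neither the T10 OSD command set nor the authors' emulator.

```c
/* Toy in-memory illustration of the object-storage interface style: the
 * "device" hands out object IDs and tracks layout and attributes (here
 * only the size), so clients never compute block addresses. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OBJECTS 16

struct object {
    int    used;
    size_t size;          /* attribute maintained by the "device" */
    char  *data;
};

static struct object store[MAX_OBJECTS];

int osd_create(void)
{
    for (int id = 0; id < MAX_OBJECTS; id++)
        if (!store[id].used) {
            store[id].used = 1;
            return id;                     /* object id, not a block # */
        }
    return -1;
}

int osd_write(int id, size_t off, const void *buf, size_t len)
{
    struct object *o = &store[id];
    if (off + len > o->size) {             /* device manages layout    */
        o->data = realloc(o->data, off + len);
        o->size = off + len;
    }
    memcpy(o->data + off, buf, len);
    return 0;
}

int osd_read(int id, size_t off, void *buf, size_t len)
{
    if (off + len > store[id].size)
        return -1;
    memcpy(buf, store[id].data + off, len);
    return 0;
}

int main(void)
{
    char out[6] = {0};
    int id = osd_create();
    osd_write(id, 0, "hello", 5);
    osd_read(id, 0, out, 5);
    printf("object %d: \"%s\" (size attribute = %zu)\n",
           id, out, store[id].size);
    return 0;
}
```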
cluster computing and the grid | 2005
Gaurav Khanna; Nagavijayalakshmi Vydyanathan; Tahsin M. Kurç; Pete Wyckoff; Joel H. Saltz; P. Sadayappan
This paper proposes a novel, hypergraph-partitioning-based strategy to schedule multiple data analysis tasks with batch-shared I/O behavior. The strategy formulates the sharing of files among tasks as a hypergraph to minimize the I/O overhead caused by transferring the same set of files multiple times, and employs a dynamic scheme for file transfers to reduce contention on the storage system. We experimentally evaluate the proposed approach using application emulators from two application domains: analysis of remotely sensed data and biomedical imaging.
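In the usual hypergraph formulation, tasks are vertices and each shared file is a hyperedge spanning the tasks that read it; if the tasks touching a file land in lambda different batches, that file must be staged roughly lambda times. The sketch below scores a fixed task-to-batch assignment with the standard "connectivity minus one" cut metric; the task/file data are made up, a real scheduler would feed the hypergraph to a partitioner rather than score one assignment, and the paper's exact objective may differ in detail.

```c
/* Scoring a task-to-batch assignment under the hypergraph model: each
 * file is a hyperedge over the tasks that read it, and a file spanning
 * lambda batches costs (lambda - 1) extra transfers. */
#include <stdio.h>

#define NTASKS 6
#define NFILES 4
#define NPARTS 2

int main(void)
{
    /* needs[f][t] = 1 if task t reads file f (the hyperedge "pins"). */
    int needs[NFILES][NTASKS] = {
        {1, 1, 0, 0, 0, 0},   /* file 0 shared by tasks 0,1   */
        {0, 1, 1, 1, 0, 0},   /* file 1 shared by tasks 1,2,3 */
        {0, 0, 0, 1, 1, 0},   /* file 2 shared by tasks 3,4   */
        {0, 0, 0, 0, 1, 1},   /* file 3 shared by tasks 4,5   */
    };
    /* A candidate assignment of tasks to batches. */
    int part[NTASKS] = {0, 0, 0, 1, 1, 1};

    int cost = 0;
    for (int f = 0; f < NFILES; f++) {
        int seen[NPARTS] = {0}, lambda = 0;
        for (int t = 0; t < NTASKS; t++)
            if (needs[f][t] && !seen[part[t]]) {
                seen[part[t]] = 1;
                lambda++;                 /* batches needing file f     */
            }
        cost += lambda - 1;               /* extra transfers of file f  */
    }
    printf("connectivity-1 cut cost = %d extra file transfers\n", cost);
    return 0;
}
```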