Publications


Featured research published by Matthew J. Koop.


conference on high performance computing (supercomputing) | 2007

Virtual machine aware communication libraries for high performance computing

Wei Huang; Matthew J. Koop; Qi Gao; Dhabaleswar K. Panda

As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern to achieve both high performance and high productivity computing. Meanwhile, virtual machine (VM) technologies have become popular in both industry and academia due to various features designed to ease system management and administration. While a VM-based environment can greatly help manageability on large-scale computing systems, concerns over performance have largely blocked the HPC community from embracing VM technologies. In this paper, we follow three steps to demonstrate the ability to achieve near-native performance in a VM-based environment for HPC. First, we propose Inter-VM Communication (IVC), a VM-aware communication library to support efficient shared memory communication among computing processes on the same physical host, even though they may be in different VMs. This is critical for multi-core systems, especially when individual computing processes are hosted on different VMs to achieve fine-grained control. Second, we design a VM-aware MPI library based on MVAPICH2 (a popular MPI library), called MVAPICH2-ivc, which allows HPC MPI applications to transparently benefit from IVC. Finally, we evaluate MVAPICH2-ivc on clusters featuring multi-core systems and high performance InfiniBand interconnects. Our evaluation demonstrates that MVAPICH2-ivc can improve NAS Parallel Benchmark performance by up to 11% in a VM-based environment on eight-core Intel Clovertown systems, where each compute process is in a separate VM. A detailed performance evaluation for up to 128 processes (64 dual-socket single-core nodes) shows only a marginal performance overhead of MVAPICH2-ivc as compared with MVAPICH2 running in a native environment. This study indicates that performance should no longer be a barrier preventing HPC environments from taking advantage of the various features available through VM technologies.
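
The core of the IVC design is recognizing which peers share a physical host and routing only those peers over a shared-memory channel. The sketch below illustrates that selection step in plain MPI C; the channel names and the use of gethostname() as a host identifier are assumptions for illustration, not MVAPICH2-ivc internals (a VM-aware library would need a hypervisor-supplied host ID, since co-located guests report different hostnames).

```c
/*
 * Illustrative sketch (not MVAPICH2-ivc code): during startup, every rank
 * learns a physical-host identifier for every peer and marks co-located
 * peers for a shared-memory (IVC-style) channel, everything else for the
 * network channel.  gethostname() stands in for the identifier here.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum channel { CHANNEL_NETWORK, CHANNEL_SHMEM };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char myhost[64] = {0};
    gethostname(myhost, sizeof(myhost) - 1);

    char *allhosts = malloc((size_t)size * 64);
    MPI_Allgather(myhost, 64, MPI_CHAR, allhosts, 64, MPI_CHAR, MPI_COMM_WORLD);

    enum channel *chan = malloc(size * sizeof *chan);
    for (int peer = 0; peer < size; peer++)
        chan[peer] = (strncmp(myhost, allhosts + (size_t)peer * 64, 64) == 0)
                         ? CHANNEL_SHMEM    /* same physical host: shared-memory path */
                         : CHANNEL_NETWORK; /* different host: InfiniBand path        */

    int local = 0;
    for (int peer = 0; peer < size; peer++)
        if (peer != rank && chan[peer] == CHANNEL_SHMEM)
            local++;
    printf("rank %d: %d co-located peer(s) eligible for the shared-memory channel\n",
           rank, local);

    free(chan);
    free(allhosts);
    MPI_Finalize();
    return 0;
}
```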


conference on high performance computing (supercomputing) | 2006

High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis

Sayantan Sur; Matthew J. Koop; Dhabaleswar K. Panda

InfiniBand is an emerging HPC interconnect being deployed in very large scale clusters, with even larger InfiniBand-based clusters expected to be deployed in the near future. The Message Passing Interface (MPI) is the programming model of choice for scientific applications running on these large scale clusters. Thus, it is critical that the MPI implementation used be based on a scalable and high-performance design. We analyze the performance and scalability aspects of MVAPICH, a popular open-source MPI implementation on InfiniBand, from an application standpoint. We analyze the performance and memory requirements of the MPI library while executing several well-known applications and benchmarks, such as NAS, SuperLU, NAMD, and HPL on a 64-node InfiniBand cluster. Our analysis reveals that the latest design of MVAPICH requires an order of magnitude less internal MPI memory (average per process) and yet delivers the best possible performance. Further, we observe that for the benchmarks and applications evaluated, the internal memory requirement of MVAPICH remains nearly constant at around 5-10 MB as the number of processes increases, indicating that the MVAPICH design is highly scalable.
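
As a rough illustration of per-process memory accounting of the kind discussed above, the sketch below samples each rank's resident set size from /proc/self/status and reports the average and maximum across ranks. This measures the whole process rather than the MPI library's internal memory alone, and it is an assumption-laden stand-in for the paper's instrumentation, not a reproduction of it.

```c
/*
 * Rough sketch of per-process memory sampling (not the paper's method):
 * each rank reads its resident set size (Linux VmRSS) and the job reports
 * the average and maximum across ranks.  This captures the whole process,
 * application plus MPI library, so it only approximates the "internal MPI
 * memory" the paper isolates.
 */
#include <mpi.h>
#include <stdio.h>

static long vmrss_kb(void)            /* Linux-specific: VmRSS in kB */
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long mine = vmrss_kb(), sum = 0, max = 0;
    MPI_Reduce(&mine, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&mine, &max, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("resident memory: avg %.1f MB, max %.1f MB over %d processes\n",
               sum / 1024.0 / size, max / 1024.0, size);

    MPI_Finalize();
    return 0;
}
```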


international conference on supercomputing | 2007

High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters

Matthew J. Koop; Sayantan Sur; Qi Gao; Dhabaleswar K. Panda

High-performance clusters have been growing rapidly in scale. Most of these clusters deploy a high-speed interconnect, such as InfiniBand, to achieve higher performance. Most scientific applications executing on these clusters use the Message Passing Interface (MPI) as the parallel programming model. Thus, the MPI library has a key role in achieving application performance by consuming as few resources as possible and enabling scalable performance. State-of-the-art MPI implementations over InfiniBand primarily use the Reliable Connection (RC) transport due to its good performance and attractive features. However, the RC transport requires a connection between every pair of communicating processes, with each requiring several KB of memory. As clusters continue to scale, memory requirements in RC-based implementations increase. The connection-less Unreliable Datagram (UD) transport is an attractive alternative, which eliminates the need to dedicate memory for each pair of processes. In this paper we present a high-performance UD-based MPI design. We implement our design and compare the performance and resource usage with the RC-based MVAPICH. We evaluate NPB, SMG2000, Sweep3D, and sPPM up to 4K processes on a 9216-core InfiniBand cluster. For SMG2000, our prototype shows a 60% speedup and a seven-fold reduction in memory for 4K processes. Additionally, based on our model, our design has an estimated 30-fold reduction in memory over MVAPICH at 16K processes when all connections are created. To the best of our knowledge, this is the first research work that presents a high-performance MPI design over InfiniBand that is completely based on UD and can achieve near-identical or better application performance than RC.
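
The memory argument is essentially arithmetic: RC needs per-peer connection state, UD does not. The sketch below models that scaling with an assumed per-connection cost and an assumed fixed UD footprint; the constants are illustrative placeholders, not the paper's measured values or its model parameters.

```c
/*
 * Back-of-the-envelope model of MPI connection memory for RC vs. UD.
 * The constants below are illustrative assumptions, not the paper's
 * numbers: RC needs one queue pair (plus buffers) per peer, while UD
 * uses a small fixed footprint regardless of job size.
 */
#include <stdio.h>

int main(void)
{
    const double rc_bytes_per_peer = 68.0 * 1024;        /* assumed cost per RC connection   */
    const double ud_fixed_bytes    = 4.0 * 1024 * 1024;  /* assumed fixed UD cost per process */

    int jobs[] = { 1024, 4096, 16384 };
    for (int i = 0; i < 3; i++) {
        int n = jobs[i];
        double rc_mb = (n - 1) * rc_bytes_per_peer / (1024.0 * 1024.0);
        double ud_mb = ud_fixed_bytes / (1024.0 * 1024.0);
        printf("%6d processes: RC ~%8.1f MB/process, UD ~%4.1f MB/process (%.0fx)\n",
               n, rc_mb, ud_mb, rc_mb / ud_mb);
    }
    return 0;
}
```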


virtual execution environments | 2007

Nomad: migrating OS-bypass networks in virtual machines

Wei Huang; Jiuxing Liu; Matthew J. Koop; Bulent Abali; Dhabaleswar K. Panda

Virtual machine (VM) technology is experiencing a resurgence due to various benefits including ease of management, security and resource consolidation. Live migration of virtual machines allows transparent movement of OS instances and hosted applications across physical machines. It is one of the most useful features of VM technology because it provides a powerful tool for effective administration of modern cluster environments. Migrating network resources is one of the key problems that need to be addressed in the VM migration process. Existing studies of VM migration have focused on traditional I/O interfaces such as Ethernet. However, modern high-speed interconnects with intelligent NICs pose significantly more challenges as they have additional features including hardware level reliable services and direct I/O accesses. In this paper we present Nomad, a design for migrating modern interconnects with the aforementioned features, focusing on cluster environments running VMs. We introduce a thin namespace virtualization layer to efficiently address location dependent resource handles and a handshake protocol which transparently maintains reliable service semantics during migration. We demonstrate our design by implementing a prototype based on the Xen virtual machine monitor and InfiniBand. Our performance analysis shows that Nomad can achieve efficient migration of network resources, even in environments with stringent communication performance requirements.


high performance interconnects | 2008

Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand

Matthew J. Koop; Wei Huang; Karthik Gopalakrishnan; Dhabaleswar K. Panda

High-performance systems are undergoing a major shift as commodity multi-core systems become increasingly prevalent. As the number of processes per compute node increases, the other parts of the system must also scale appropriately to maintain a balanced system. In the area of high-performance computing, one very important element of the overall system is the network interconnect that connects compute nodes in the system. InfiniBand is a popular interconnect for high-performance clusters. Unfortunately, InfiniBand performance has been constrained by the limited bandwidth of the PCI-Express fabric. PCI-Express (PCIe) 2.0 has become available and doubles the available transfer rates. This additional I/O bandwidth balances the system and makes higher data rates for external interconnects such as InfiniBand feasible. As a result, InfiniBand quad-data rate (QDR) mode has become available on the Mellanox InfiniBand host channel adapter (HCA) with a 40 Gb/sec signaling rate. In this paper we perform an in-depth performance analysis of PCIe 2.0 and the effect of increased InfiniBand signaling rates. We show that even using the double data rate (DDR) interface, PCIe 2.0 enables a 25% improvement in NAS parallel benchmark IS performance. Furthermore, we show that when using QDR on PCIe 2.0, network loopback can outperform a shared memory message passing implementation. We show that increased interconnection bandwidth significantly improves the overall system balance by lowering latency and increasing bandwidth.
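
The system-balance argument comes down to comparing usable data rates. The short calculation below converts InfiniBand and PCIe signaling rates to data rates under 8b/10b encoding, assuming an x8 HCA slot and a 4X InfiniBand link, and ignoring protocol overheads beyond the encoding.

```c
/*
 * Why QDR InfiniBand needs PCIe 2.0: convert signaling rates to usable
 * data rates.  Both 4X InfiniBand (DDR/QDR) and PCIe 1.1/2.0 use 8b/10b
 * encoding, so the data rate is 80% of the signaling rate.  An x8 slot
 * is assumed; figures are one-directional and ignore other overheads.
 */
#include <stdio.h>

static double data_gbytes_per_s(double signal_gbit_per_s)
{
    return signal_gbit_per_s * 8.0 / 10.0 / 8.0;  /* 8b/10b, then bits -> bytes */
}

int main(void)
{
    printf("IB 4X DDR  (20 Gb/s signaling): %.1f GB/s data\n", data_gbytes_per_s(20.0));
    printf("IB 4X QDR  (40 Gb/s signaling): %.1f GB/s data\n", data_gbytes_per_s(40.0));
    printf("PCIe 1.1 x8 (2.5 GT/s x 8):     %.1f GB/s data\n", data_gbytes_per_s(2.5 * 8));
    printf("PCIe 2.0 x8 (5.0 GT/s x 8):     %.1f GB/s data\n", data_gbytes_per_s(5.0 * 8));
    /* PCIe 1.1 x8 (~2 GB/s) cannot carry a QDR link's ~4 GB/s; PCIe 2.0 x8 can. */
    return 0;
}
```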


high performance interconnects | 2007

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms

Sayantan Sur; Matthew J. Koop; Lei Chai; Dhabaleswar K. Panda

InfiniBand is an emerging networking technology that is gaining rapid acceptance in the HPC domain. Currently, several systems in the Top500 list use InfiniBand as their primary interconnect, with more being planned for near future. The fundamental architecture of the systems are undergoing a sea-change due to the advent of commodity multi-core computing. Due to the increase in the number of processes in each compute node, the network interface is expected to handle more communication traffic as compared to older dual or quad SMP systems. Thus, the network architecture should provide scalable performance as the number of processing cores increase. ConnectX is the fourth generation InfiniBand adapter from Mellanox Technologies. Its novel architecture enhances the scalability and performance of InfiniBand on multi-core clusters. In this paper, we carry out an in-depth performance analysis of ConnectX architecture comparing it with the third generation InfiniHost III architecture on the Intel Bensley platform with Dual Clovertown processors. Our analysis reveals that the aggregate bandwidth for small and medium sized messages can be increased by a factor of 10 as compared to the third generation InfiniHost III adapters. Similarly, RDMA-Write and RDMA-Read latencies for 1 -byte messages can be reduced by a factor of 6 and 3, respectively, even when all cores are communicating simultaneously. Evaluation with communication kernel Halo reveals a performance benefit of a factor of 2 to 5. Finally, the performance of LAMMPS, a molecular dynamics simulator, is improved by 10% for the in.rhodo benchmark.
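
A simple way to see the "all cores communicating simultaneously" effect is a multi-pair bandwidth test, sketched below in plain MPI C. The window and message sizes are arbitrary illustrative choices; this is written in the spirit of such an evaluation, not the benchmark used in the paper.

```c
/*
 * Minimal multi-pair bandwidth test: rank i < P/2 streams a window of
 * messages to rank i + P/2, and aggregate bandwidth is reported at rank 0.
 * Illustrative only; window and message sizes are arbitrary.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64
#define MSG_BYTES (64 * 1024)
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size % 2) {
        if (rank == 0) fprintf(stderr, "requires an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int half = size / 2;
    int sender = rank < half;
    int peer = sender ? rank + half : rank - half;
    char *buf = calloc((size_t)WINDOW, MSG_BYTES);   /* one slot per in-flight message */
    MPI_Request req[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        for (int w = 0; w < WINDOW; w++) {
            char *slot = buf + (size_t)w * MSG_BYTES;
            if (sender)
                MPI_Isend(slot, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(slot, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double secs = MPI_Wtime() - t0;

    if (rank == 0) {
        double bytes = (double)half * ITERS * WINDOW * MSG_BYTES;
        printf("%d pairs, %d KB messages: aggregate %.1f MB/s\n",
               half, MSG_BYTES / 1024, bytes / secs / 1.0e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```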


cluster computing and the grid | 2007

Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective

Abhinav Vishnu; Matthew J. Koop; Adam Moody; Amith R. Mamidala; Sundeep Narravula; Dhabaleswar K. Panda

Large scale InfiniBand clusters are becoming increasingly popular, as reflected by the TOP500 supercomputer rankings. At the same time, the fat tree has become a popular interconnection topology for these clusters, since it provides multiple paths between a pair of nodes. However, even with a fat tree, hot-spots may occur in the network depending upon the route configuration between end nodes and the communication pattern(s) in the application. To make matters worse, the deterministic routing of InfiniBand prevents applications from transparently using multiple paths to avoid hot-spots in the network. Simulation-based studies of congestion control in switches and adapters have been proposed in the literature. However, these studies have focused on providing congestion control for the communication path, and not on utilizing multiple paths in the network for hot-spot avoidance. In this paper, we design MPI functionality that provides hot-spot avoidance for different communication patterns, without a priori knowledge of the pattern. We leverage the LMC (LID mask count) mechanism of InfiniBand to create multiple paths in the network and present the design issues (scheduling policies, selecting the number of paths, scalability aspects) of our design. We implement our design and evaluate it with Pallas collective communication benchmarks and MPI applications. On an InfiniBand cluster with 48 processes, MPI All-to-all personalized shows an improvement of 27%. Our evaluation with NAS parallel benchmarks on 64 processes shows significant improvement in execution time with this functionality.
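
With LMC set to m, a destination port answers to 2^m consecutive LIDs, and the subnet manager may route each of those LIDs differently, so choosing a different destination LID chooses a potentially different path. The sketch below shows one simple scheduling policy (round-robin over the derived LIDs); it is a simplified illustration, not the paper's implementation.

```c
/*
 * Sketch of LMC-based multi-pathing path selection (simplified).  With
 * LMC = m, a destination port answers to the 2^m LIDs base_lid ..
 * base_lid + 2^m - 1; picking a different destination LID therefore picks
 * a (potentially) different route.  Messages below are scheduled
 * round-robin over the available LIDs.
 */
#include <stdio.h>
#include <stdint.h>

#define LMC 2                      /* 2^2 = 4 paths per destination port */

struct peer_paths {
    uint16_t base_lid;             /* low LMC bits are zero */
    unsigned next;                 /* round-robin cursor */
};

/* Return the destination LID to use for the next message to this peer. */
static uint16_t next_dest_lid(struct peer_paths *p)
{
    uint16_t lid = (uint16_t)(p->base_lid + (p->next % (1u << LMC)));
    p->next++;
    return lid;
}

int main(void)
{
    struct peer_paths peer = { .base_lid = 0x40, .next = 0 };

    for (int msg = 0; msg < 8; msg++)
        printf("message %d -> destination LID 0x%02x\n", msg, next_dest_lid(&peer));
    return 0;
}
```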


international parallel and distributed processing symposium | 2009

Designing multi-leader-based Allgather algorithms for multi-core clusters

Krishna Chaitanya Kandalla; Hari Subramoni; Gopalakrishnan Santhanaraman; Matthew J. Koop; Dhabaleswar K. Panda

The increasing demand for computational cycles is being met by the use of multi-core processors. Having a large number of cores per node necessitates multi-core aware designs to extract the best performance. The Message Passing Interface (MPI) is the dominant parallel programming model on modern high performance computing clusters. The MPI collective operations take a significant portion of the communication time for an application. The existing optimizations for collectives exploit shared memory for intra-node communication to improve performance. However, these still do not scale well as the number of cores per node increases. In this work, we propose a novel and scalable multi-leader-based hierarchical Allgather design. This design allows better cache sharing on Non-Uniform Memory Access (NUMA) machines and makes better use of the network speed available with high performance interconnects such as InfiniBand. The new multi-leader-based scheme achieves a performance improvement of up to 58% for small messages and 70% for medium sized messages.
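
The hierarchical structure is the key idea: gather within each node to a leader, Allgather among the leaders, then broadcast within the node. The sketch below shows a single-leader-per-node version in MPI; the paper's design uses multiple leaders per node (e.g., one per NUMA domain), and MPI_Comm_split_type, used here for convenience, is an MPI-3 feature that postdates the paper. Block rank placement and an equal number of ranks per node are assumed so that no reordering step is needed.

```c
/*
 * Sketch of a hierarchical Allgather (single leader per node), to show
 * the structure behind multi-leader designs; not the paper's code.
 *   1) gather node-local contributions at the node leader
 *   2) Allgather the per-node blocks among the leaders
 *   3) broadcast the full result inside each node
 * Assumes block rank placement and equal ranks per node.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Node-local communicator and a communicator connecting the leaders. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int lrank, lsize;
    MPI_Comm_rank(node, &lrank);
    MPI_Comm_size(node, &lsize);

    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    int mine = rank;                       /* one int contributed per rank */
    int *node_block = malloc(lsize * sizeof(int));
    int *result = malloc(size * sizeof(int));

    /* 1) intra-node gather to the leader */
    MPI_Gather(&mine, 1, MPI_INT, node_block, 1, MPI_INT, 0, node);

    /* 2) inter-leader allgather of whole node blocks */
    if (lrank == 0)
        MPI_Allgather(node_block, lsize, MPI_INT, result, lsize, MPI_INT, leaders);

    /* 3) intra-node broadcast of the assembled result */
    MPI_Bcast(result, size, MPI_INT, 0, node);

    if (rank == 0)
        printf("rank 0 sees %d entries, last = %d\n", size, result[size - 1]);

    free(node_block);
    free(result);
    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```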


international parallel and distributed processing symposium | 2008

MVAPICH-Aptus: Scalable high-performance multi-transport MPI over InfiniBand

Matthew J. Koop; Terry Jones; Dhabaleswar K. Panda

The need for computational cycles continues to exceed availability, driving commodity clusters to increasing scales. With upcoming clusters containing tens of thousands of cores, InfiniBand is a popular interconnect on these clusters due to its low latency (1.5 μsec) and high bandwidth (1.5 GB/sec). Since most scientific applications running on these clusters are written using the Message Passing Interface (MPI) as the parallel programming model, the MPI library plays a key role in the performance and scalability of the system. Nearly all MPI implementations over InfiniBand currently use the Reliable Connection (RC) transport of InfiniBand to implement message passing. Using this transport exclusively, however, has been shown to potentially reach a memory footprint of over 200 MB/task at 16K tasks for the MPI library. The Unreliable Datagram (UD) transport, however, offers higher scalability, but at the cost of medium and large message performance. In this paper we present a multi-transport MPI design, MVAPICH-Aptus, that uses both the RC and UD transports of InfiniBand to deliver scalability and performance higher than that of a single-transport MPI design. Evaluation of our hybrid design on 512 cores shows a 12% improvement over an RC-based design and a 4% improvement over a UD-based design for the SMG2000 application benchmark. In addition, for the molecular dynamics application NAMD we show a 10% improvement over an RC-only design. To the best of our knowledge, this is the first such analysis and design of an optimized MPI using both UD and RC.
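
One way such a hybrid design can work is to start every peer on UD and promote a peer to a dedicated RC connection only once traffic to it justifies the per-connection memory. The sketch below shows that policy in simplified form; the thresholds and promotion rule are illustrative assumptions, not MVAPICH-Aptus's actual channel-selection logic.

```c
/*
 * Illustrative hybrid transport policy (not MVAPICH-Aptus's rules):
 * every peer starts on the connection-less UD transport; a peer is
 * promoted to a dedicated RC connection only after enough traffic, or a
 * large enough message, makes the per-connection memory worth paying.
 * Thresholds are arbitrary assumptions.
 */
#include <stdio.h>
#include <stddef.h>

enum transport { TRANSPORT_UD, TRANSPORT_RC };

struct peer_state {
    enum transport transport;
    unsigned long  msgs_sent;
    unsigned long  bytes_sent;
};

#define PROMOTE_MSG_COUNT 64            /* assumed: frequent small-message peer */
#define PROMOTE_MSG_BYTES (16 * 1024)   /* assumed: any medium/large message    */

static enum transport choose_transport(struct peer_state *p, size_t len)
{
    p->msgs_sent++;
    p->bytes_sent += len;
    if (p->transport == TRANSPORT_UD &&
        (len >= PROMOTE_MSG_BYTES || p->msgs_sent >= PROMOTE_MSG_COUNT))
        p->transport = TRANSPORT_RC;    /* set up an RC connection for this peer */
    return p->transport;
}

int main(void)
{
    struct peer_state peer = { TRANSPORT_UD, 0, 0 };
    size_t sizes[] = { 128, 256, 64 * 1024, 512 };

    for (int i = 0; i < 4; i++)
        printf("send of %6zu bytes -> %s\n", sizes[i],
               choose_transport(&peer, sizes[i]) == TRANSPORT_RC ? "RC" : "UD");
    return 0;
}
```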


cluster computing and the grid | 2007

Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach

Matthew J. Koop; Terry Jones; Dhabaleswar K. Panda

Clusters in the area of high-performance computing have been growing in size at a considerable rate. In these clusters, the dominant programming model is the Message Passing Interface (MPI), so the MPI library has a key role in resource usage and performance. To obtain maximal performance, many clusters deploy a high-speed interconnect between compute nodes. One such interconnect, InfiniBand, has been gaining in popularity due to its various features, including Remote Direct Memory Access (RDMA) and high performance. As a result, it is being deployed in a significant number of clusters and has been chosen as the standard interconnect for capacity clusters within the DOE Tri-Labs. As these clusters grow in size, care must be taken to ensure that resource usage does not increase too significantly with scale. In particular, the MPI library resource usage should not grow at a rate that will exhaust the node memory or starve user applications. In this paper we present our findings on current memory usage when all connections are created and design a message coalescing method to decrease memory usage significantly. Our models show that the default configuration of MVAPICH can grow to 1 GB per process for 8K processes, while our enhancements reduce usage by an order of magnitude to around 120 MB per process while maintaining near-equal performance. We have validated our design on a 575-node cluster and shown no performance degradation for a variety of applications. We also increase the attainable message rate by over 150%.
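
The coalescing idea is to pack small messages headed to the same destination back-to-back into one staging buffer and flush them as a single network operation, so fewer per-destination send buffers are needed. The sketch below shows that mechanism in simplified form; the buffer size and flush rule are illustrative assumptions, not MVAPICH's implementation.

```c
/*
 * Sketch of message coalescing (illustrative, not MVAPICH's code): small
 * sends to the same destination are packed into one staging buffer and
 * flushed as a single network operation when the buffer fills or on an
 * explicit flush.  Buffer size is an arbitrary assumption.
 */
#include <stdio.h>
#include <string.h>

#define COALESCE_BUF 4096

struct coalesce_queue {
    char   buf[COALESCE_BUF];
    size_t used;
    int    dest;
};

/* Stand-in for posting one network send of the packed buffer. */
static void flush(struct coalesce_queue *q)
{
    if (q->used == 0)
        return;
    printf("flush: one send of %zu coalesced bytes to rank %d\n", q->used, q->dest);
    q->used = 0;
}

/* Queue a small message (assumes len <= COALESCE_BUF); flush first if it would not fit. */
static void coalesced_send(struct coalesce_queue *q, const void *msg, size_t len)
{
    if (q->used + len > COALESCE_BUF)
        flush(q);
    memcpy(q->buf + q->used, msg, len);
    q->used += len;
}

int main(void)
{
    struct coalesce_queue q = { .used = 0, .dest = 7 };
    char payload[900] = {0};

    for (int i = 0; i < 10; i++)              /* ten 900-byte messages */
        coalesced_send(&q, payload, sizeof payload);
    flush(&q);                                /* drain whatever is left */
    return 0;
}
```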

Collaboration


Dive into Matthew J. Koop's collaborations.

Top Co-Authors

Wei Huang, Ohio State University
Abhinav Vishnu, Pacific Northwest National Laboratory
Qi Gao, Ohio State University
Adam Moody, Lawrence Livermore National Laboratory