Jonathan L. Perkins
Ohio State University
Publications
Featured research published by Jonathan L. Perkins.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Jaidev K. Sridhar; Matthew J. Koop; Jonathan L. Perkins; Dhabaleswar K. Panda
As cluster sizes head into tens of thousands, current job launch mechanisms do not scale, as they are limited by resource constraints as well as performance bottlenecks. The job launch process includes two phases: spawning of processes on processors and information exchange between processes for job initialization. Implementations of various programming models follow distinct protocols for the information exchange phase. We present the design of a scalable, extensible and high-performance job launch architecture for very large scale parallel computing. We present implementations of this architecture which achieve a speedup of more than 700% in launching a simple Hello World MPI application on 10,240 processor cores and also scale to more than 3 times the number of processor cores compared to prior solutions.
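The launch-time measurements refer to a simple Hello World MPI application; a minimal sketch of such a program (generic MPI, not the paper's launcher itself) looks like this:

```c
/* Minimal "Hello World" MPI program of the kind used to time job launch.
 * A generic sketch; the startup cost measured is essentially MPI_Init. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* startup cost being measured */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of launched processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```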
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Rong Shi; Sreeram Potluri; Khaled Hamidouche; Jonathan L. Perkins; Mingzhe Li; Davide Rossetti; Dhabaleswar K. Panda
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Traditionally, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further reduce this data movement bottleneck. However, existing designs in MPI libraries apply the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communications due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed designs with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
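For context, a minimal sketch of the kind of small-message GPU-to-GPU point-to-point exchange these designs target, assuming a CUDA-aware MPI library such as MVAPICH2 that accepts device pointers directly (buffer size and tag are arbitrary; run with two ranks):

```c
/* Small-message GPU-to-GPU exchange with a CUDA-aware MPI library.
 * Device pointers are passed directly to MPI; the library decides
 * internally whether to use the eager or rendezvous protocol. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int count = 1024;                   /* small message: 4 KB of ints */
    int *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, count * sizeof(int));   /* GPU buffer */
    cudaMemset(d_buf, 0, count * sizeof(int));

    if (rank == 0)
        MPI_Send(d_buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```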
International Conference on Cluster Computing | 2013
Hari Subramoni; Devendar Bureddy; Krishna Chaitanya Kandalla; Karl W. Schulz; Bill Barth; Jonathan L. Perkins; Mark Daniel Arnold; Dhabaleswar K. Panda
The goal of any scheduler is to satisfy users' demands for computation and achieve good overall system utilization by efficiently assigning jobs to resources. However, current state-of-the-art scheduling techniques do not intelligently balance node allocation based on the total bandwidth available between switches, which leads to oversubscription. Additionally, poor placement of processes can lead to network congestion and poor performance. In this paper, we explore the design of a network-topology-aware plugin for the SLURM job scheduler for modern InfiniBand-based clusters. We present designs to enhance the performance of applications with varying communication characteristics. Through our techniques, we are able to considerably reduce the amount of network contention observed during Alltoall / FFT operations. The results of our experimental evaluation indicate that our proposed technique is able to deliver up to a 9% improvement in the communication time of P3DFFT at 512 processes. We also see that our techniques are able to increase the performance of microbenchmarks that rely on point-to-point operations by up to 40% across message sizes. Our techniques were also able to improve the throughput of a 512-core cluster by up to 8%.
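A conceptual sketch of the core idea behind topology-aware allocation, not the actual SLURM plugin described in the paper: prefer free nodes under as few leaf switches as possible so that a job's traffic stays within switches and avoids oversubscribed uplinks. All names and the data layout are illustrative.

```c
/* Illustrative greedy topology-aware selection: satisfy a node request
 * from the leaf switches with the most free nodes first, minimizing the
 * number of switches a job spans. Conceptual sketch only. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    int free_nodes;            /* free nodes under this leaf switch */
} leaf_switch_t;

static int cmp_desc(const void *a, const void *b)
{
    /* sort leaf switches by free node count, largest first */
    return ((const leaf_switch_t *)b)->free_nodes -
           ((const leaf_switch_t *)a)->free_nodes;
}

/* Returns the number of leaf switches spanned, or -1 if unsatisfiable. */
static int allocate(leaf_switch_t *sw, int nsw, int nodes_needed)
{
    int used = 0;
    qsort(sw, nsw, sizeof(*sw), cmp_desc);
    for (int i = 0; i < nsw && nodes_needed > 0; i++) {
        int take = sw[i].free_nodes < nodes_needed ? sw[i].free_nodes
                                                   : nodes_needed;
        if (take > 0) {
            printf("take %d nodes from switch %s\n", take, sw[i].name);
            nodes_needed -= take;
            used++;
        }
    }
    return nodes_needed == 0 ? used : -1;
}

int main(void)
{
    leaf_switch_t sw[] = { {"leaf0", 4}, {"leaf1", 16}, {"leaf2", 8} };
    printf("switches used: %d\n", allocate(sw, 3, 20));
    return 0;
}
```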
Proceedings of the 21st European MPI Users' Group Meeting | 2014
Sourav Chakraborty; Hari Subramoni; Jonathan L. Perkins; Adam Moody; Mark Daniel Arnold; Dhabaleswar K. Panda
An efficient implementation of the Process Management Interface (PMI) is crucial to enable a scalable startup of MPI jobs. We propose three extensions to the PMI specification: a ring exchange collective, a broadcast hint to Put, and an enhanced Get. We design and evaluate several PMI implementations that reduce startup costs from scaling as O(P) to O(k), where k is the number of keys read by the processes on each node and P is the number of processes. Our experimental evaluations show these extensions can speed up the launch time of MPI jobs by 33% at 8,192 cores.
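The baseline these extensions speed up is the standard PMI Put / Fence / Get exchange performed during startup. A minimal sketch using the PMI-2 API as shipped with SLURM (pmi2.h) is below; treat the header path and details as an approximation, and note that such a program must be launched by a PMI-2-capable process manager (e.g., srun --mpi=pmi2).

```c
/* Sketch of the baseline PMI-2 Put / Fence / Get exchange during MPI
 * startup: each process publishes its endpoint address, synchronizes,
 * then reads its neighbor's address (O(k) Gets per process). */
#include <slurm/pmi2.h>
#include <stdio.h>

int main(void)
{
    int spawned, size, rank, appnum, vallen;
    char jobid[64], key[64], value[64];

    PMI2_Init(&spawned, &size, &rank, &appnum);
    PMI2_Job_GetId(jobid, sizeof(jobid));

    /* Put: publish this process's (illustrative) endpoint address. */
    snprintf(key, sizeof(key), "addr-%d", rank);
    snprintf(value, sizeof(value), "endpoint-of-%d", rank);
    PMI2_KVS_Put(key, value);

    /* Fence: global synchronization that makes all Puts visible. */
    PMI2_KVS_Fence();

    /* Get: read the right neighbor's address; the second argument is a
     * hint about which rank put the key. */
    snprintf(key, sizeof(key), "addr-%d", (rank + 1) % size);
    PMI2_KVS_Get(jobid, (rank + 1) % size, key, value, sizeof(value), &vallen);
    printf("rank %d read %s = %s\n", rank, key, value);

    PMI2_Finalize();
    return 0;
}
```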
Proceedings of the 21st European MPI Users' Group Meeting | 2014
Raghunath Rajachandrasekar; Jonathan L. Perkins; Khaled Hamidouche; Mark Daniel Arnold; Dhabaleswar K. Panda
The MPI Tools information interface (MPI_T), introduced as part of the MPI 3.0 standard, has been gaining momentum in both the MPI and performance tools communities. In this paper, we investigate the challenges involved in profiling the memory utilization characteristics of MPI libraries so that they can be exposed to tools and libraries leveraging the MPI_T interface. We propose three design alternatives to enable such profiling from within MPI, and study their viability in light of these challenges. We analyze the benefits and shortcomings of each of them in detail, with a particular focus on the performance and memory overheads that they introduce. We evaluate the performance and scalability of these designs using micro-benchmarks, MPI-level benchmarks and applications. The overhead of the proposed design amounts to just 0.8% of the MILC application runtime with 4,096 processes. The paper also presents a case study that uses the MPI_T memory profiling information in MVAPICH2 to optimize the memory utilization of UH3D application runs, where memory savings of up to 7.3x were achieved.
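A tool discovers whatever performance variables an MPI library chooses to export (including memory-usage counters, where provided) through the standard MPI_T calls; a minimal sketch that just enumerates the available pvars is shown below. The specific memory-related variable names exported by MVAPICH2 are implementation-defined and not assumed here.

```c
/* Enumerate the MPI_T performance variables exported by the MPI library.
 * Memory-profiling variables, where exported, appear in this list; their
 * names and classes are implementation-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_pvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_pvar_get_num(&num_pvar);
    for (int i = 0; i < num_pvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dtype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);
        printf("pvar %d: %s (%s)\n", i, name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```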
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Hari Subramoni; Albert Mathews Augustine; Mark W. Arnold; Jonathan L. Perkins; Xiaoyi Lu; Khaled Hamidouche; Dhabaleswar K. Panda
Modern high-end computing is being driven by the tight integration of several hardware and software components. On the hardware front, there are the multi-/many-core architectures (including accelerators and co-processors) and high-end interconnects like InfiniBand that are continually pushing the envelope of raw performance. On the software side, there are several high performance implementations of popular parallel programming models that are designed to take advantage of the high-end features offered by the hardware components and deliver multi-petaflop level performance to end applications. Together, these components allow scientists and engineers to tackle grand challenge problems in their respective domains.
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015
Sourav Chakraborty; Hari Subramoni; Adam Moody; Akshay Venkatesh; Jonathan L. Perkins; Dhabaleswar K. Panda
An efficient implementation of the Process Management Interface (PMI) is crucial to enable fast start-up of MPI jobs. We propose three extensions to the PMI specification: 1) a blocking all gather collective (PMIX_Allgather), 2) a non-blocking all gather collective (PMIX_Iallgather), and 3) a non-blocking fence (PMIX_KVS_Ifence). We design and evaluate several PMI implementations to demonstrate how such extensions reduce MPI start-up cost. In particular, when sufficient work can be overlapped, these extensions allow for a constant initialization cost of MPI jobs at different core counts. At 16,384 cores, the designs lead to a speedup of 2.88 times over the state-of-the-art start-up schemes.
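The overlap idea behind the non-blocking extensions can be sketched as follows. PMIX_KVS_Ifence is the paper's proposed extension, not a standard PMI/PMIx call, and the request type, the wait call, and the setup functions below are hypothetical stubs used only to make the pattern concrete.

```c
/* Conceptual sketch of overlapping the proposed non-blocking PMI fence
 * (PMIX_KVS_Ifence, from the paper) with other initialization work.
 * Every function and type here is a stand-in stub, not a real API. */
#include <stdio.h>

typedef int pmix_request_t;                      /* stand-in request handle */

static int PMIX_KVS_Ifence(pmix_request_t *req)  /* stub for proposed call */
{ *req = 1; printf("fence started (non-blocking)\n"); return 0; }

static int pmix_wait(pmix_request_t req)         /* hypothetical completion */
{ (void)req; printf("fence completed\n"); return 0; }

static void publish_local_endpoint(void)         /* KVS Put of local address */
{ printf("published local endpoint\n"); }

static void setup_shared_memory_channels(void)   /* node-local init work */
{ printf("set up shared-memory channels\n"); }

static void setup_hardware_contexts(void)        /* network-local init work */
{ printf("set up hardware contexts\n"); }

int main(void)
{
    pmix_request_t req;

    publish_local_endpoint();
    PMIX_KVS_Ifence(&req);            /* start global exchange, return early */

    /* Independent initialization overlapped with the fence; when enough
     * work fits here, startup cost becomes nearly constant in core count. */
    setup_shared_memory_channels();
    setup_hardware_contexts();

    pmix_wait(req);                   /* complete the fence before first Get */
    return 0;
}
```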
Proceedings of the 22nd European MPI Users' Group Meeting | 2015
Ammar Ahmad Awan; Khaled Hamidouche; Akshay Venkatesh; Jonathan L. Perkins; Hari Subramoni; Dhabaleswar K. Panda
As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every cycle of every compute device available in the system. From NICs to GPUs to co-processors, heterogeneous compute resources are the way forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. This has become an important design goal for messaging libraries like MVAPICH2 and Open MPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-aware non-blocking collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-aware benchmark and discuss the challenges associated with identifying and implementing performance parameters such as overlap, latency, the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication while the overlapped computation proceeds, and the effect of the complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-aware non-blocking collectives in MVAPICH2 and Open MPI.
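Overlap for a non-blocking collective is commonly derived by comparing the collective's pure latency against the total time when independent computation runs between the collective's start and its wait. A sketch of that measurement loop is below, using host buffers and MPI_Iallreduce as a stand-in; the benchmark in the paper covers GPU buffers, more collectives, and calibrated compute.

```c
/* Sketch of measuring overlap for a non-blocking collective: compare
 * pure collective latency with total time when independent computation
 * runs between MPI_Iallreduce and MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

#define N 4096
#define ITERS 100

static void dummy_compute(double *x, int n)
{
    for (int i = 0; i < n; i++)        /* stand-in for application compute */
        x[i] = x[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    static double sbuf[N], rbuf[N], work[N];
    double t0, t_pure = 0.0, t_overlap = 0.0;
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERS; i++) {
        /* pure non-blocking collective latency */
        t0 = MPI_Wtime();
        MPI_Iallreduce(sbuf, rbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        t_pure += MPI_Wtime() - t0;

        /* same collective with computation overlapped before the wait */
        t0 = MPI_Wtime();
        MPI_Iallreduce(sbuf, rbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        dummy_compute(work, N);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        t_overlap += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("pure = %f s, with overlapped compute = %f s\n",
               t_pure / ITERS, t_overlap / ITERS);

    MPI_Finalize();
    return 0;
}
```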
International Parallel and Distributed Processing Symposium | 2015
Sourav Chakraborty; Hari Subramoni; Jonathan L. Perkins; Ammar Ahmad Awan; Dhabaleswar K. Panda
Partitioned Global Address Space (PGAS) programming models like OpenSHMEM and hybrid models like OpenSHMEM+MPI can deliver high performance and improved programmability. However, current implementations of OpenSHMEM assume a fully-connected process model, which affects their performance and scalability. We address this critical issue by designing on-demand connection management support for OpenSHMEM, which significantly improves startup performance and reduces resource usage. We further enhance OpenSHMEM startup performance by utilizing non-blocking out-of-band communication APIs. We evaluate our designs using a set of micro-benchmarks and applications and observe a 30x reduction in OpenSHMEM initialization time and an 8.3x improvement in the execution time of a Hello World application at 8,192 processes. In particular, when sufficient work can be overlapped, we show that the use of non-blocking out-of-band communication APIs allows for a constant initialization cost of OpenSHMEM jobs at different core counts. We also obtain up to a 90% reduction in the number of network endpoints and up to a 35% improvement in application execution time with the NAS Parallel Benchmarks.
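The initialization and Hello World measurements refer to programs of roughly this shape; a minimal OpenSHMEM Hello World using the standard API is shown as a reference sketch:

```c
/* Minimal OpenSHMEM "Hello World" of the kind used to measure startup;
 * shmem_init() is where on-demand connection setup versus a fully
 * connected setup makes the difference. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();                        /* startup cost being measured */

    int me = shmem_my_pe();              /* this PE's index */
    int npes = shmem_n_pes();            /* total number of PEs */
    printf("Hello from PE %d of %d\n", me, npes);

    shmem_finalize();
    return 0;
}
```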
Cluster Computing and the Grid | 2016
Sourav Chakraborty; Hari Subramoni; Jonathan L. Perkins; Dhabaleswar K. Panda
Dense systems with a large number of cores per node are becoming increasingly popular. Existing designs of the Process Management Interface (PMI) show poor scalability, in terms of both performance and memory consumption, on such systems when a large number of processes concurrently access the PMI interface. Our analysis shows the local socket-based communication scheme used by PMI to be a major bottleneck. While using a shared-memory-based channel can avoid this bottleneck and thus reduce memory consumption and improve performance, there are several challenges associated with such a design. We investigate several such alternatives and propose a novel design based on a hybrid socket + shared memory communication protocol that uses multiple shared memory regions. This design can reduce the memory usage per node by a factor equal to the number of processes per node. Our evaluations show that memory consumption per node can be reduced by an estimated 1 GB with 1 million MPI processes and 16 processes per node. Additionally, the performance of PMI Get is improved by 1,000 times compared to the existing design. The proposed design is backward compatible, secure, and imposes negligible overhead.
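A conceptual sketch of the node-level idea behind such a hybrid design, not the MVAPICH2 implementation itself: the PMI agent publishes the key-value store once into a POSIX shared-memory region, and the many local processes serve their Gets by reading that mapping directly instead of each doing a socket round-trip. The region name, layout, and entry format below are illustrative only; older Linux systems may need -lrt at link time.

```c
/* Conceptual sketch: serve PMI Get from a node-local POSIX shared-memory
 * region instead of per-process sockets. Names and layout are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/pmi_kvs_demo"     /* illustrative region name */
#define MAX_ENTRIES 4
#define KEYLEN 32
#define VALLEN 32

struct kv_entry { char key[KEYLEN]; char value[VALLEN]; };

int main(void)
{
    size_t size = MAX_ENTRIES * sizeof(struct kv_entry);

    /* "Agent" side: create the region and publish the key-value pairs. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);
    struct kv_entry *kvs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    snprintf(kvs[0].key, KEYLEN, "addr-0");
    snprintf(kvs[0].value, VALLEN, "endpoint-of-0");

    /* "Client" side (normally another process on the node): Get by
     * scanning the mapping, with no socket round-trip to the agent. */
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (strcmp(kvs[i].key, "addr-0") == 0) {
            printf("Get(%s) = %s\n", kvs[i].key, kvs[i].value);
            break;
        }
    }

    munmap(kvs, size);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```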