Ammar Ahmad Awan
Ohio State University
                                 Network
                            
                            Latest external collaboration on country level. Dive into details by clicking on the dots.
                                 Publication
                            
                            Featured researches published by Ammar Ahmad Awan.
acm sigplan symposium on principles and practice of parallel programming | 2017
Ammar Ahmad Awan; Khaled Hamidouche; Jahanzeb Maqbool Hashmi; Dhabaleswar K. Panda
Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose, S-Caffe; a scalable and distributed Caffe adaptation for modern multi-GPU clusters. With an in-depth analysis of new requirements brought forward by the DL frameworks and limitations of current communication runtimes, we present a co-design of the Caffe framework and the MVAPICH2-GDR MPI runtime. Using the co-design methodology, we modify Caffes workflow to maximize the overlap of computation and communication with multi-stage data propagation and gradient aggregation schemes. We bring DL-Awareness to the MPI runtime by proposing a hierarchical reduction design that benefits from CUDA-Aware features and provides up to a massive 133x speedup over OpenMPI and 2.6x speedup over MVAPICH2 for 160 GPUs. S-Caffe successfully scales up to 160 K-80 GPUs for GoogLeNet (ImageNet) with a speedup of 2.5x over 32 GPUs. To the best of our knowledge, this is the first framework that scales up to 160 GPUs. Furthermore, even for single node training, S-Caffe shows an improvement of 14\% and 9\% over Nvidias optimized Caffe for 8 and 16 GPUs, respectively. In addition, S-Caffe achieves up to 1395 samples per second for the AlexNet model, which is comparable to the performance of Microsoft CNTK.
Proceedings of the 23rd European MPI Users' Group Meeting on | 2016
Ammar Ahmad Awan; Khaled Hamidouche; Akshay Venkatesh; Dhabaleswar K. Panda
Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require an efficient support for communicating unusually large messages across processes. And second, the communication buffers used by HPDA applications and DL frameworks generally reside on a GPUs memory. In this context, we observe that conventional MPI runtimes have been optimized over decades to achieve lowest possible communication latency for relatively smaller message sizes (up-to 1 Megabyte) and that too for CPU memory buffers. With the advent of CUDA-Aware MPI runtimes, a lot of research has been conducted to improve performance of GPU buffer based communication. However, little exists in current state of the art that deals with very large message communication of GPU buffers. In this paper, we investigate these new challenges by analyzing the performance bottlenecks in existing CUDA-Aware MPI runtimes like MVAPICH2-GDR, and propose hierarchical collective designs to improve communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL. To the best of our knowledge, this is the first work that addresses these new requirements where GPU buffers are used for communication with message sizes surpassing hundreds of megabytes. We highlight the design challenges for our work along with the details of design and implementation. In addition, we provide a comprehensive performance evaluation using a Micro-benchmark and a CUDA-Aware adaptation of Microsoft CNTK DL framework. We report up to 47% improvement in training time for CNTK using the proposed hierarchical MPI_Bcast design.
ieee international conference on high performance computing, data, and analytics | 2015
Hari Subramoni; Ammar Ahmad Awan; Khaled Hamidouche; Dmitry Pekurovsky; Akshay Venkatesh; Sourav Chakraborty; Karen Tomko; Dhabaleswar K. Panda
Several techniques have been proposed in the past for designing non-blocking collective operations on high-performance clusters. While some of them required a dedicated process/thread or periodic probing to progress the collective others needed specialized hardware solutions. The former technique, while applicable to any generic HPC cluster, had the drawback of stealing CPU cycles away from the compute task. The latter gave near perfect overlap but increased the total cost of the HPC installation due to need for specialized hardware and also had other drawbacks that limited its applicability. On the other hand, the Remote Direct Memory Access technology and high performance networks have been pushing the envelope of HPC performance to multi-petaflop levels. However, no scholarly work exists that explores the impact such RDMA technology can bring to the design of non-blocking collective primitives. In this paper, we take up this challenge and propose efficient designs of personalized non-blocking collective operations on top of the basic RDMA primitives. Our experimental evaluation shows that our proposed designs are able to deliver near perfect overlap of computation and communication for personalized collective operations on modern HPC systems at scale. At the microbenchmark level, the proposed RDMA-Aware collectives deliver improvements in latency of up to 89 times for MPI_Igatherv, 3.71 times for MPI_Ialltoall and, 3.23 times for MPI_Iscatter over the state-of-the-art designs. We also observe an improvement of up to 19 % for the P3DFFT kernel at 8,192 cores on the Stampede supercomputing system at TACC.
cluster computing and the grid | 2016
Ching-Hsiang Chu; Khaled Hamidouche; Akshay Venkatesh; Ammar Ahmad Awan; Dhabaleswar K. Panda
Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. Massive heterogeneous parallelism offered by these accelerators have led to GPU-Aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements have been implemented on CPU (or Host) only. However, with the advent of GPU technologies it has become important for MPI libraries to provide better design for their GPU (or Device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to the state-of-the-art algorithms to fully take advantage of the GPU capabilities like GPUDirect RDMA (GDR) and CUDA compute kernel to efficiently perform these operations. With our new designs, we report reduced execution time for all compute-based collectives up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of proposed designs for extremely large-scale GPU clusters.
international conference on parallel processing | 2017
Ching-Hsiang Chu; Xiaoyi Lu; Ammar Ahmad Awan; Hari Subramoni; Jahanzeb Maqbool Hashmi; Bracy Elton; Dhabaleswar K. Panda
Broadcast operations (e.g. MPI_Bcast) have been widely used in deep learning applications to exchange a large amount of data among multiple graphics processing units (GPUs). Recent studies have shown that leveraging the InfiniBand hardware-based multicast (IB-MCAST) protocol can enhance scalability of GPU-based broadcast operations. However, these initial designs with IB-MCAST are not optimized for multi-source broadcast operations with large messages, which is the common communication scenario for deep learning applications. In this paper, we first model existing broadcast schemes and analyze their performance bottlenecks on GPU clusters. Then, we propose a novel broadcast design based on message streaming to better exploit IB-MCAST and NVIDIA GPUDirect RDMA (GDR) technology for efficient large message transfer operation. The proposed design can provide high overlap among multi-source broadcast operations. Experimental results show up to 68% reduction of latency compared to state-of-the-art solutions in a benchmark-level evaluation. The proposed design also shows near-constant latency for a single broadcast operation as a system grows. Furthermore, it yields up to 24% performance improvement in the popular deep learning framework, Microsoft CNTK, which uses multi-source broadcast operations; notably, the performance gains are achieved without modifications to applications. Our model validation shows that the proposed analytical model and experimental results match within a 10% range. Our model also predicts that the proposed design outperforms existing schemes for multi-source broadcast scenarios with increasing numbers of broadcast sources in large-scale GPU clusters.
Proceedings of the 22nd European MPI Users' Group Meeting on | 2015
Ammar Ahmad Awan; Khaled Hamidouche; Akshay Venkatesh; Jonathan L. Perkins; Hari Subramoni; Dhabaleswar K. Panda
As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move forward. Another important trend, especially with the introduction of non-blocking collective communication in the latest MPI standard, is overlapping communication with computation. It has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present an important benchmark that allows the users of different MPI libraries to evaluate performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware benchmark and discuss the challenges associated with identifying and implementing performance parameters like overlap, latency, effect of MPI_Test() calls to progress communication, effect of independent GPU communication while the overlapped computation proceeds under the communication, and the effect of complexity, target, and scale of this overlapped computation. To illustrate the efficacy of the proposed benchmark, we provide a comparative performance evaluation of GPU-Aware Non-Blocking Collectives in MVAPICH2 and OpenMPI.
arXiv: Distributed, Parallel, and Cluster Computing | 2018
Ammar Ahmad Awan; Ching-Hsiang Chu; Hari Subramoni; Dhabaleswar K. Panda
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NCCL have been proposed. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/internode multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra- and internode broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK. The proposed solutions outperform the recently introduced NCCL2 library for small and medium message sizes and offer comparable/better performance for very large message sizes.
Machine Learning | 2017
Ammar Ahmad Awan; Hari Subramoni; Dhabaleswar K. Panda
Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit exploited GPUs to accelerate the training process. This has been primarily achieved by aggressive improvements in parallel hardware as well as through sophisticated software frameworks like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware and software has the potential to significantly enhance the performance of CPU-based DL training. In this paper, we provide a complete performance landscape of CPU- and GPU-based DNN training. We characterize performance of DNN training for AlexNet and ResNet-50 for a wide-range of CPU and GPU architectures including the latest Intel Xeon Phi (Knights Landing) processors and NVIDIA Pascal GPUs. We also present multi-node DNN training performance results for AlexNet and ResNet-50 using Intel Machine Learning Scaling (MLSL) Library and Intel-Caffe. In addition, we provide a CPU vs. GPU comparison for multi-node training using OSU-Caffe and Intel-Caffe. To the best of our knowledge, this is the first study that dives deeper into the performance of DNN training in a holistic manner yet provides an in-depth look at layer-wise performance for different DNNs. We provide multiple key insights: 1) Convolutions account for the majority of time (up to 83% time) consumed in DNN training, 2) GPU-based training continues to deliver excellent performance (up to 18% better than KNL) across generations of GPU hardware and software, and 3) Recent CPU-based optimizations like MKL-DNN and OpenMP-based thread parallelism leads to excellent speed-ups over under-optimized designs (up to 3.2X improvement for AlexNet training).
international parallel and distributed processing symposium | 2015
Sourav Chakraborty; Hari Subramoni; Jonathan L. Perkins; Ammar Ahmad Awan; Dhabaleswar K. Panda
Partitioned Global Address Space (PGAS) programming models like Open SHMEM and hybrid models like Open SHMEM+MPI can deliver high performance and improved programmability. However, current implementations of Open SHMEM assume a fully-connected process model which affects their performance and scalability. We address this critical issue by designing on-demand connection management support for Open SHMEM which significantly improves the startup performance and reduces the resource usage. We further enhance the Open SHMEM startup performance by utilizing non-blocking out-of-band communication APIs. We evaluate our designs using a set of micro benchmarks and applications and observe 30 times reduction in Open SHMEM initialization time and 8.3 times improvement in execution time of a Hello World application at 8,192 processes. In particular, when sufficient work can be overlapped, we show that use of non-blocking out-of-band communication APIs allow for a constant initialization cost of Open SHMEM jobs at different core counts. We also obtain up to 90% reduction in number of network endpoints and up to 35% improvement in application execution time with NAS Parallel Benchmarks.
OpenSHMEM 2015 Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies - Volume 9397 | 2015
Ammar Ahmad Awan; Khaled Hamidouche; Ching-Hsiang Chu; Dhabaleswar K. Panda
An ever increased push for performance in the HPC arena has led to a multitude of hybrid architectures in both software and hardware for HPC systems. Partitioned Global Address Space PGAS programming model has gained a lot of attention over the last couple of years. The main advantage of PGAS model is the ease of programming provided by the abstraction of a single memory across nodes of a cluster. OpenSHMEM implementations currently implement the OpenSHMEM 1.2 specification that provides interface for one-sided, atomic, and collective operations. However, the recent trend in HPC arena in general, and Message Passing Interface MPI community in specific, is to use Non-Blocking Collective NBC communication to efficiently overlap computation with communication to save precious CPU cycles. This work is inspired by encouraging performance numbers for NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper, we propose an NBC interface for OpenSHMEM, present its design, implementation, and performance evaluation. We discuss the NBC interface that has been modeled along the lines of MPI NBC interface and requires minimal changes to the function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OMB suite. Our performance evaluation shows that the proposed NBC implementation provides upi¾?to 96 percent overlap for different collectives with little NBC overhead.
