Khaled Hamidouche
Ohio State University
Publication
Featured research published by Khaled Hamidouche.
international conference on parallel processing | 2013
Sreeram Potluri; Khaled Hamidouche; Akshay Venkatesh; Devendar Bureddy; Dhabaleswar K. Panda
GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement continues to be a critical bottleneck in harnessing the full potential of a GPU. Data in the GPU memory has to be moved into the host memory before it can be sent over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using techniques like pipelining. GPUDirect RDMA is a feature introduced in CUDA 5.0 that allows third-party devices like network adapters to directly access data in GPU device memory over the PCIe bus. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand clusters. In this paper, we evaluate the first version of GPUDirect RDMA for InfiniBand and propose designs in the MVAPICH2 MPI library to efficiently take advantage of this feature. We highlight the limitations posed by current generation architectures in effectively using GPUDirect RDMA and address these issues through novel designs in MVAPICH2. To the best of our knowledge, this is the first work to demonstrate a solution for internode GPU-to-GPU MPI communication using GPUDirect RDMA. Results show that the proposed designs improve the latency of internode GPU-to-GPU communication using MPI_Send/MPI_Recv by 69% and 32% for 4 Byte and 128 KByte messages, respectively. The designs boost the uni-directional bandwidth achieved using 4 KByte and 64 KByte messages by 2x and 35%, respectively. We demonstrate the impact of the proposed designs using two end-applications: LBMGPU and AWP-ODC. They improve the communication times in these applications by up to 35% and 40%, respectively.
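The pipelining technique mentioned in the abstract above (staging GPU data through host memory in chunks so that the copy of one chunk overlaps the network send of the previous one) can be illustrated with a simple timing model. This is a minimal sketch, not MVAPICH2's implementation: the function names, the uniform per-chunk costs, and the example numbers are all assumptions made for illustration.

```python
def staged_transfer_time(chunks: int, copy_us: float, send_us: float) -> float:
    """Total time when every chunk is copied to the host and then sent,
    with no overlap between the two steps (hypothetical cost model)."""
    return chunks * (copy_us + send_us)


def pipelined_transfer_time(chunks: int, copy_us: float, send_us: float) -> float:
    """Total time for a two-stage pipeline with uniform per-chunk costs.

    The first copy and the last send cannot be overlapped; in between,
    the slower of the two stages sets the pace.
    """
    if chunks <= 0:
        return 0.0
    return copy_us + (chunks - 1) * max(copy_us, send_us) + send_us


if __name__ == "__main__":
    # Illustrative numbers only: 4 chunks, 10 us per copy, 10 us per send.
    print(staged_transfer_time(4, 10.0, 10.0))
    print(pipelined_transfer_time(4, 10.0, 10.0))
```

Under this model the benefit of pipelining grows with the number of chunks, while the one-off cost of the first copy and the last send remains. That fixed cost is one reason a direct path such as GPUDirect RDMA can be attractive for small messages.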
ieee international conference on high performance computing data and analytics | 2013
Sreeram Potluri; Devendar Bureddy; Khaled Hamidouche; Akshay Venkatesh; Krishna Chaitanya Kandalla; Hari Subramoni; Dhabaleswar K. Panda
Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOPS of performance on a single chip while providing x86_64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand HCA on the node and thus provides a low latency path for internode communication. However, drawbacks in the state-of-the-art chipsets like Sandy Bridge limit the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize the communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. Our designs improve the performance of the MPI_Alltoall operation by up to 65%, with 256 processes. They improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
acm sigplan symposium on principles and practice of parallel programming | 2017
Ammar Ahmad Awan; Khaled Hamidouche; Jahanzeb Maqbool Hashmi; Dhabaleswar K. Panda
Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose S-Caffe, a scalable and distributed Caffe adaptation for modern multi-GPU clusters. With an in-depth analysis of new requirements brought forward by the DL frameworks and limitations of current communication runtimes, we present a co-design of the Caffe framework and the MVAPICH2-GDR MPI runtime. Using the co-design methodology, we modify Caffe's workflow to maximize the overlap of computation and communication with multi-stage data propagation and gradient aggregation schemes. We bring DL-Awareness to the MPI runtime by proposing a hierarchical reduction design that benefits from CUDA-Aware features and provides up to a massive 133x speedup over OpenMPI and 2.6x speedup over MVAPICH2 for 160 GPUs. S-Caffe successfully scales up to 160 K80 GPUs for GoogLeNet (ImageNet) with a speedup of 2.5x over 32 GPUs. To the best of our knowledge, this is the first framework that scales up to 160 GPUs. Furthermore, even for single node training, S-Caffe shows an improvement of 14% and 9% over NVIDIA's optimized Caffe for 8 and 16 GPUs, respectively. In addition, S-Caffe achieves up to 1395 samples per second for the AlexNet model, which is comparable to the performance of Microsoft CNTK.
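The idea of a hierarchical reduction mentioned in the abstract above can be sketched in plain Python as a two-level sum: ranks first reduce within a group, then the group leaders reduce among themselves. This is only an illustration of the general pattern. The function name, the fixed group size, and the list-based "gradients" are assumptions, and the sketch does not reflect how S-Caffe or MVAPICH2-GDR actually implement their design.

```python
from typing import List, Sequence


def hierarchical_sum(gradients: Sequence[Sequence[float]], group_size: int) -> List[float]:
    """Sum equally sized gradient vectors in two levels (hypothetical helper).

    Level 1: every consecutive block of `group_size` ranks is reduced to one
             partial sum, as if held by that group's leader.
    Level 2: the leaders' partial sums are reduced to the final result.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")

    # Level 1: element-wise sum inside each group.
    partials = []
    for start in range(0, len(gradients), group_size):
        group = gradients[start:start + group_size]
        partials.append([sum(values) for values in zip(*group)])

    # Level 2: element-wise sum across the group leaders.
    return [sum(values) for values in zip(*partials)]


if __name__ == "__main__":
    # Four simulated ranks, two per group; values are illustrative only.
    ranks = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(hierarchical_sum(ranks, group_size=2))
```

The result equals a flat element-wise sum over all ranks. The potential benefit of the hierarchy lies in where the communication happens, for example keeping the first level inside a node, which this single-process sketch does not model.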
international conference on supercomputing | 2013
Khaled Hamidouche; Sreeram Potluri; Hari Subramoni; Krishna Chaitanya Kandalla; Dhabaleswar K. Panda
Xeon Phi, the latest Many Integrated Core (MIC) co-processor from Intel, packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. One of the easiest ways to take advantage of the MIC is to use compiler directives to offload appropriate compute tasks of an application. However, with the Xeon Phi being an expensive resource, it is believed that production systems will be designed in a heterogeneous manner with only a subset of compute nodes comprising the MIC co-processor. Moreover, not all applications will be able to take advantage of the complete compute power offered by a Xeon Phi. In such scenarios, the existing state-of-the-art frameworks, which require applications to be scheduled on compute nodes that have the MIC co-processor, lead to inefficient utilization of the computing power offered by the MIC. In order to address this limitation, it is critical to design an efficient framework that enables applications to offload compute tasks to remote MICs. In this paper, we take on this challenge and design MIC-RO, a novel framework to enable efficient remote offload on heterogeneous MIC clusters. To the best of our knowledge, this is the first design that enables application scientists to offload computation to remote MICs. Our experimental results show that, using MIC-RO, applications are able to offload computation to remote MICs with no overhead compared to offloading on local MICs. Moreover, MIC-RO outperforms the default Intel compiler based offload techniques by up to a factor of two for multiple benchmarks and application kernels.
high performance interconnects | 2013
Krishna Chaitanya Kandalla; Akshay Venkatesh; Khaled Hamidouche; Sreeram Potluri; Devendar Bureddy; Dhabaleswar K. Panda
The emergence of co-processors such as Intel Many Integrated Cores (MICs) is changing the landscape of supercomputing. The MIC is a memory constrained environment and its processors also operate at slower clock rates. Furthermore, the communication characteristics between MIC processes are also different compared to communication between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by the emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvements in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014
Jithin Jose; Sreeram Potluri; Hari Subramoni; Xiaoyi Lu; Khaled Hamidouche; Karl W. Schulz; Hari Sundar; Dhabaleswar K. Panda
While Hadoop holds the current Sort Benchmark record, previous research has shown that MPI-based solutions can deliver similar performance. However, most existing MPI-based designs rely on two-sided communication semantics. The emerging Partitioned Global Address Space (PGAS) programming model presents a flexible way to express parallelism for data-intensive applications. However, not all portions of data analytics applications are amenable to conversion using PGAS models. In this study, we propose a novel design of the out-of-core, k-way parallel sort algorithm that takes advantage of the features of both MPI and OpenSHMEM PGAS models. To the best of our knowledge, this is the first design of any data-intensive computing application using Hybrid MPI + PGAS models. Our experimental evaluation indicates that our proposed framework outperforms the existing MPI-based design by up to 45% at 8,192 processes. It also achieves a 7x improvement over Hadoop-based sort using the same amount of resources at 1,024 cores.
international conference on supercomputing | 2014
Hari Subramoni; Khaled Hamidouche; Akshay Venkatesh; Sourav Chakraborty; Dhabaleswar K. Panda
The Dynamic Connected (DC) InfiniBand transport protocol has recently been introduced by Mellanox to address several shortcomings of the older Reliable Connection (RC), eXtended Reliable Connection (XRC), and Unreliable Datagram (UD) transport protocols. DC aims to support all of the features provided by RC (such as RDMA, atomics, and hardware reliability) while allowing processes to communicate with any remote process with just one DC queue pair (QP), like UD. In this paper we present the salient features of the new DC protocol including its connection and communication models. We design new verbs-level collective benchmarks to study the behavior of the new DC transport and understand the performance/memory trade-offs it presents. We then use this knowledge to propose multiple designs for MPI over DC. We evaluate an implementation of our design in the MVAPICH2 MPI library using standard MPI benchmarks and applications. To the best of our knowledge, this is the first such design of an MPI library over the new DC transport. Our experimental results at the microbenchmark level show that the DC-based design in MVAPICH2 is able to deliver 42% and 43% improvement in latency for large message All-to-one exchanges over XRC and RC respectively. DC-based designs are also able to give 20% and 8% improvement for small message One-to-all exchanges over RC and XRC respectively. For the All-to-all communication pattern, DC is able to deliver performance comparable to RC/XRC while outperforming in memory consumption. At the application level, for NAMD on 620 processes, the DC-based designs in MVAPICH2 outperform designs based on RC, XRC, and UD by 22%, 10%, and 13% respectively in execution time. With DL-POLY, DC outperforms RC and XRC by 75% and 30%, respectively, in total completion time while delivering performance similar to UD.
ieee international conference on high performance computing, data, and analytics | 2014
Rong Shi; Sreeram Potluri; Khaled Hamidouche; Jonathan L. Perkins; Mingzhe Li; Davide Rossetti; Dhabaleswar K. Panda
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. Besides that, the newly introduced GPU Direct RDMA (GDR) is a promising solution to further address this data movement bottleneck. However, the existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communications due to the extra synchronization message exchange. In this paper, we propose new techniques to optimize internode GPU-to-GPU communications for small message sizes. Our designs to support the eager protocol include efficient support at both sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes, using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end-applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to 23.4% latency reduction for GPULBM and 58% increase in average TPS for HOOMD-blue, respectively.
ieee international conference on high performance computing data and analytics | 2015
Akshay Venkatesh; Abhinav Vishnu; Khaled Hamidouche; Nathan R. Tallent; Dhabaleswar K. Panda; Darren J. Kerbyson; Adolfy Hoisie
Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides a potential for energy and power savings if an appropriate power reduction technique such as core-idling or Dynamic Voltage and Frequency Scaling (DVFS) can be applied without affecting the application's performance. Existing techniques that exploit slack for power savings assume that application behavior repeats across iterations/executions. However, an increasing use of adaptive and data-dependent workloads combined with system factors (OS noise, congestion) negates this assumption. This paper proposes and implements Energy Aware MPI (EAM), an application-oblivious energy-efficient MPI runtime. EAM uses a combination of communication models for common MPI primitives (point-to-point, collective, progress, blocking/non-blocking) and an online observation of slack to maximize energy efficiency and to honor performance degradation limits. Each power lever incurs time overhead, which must be amortized over slack to minimize degradation. When predicted communication time exceeds a lever overhead, the lever is used as soon as possible to maximize energy efficiency. When a misprediction occurs, the lever(s) are used automatically at specific intervals for amortization. We implement EAM using MVAPICH2 and evaluate it on ten applications using up to 4,096 processes. Our performance evaluation on an InfiniBand cluster indicates that EAM can reduce energy consumption by 5-41% in comparison to the default approach, which prioritizes performance alone, with negligible (less than 4% in all cases) performance loss.
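The lever rule stated in the abstract above (a power lever is applied when the predicted communication time exceeds that lever's overhead) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from EAM or MVAPICH2: the `PowerLever` type, the `choose_lever` function, the overhead values, and the tie-breaking policy, which assumes that a lever with a larger overhead also saves more power.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PowerLever:
    """A power reduction technique and the time cost of using it (assumed values)."""
    name: str
    overhead_us: float


def choose_lever(predicted_comm_us: float,
                 levers: Sequence[PowerLever]) -> Optional[PowerLever]:
    """Pick a lever whose overhead is smaller than the predicted communication time.

    Among eligible levers, the one with the largest overhead is returned, under
    the illustrative assumption that costlier levers save more power. Returns
    None when no lever can be amortized over the predicted time.
    """
    eligible = [lever for lever in levers if predicted_comm_us > lever.overhead_us]
    return max(eligible, key=lambda lever: lever.overhead_us, default=None)


if __name__ == "__main__":
    # Overheads below are made-up numbers for illustration.
    levers = [PowerLever("dvfs", 10.0), PowerLever("core-idle", 100.0)]
    for predicted in (5.0, 50.0, 500.0):
        chosen = choose_lever(predicted, levers)
        print(predicted, chosen.name if chosen else "no lever")
```

The sketch covers only the prediction path. The abstract also describes a fallback for mispredictions, where levers are applied at specific intervals, which is not modeled here.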
international conference on cluster computing | 2013
Rong Shi; Sreeram Potluri; Khaled Hamidouche; Xiaoyi Lu; Karen Tomko; Dhabaleswar K. Panda
Accelerating High-Performance Linpack (HPL) on heterogeneous clusters with multi-core CPUs and GPUs has attracted a lot of attention from the High Performance Computing community. It is becoming common for large scale clusters to have GPUs on only a subset of nodes in order to limit system costs. The major challenge for HPL in this case is to efficiently take advantage of all the CPU and GPU resources available on a cluster. In this paper, we present a novel two-level workload partitioning approach for HPL that distributes workload based on the compute power of CPU/GPU nodes across the cluster. Our approach also handles multi-GPU configurations. Unlike earlier approaches for heterogeneous clusters with CPU and GPU nodes, our design takes advantage of asynchronous kernel launches and CUDA copies to overlap computation and CPU-GPU data movement. It uses techniques such as process grid reordering to reduce MPI communication/contention while ensuring load balance across nodes. Our experimental results using 32 GPU and 128 CPU nodes of Oakley, a research cluster at Ohio Supercomputer Center, show that our proposed approach can achieve more than 80% of the combined actual peak performance of CPU and GPU nodes. This provides a 47% and 63% increase over the HPL performance that can be reported using only CPU nodes and only GPU nodes, respectively.
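Distributing work in proportion to each node's compute power, as the abstract above describes, can be illustrated with a small Python sketch. It shows a single proportional split with largest-remainder rounding and is not the paper's two-level scheme. The function name, the notion of "blocks" as the unit of work, and the relative power figures are assumptions for illustration.

```python
from typing import List, Sequence


def partition_blocks(total_blocks: int, node_power: Sequence[float]) -> List[int]:
    """Split `total_blocks` work units across nodes in proportion to `node_power`.

    Fractional shares are rounded down, and the leftover units go to the nodes
    with the largest fractional remainders, so the shares always add up to
    `total_blocks`.
    """
    total_power = sum(node_power)
    if total_power <= 0:
        raise ValueError("total compute power must be positive")

    raw = [total_blocks * power / total_power for power in node_power]
    shares = [int(value) for value in raw]

    # Hand out the units lost to rounding, largest remainder first.
    leftover = total_blocks - sum(shares)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - shares[i],
                          reverse=True)
    for index in by_remainder[:leftover]:
        shares[index] += 1
    return shares


if __name__ == "__main__":
    # Two CPU-only nodes and one GPU node assumed to be twice as fast.
    print(partition_blocks(100, [1.0, 1.0, 2.0]))
```

A real HPL run also has to respect the process grid and block-cyclic data layout, so a proportional count like this would only be a starting point for such a design.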
