
Publications


Featured research published by Sameh S. Sharkawi.


International Parallel and Distributed Processing Symposium | 2009

Performance projection of HPC applications using SPEC CFP2006 benchmarks

Sameh S. Sharkawi; Don DeSota; Raj Panda; Rajeev Indukuru; Stephen Stevens; Valerie E. Taylor; Xingfu Wu

Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems, enable them to compare application performance across existing and future systems, and help HPC users with system procurement and application refinements. In this paper, we present a method for projecting the node-level performance of HPC applications using published data for an industry-standard benchmark suite, SPEC CFP2006, together with hardware performance counter data from one base machine. In particular, we project the performance of eight HPC applications onto four systems, built on processors from different vendors, using data from one base machine, the IBM p575. The projected performance of the eight applications differed from measured runtimes by 7.2% on average (standard deviation 5.3%) for IBM POWER6 systems. For two Intel-based systems with a different microarchitecture and Instruction Set Architecture (ISA) from the base machine, the average difference between projected and measured runtimes was 10.5% with a standard deviation of 8.2%.
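As a toy illustration of this kind of benchmark-based projection, the sketch below scales an application's base-machine runtime by a weighted geometric mean of benchmark slowdown ratios; the benchmark runtimes, weights, and combination rule are hypothetical stand-ins, not the counter-derived model used in the paper.

/* Illustrative sketch only: project an application's runtime on a target
 * system from benchmark runtimes, assuming the application behaves like a
 * weighted mix of the benchmarks.  All numbers and the weighting scheme
 * below are hypothetical. */
#include <math.h>
#include <stdio.h>

#define NBENCH 4

int main(void) {
    /* Hypothetical benchmark runtimes (seconds) on base and target systems. */
    double base_bench[NBENCH]   = { 120.0,  95.0, 210.0, 180.0 };
    double target_bench[NBENCH] = {  90.0,  80.0, 150.0, 160.0 };
    /* Hypothetical surrogate weights (sum to 1), e.g. derived from
     * hardware-counter similarity between the application and each benchmark. */
    double weight[NBENCH] = { 0.4, 0.3, 0.2, 0.1 };

    double app_base_runtime = 3600.0;  /* measured on the base machine */

    /* Weighted geometric mean of target/base slowdown ratios. */
    double log_ratio = 0.0;
    for (int i = 0; i < NBENCH; i++)
        log_ratio += weight[i] * log(target_bench[i] / base_bench[i]);

    double projected = app_base_runtime * exp(log_ratio);
    printf("projected target runtime: %.1f s\n", projected);
    return 0;
}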


International Conference on Parallel Processing | 2008

Performance Analysis and Optimization of Parallel Scientific Applications on CMP Cluster Systems

Xingfu Wu; Valerie E. Taylor; Charles W. Lively; Sameh S. Sharkawi

Chip multiprocessors (CMPs) are widely used for high performance computing, and these CMPs are being configured hierarchically to compose a node in a cluster system. A major challenge is using such cluster systems efficiently for large-scale scientific applications. In this paper, we quantify the performance gap resulting from using different numbers of processors per node; this information provides a baseline for the amount of optimization needed when using all processors per node on CMP clusters. We conduct detailed performance analysis to identify how applications can be modified to utilize all processors per node efficiently, focusing on two scientific applications: the gyrokinetic toroidal code (GTC), a 3D particle-in-cell magnetic fusion application, and a lattice Boltzmann method (LBM) code for simulating fluid dynamics. In terms of refinements, we use conventional techniques such as cache blocking, loop unrolling and loop fusion, and develop hybrid methods for optimizing MPI_Allreduce and MPI_Reduce. With these optimizations, application performance when utilizing all processors per node improved by up to 18.97% for GTC and 15.77% for LBM on up to 2048 total processors on the CMP clusters.
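The hybrid collective idea can be illustrated with standard MPI-3 calls: reduce within each node over a shared-memory communicator, allreduce across one leader per node, then broadcast the result inside the node. The sketch below shows only that general pattern, not the authors' implementation.

/* Sketch of a hybrid (intra-node + inter-node) allreduce using standard
 * MPI-3 calls; an illustration of the pattern, not the paper's code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Communicator of ranks sharing a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Communicator of one leader (node_rank == 0) per node. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    double local = (double)world_rank, node_sum = 0.0, global_sum = 0.0;

    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);
    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);

    if (world_rank == 0) printf("global sum = %f\n", global_sum);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}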


International Parallel and Distributed Processing Symposium | 2012

SWAPP: A Framework for Performance Projections of HPC Applications Using Benchmarks

Sameh S. Sharkawi; Don DeSota; Raj Panda; Stephen Stevens; Valerie E. Taylor; Xingfu Wu

Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement. SWAPP assumes that one has access to a base system and to only benchmark data for the target system; the target system itself is not available for running the HPC application. Projections are developed using the performance profiles of the benchmarks and the application on the base system together with the benchmark data for the target system. SWAPP projects the performance of the compute and communication components separately, then combines the two projections to obtain the full application projection. In this paper, SWAPP was used to project the performance of three NAS Multi-Zone benchmarks onto three systems (an IBM POWER6 575 cluster and an IBM Intel Westmere x5670 cluster, both using an InfiniBand interconnect, and an IBM Blue Gene/P with 3D torus and collective tree interconnects); the base system is an IBM POWER5+ 575 cluster. The projected performance of the three benchmarks was within an average error magnitude of 11.44% with a standard deviation of 2.64% across the three systems.
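A minimal numeric sketch of the combination step follows, using hypothetical compute/communication splits and scaling ratios; SWAPP derives these quantities from benchmark profiles rather than fixed constants.

/* Toy illustration of combining separately projected compute and
 * communication times.  The base-system split and the scaling ratios are
 * hypothetical placeholders. */
#include <stdio.h>

int main(void) {
    /* Hypothetical base-system profile (seconds). */
    double compute_base = 800.0, comm_base = 200.0;
    /* Hypothetical target/base scaling ratios, e.g. derived from compute
     * benchmarks and network benchmarks on the target system. */
    double compute_ratio = 0.75, comm_ratio = 1.20;

    double projected = compute_base * compute_ratio + comm_base * comm_ratio;
    printf("projected runtime on target: %.1f s\n", projected);
    return 0;
}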


International Parallel and Distributed Processing Symposium | 2016

Optimization and Analysis of MPI Collective Communication on Fat-Tree Networks

Sameer Kumar; Sameh S. Sharkawi; K A Nysal Jan

We explore new collective algorithms to optimize MPI_Bcast, MPI_Reduce and MPI_Allreduce on InfiniBand clusters. Our algorithms are specifically designed for fat-tree networks. We present multi-color k-ary trees with a novel mapping scheme to map the colors to fat-tree network nodes. Our multi-color tree algorithms result in better utilization of network links than traditional algorithms on fat-tree networks. We also present optimizations for clusters of SMP nodes, exploring both hybrid and multi-leader SMP techniques to achieve the best performance. We show the benefits of our algorithms with performance results from micro-benchmarks on POWER8 and x86 InfiniBand clusters. We also show performance improvements from our algorithms in the PARATEC and QBOX applications.
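For illustration, the sketch below broadcasts over a single k-ary tree using point-to-point MPI calls; the paper's contribution of multiple colored trees mapped onto distinct fat-tree links is not reproduced here, and the fan-out is an arbitrary choice.

/* Sketch of a broadcast over one k-ary tree rooted at rank 0, built from
 * point-to-point MPI calls.  Illustrative only; not the multi-color
 * fat-tree mapping described in the paper. */
#include <mpi.h>
#include <stdio.h>

#define K 4  /* tree fan-out (hypothetical) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double payload = 0.0;
    if (rank == 0) payload = 42.0;  /* root's data */

    /* Parent of rank r is (r-1)/K; children are K*r+1 .. K*r+K. */
    if (rank != 0) {
        int parent = (rank - 1) / K;
        MPI_Recv(&payload, 1, MPI_DOUBLE, parent, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    for (int c = K * rank + 1; c <= K * rank + K && c < size; c++)
        MPI_Send(&payload, 1, MPI_DOUBLE, c, 0, MPI_COMM_WORLD);

    printf("rank %d received %.1f\n", rank, payload);
    MPI_Finalize();
    return 0;
}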


Proceedings of the 23rd European MPI Users' Group Meeting | 2016

Optimization of Message Passing Services on POWER8 InfiniBand Clusters

Sameer Kumar; Robert S. Blackmore; Sameh S. Sharkawi; K A Nysal Jan; Amith R. Mamidala; T. J. Chris Ward

We present scalability and performance enhancements to MPI libraries on POWER8 InfiniBand clusters. We explore optimizations in the Parallel Active Messaging Interface (PAMI) libraries. We bypass IB verbs via low-level inline calls, resulting in low latencies and high message rates. MPI is enabled on POWER8 by extending both MPICH and Open MPI to call the PAMI libraries. The IBM POWER8 nodes have GPU accelerators to maximize the floating-point throughput of the node. We explore optimized algorithms for GPU-to-GPU communication with minimal processor involvement. We achieve a peak MPI message rate of 186 million messages per second. We also present scalable performance in the QBOX and AMG applications.
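A message rate such as the figure cited above is typically measured with a windowed microbenchmark: one rank posts a window of small nonblocking sends, a peer posts matching receives, and the rate is messages divided by elapsed time. The generic two-rank sketch below shows only that measurement pattern and is not the benchmark used in the paper.

/* Generic message-rate microbenchmark sketch (needs at least 2 ranks).
 * Window size, iteration count, and message size are arbitrary choices. */
#include <mpi.h>
#include <stdio.h>

#define WINDOW 64
#define ITERS  10000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};
    MPI_Request req[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("message rate: %.1f msg/s\n", (double)WINDOW * ITERS / elapsed);

    MPI_Finalize();
    return 0;
}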


Scalable Computing: Practice and Experience | 2001

Performance Analysis and Optimization of Parallel Scientific Applications on CMP Clusters

Xingfu Wu; Valerie E. Taylor; Charles W. Lively; Sameh S. Sharkawi


Archive | 2008

Workload performance projection via surrogate program analysis for future information handling systems

Robert H. Bell; Luigi Brochard; Donald R. DeSota; Venkat R. Indukuru; Rajendra D. Panda; Sameh S. Sharkawi


Archive | 2010

Generating constraint-compliant populations in population-based optimization

Jason F. Cantin; Sameh S. Sharkawi


Archive | 2008

Workload performance projection for future information handling systems using microarchitecture dependent data

Robert H. Bell; Luigi Brochard; Donald R. DeSota; Venkat R. Indukuru; Rajendra D. Panda; Sameh S. Sharkawi


Archive | 2014

Constructing a Logical Tree Topology in a Parallel Computer

Charles J. Archer; K A Nysal Jan; Sameh S. Sharkawi
