Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Khaled Z. Ibrahim is active.

Publication


Featured research published by Khaled Z. Ibrahim.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Optimized pre-copy live migration for memory intensive applications

Khaled Z. Ibrahim; Steven A. Hofmeyr; Costin Iancu; Eric Roman

Live migration is a widely used technique for resource consolidation and fault tolerance. KVM and Xen use iterative pre-copy approaches which work well in practice for commercial applications. In this paper, we study pre-copy live migration of MPI and OpenMP scientific applications running on KVM and present a detailed performance analysis of the migration process. We show that due to a high rate of memory changes, the current KVM rate control and target downtime heuristics do not cope well with HPC applications: statically choosing rate limits and downtimes is infeasible and current mechanisms sometimes provide sub-optimal performance. We present a novel on-line algorithm able to provide minimal downtime and minimal impact on end-to-end application performance. At the core of this algorithm is controlling migration based on the application memory rate of change.
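The core idea of the paper's on-line algorithm, controlling migration from the application's memory rate of change, can be illustrated with a toy stopping rule for iterative pre-copy. Everything below (names, thresholds, units) is an assumption for illustration; the paper's actual algorithm is not given in the abstract.

```python
# Hypothetical sketch of a pre-copy stopping rule driven by the memory
# rate of change. Function name and parameters are illustrative only.

def should_stop_precopy(dirty_pages_per_sec, transfer_pages_per_sec,
                        remaining_dirty_pages, target_downtime_sec):
    """Stop iterating when the residual dirty set can be sent within the
    target downtime, or when dirtying outpaces transfer (no progress)."""
    fits_in_downtime = (remaining_dirty_pages / transfer_pages_per_sec
                        <= target_downtime_sec)
    no_progress = dirty_pages_per_sec >= transfer_pages_per_sec
    return fits_in_downtime or no_progress

# A memory-intensive HPC app dirtying pages faster than the link can send:
assert should_stop_precopy(50_000, 40_000, 1_000_000, 0.5)
# A quiescent app whose residual dirty set fits the downtime budget:
assert should_stop_precopy(100, 40_000, 10_000, 0.5)
# Neither condition holds: keep pre-copying.
assert not should_stop_precopy(5_000, 40_000, 100_000, 0.5)
```

This captures why static rate limits fail for HPC codes: the dirty rate varies with the application phase, so the decision must be re-evaluated each iteration.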


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems

Kamesh Madduri; Khaled Z. Ibrahim; Samuel Williams; Eun-Jin Im; Stephane Ethier; John Shalf; Leonid Oliker

The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC sub-routines and include multi-level particle and grid decompositions designed to improve multi-node parallel scaling, particle binning for improved load balance, GPU acceleration of key subroutines, and memory-centric optimizations to improve single-node scaling and reduce memory utilization. The new hybrid MPI-OpenMP and MPI-OpenMP-CUDA GTC versions achieve up to a 2× speedup over the production Fortran code on four parallel systems - clusters based on the AMD Magny-Cours, Intel Nehalem-EP, IBM BlueGene/P, and NVIDIA Fermi architectures. Finally, strong scaling experiments provide insight into parallel scalability, memory utilization, and programmability trade-offs for large-scale gyrokinetic PIC simulations, while attaining a 1.6× speedup on 49,152 XE6 cores.


Parallel Computing | 2011

Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

Kamesh Madduri; Eun-Jin Im; Khaled Z. Ibrahim; Samuel Williams; Stephane Ethier; Leonid Oliker

The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7x on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
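Particle binning, one of the optimizations named above, can be sketched in a few lines: grouping particles by the grid cell they deposit charge into makes the scatter phase touch memory with far better locality. This is a toy 1-D version, not GTC's toroidal geometry; all names here are illustrative.

```python
# Illustrative sketch of particle binning for PIC charge deposition.
# The 1-D cell mapping below is an assumption for clarity only.

def bin_particles(positions, cell_width):
    """Group particle positions by cell index; returns {cell: [positions]}."""
    bins = {}
    for x in positions:
        cell = int(x // cell_width)
        bins.setdefault(cell, []).append(x)
    return bins

def deposit_charge(bins, n_cells):
    """After binning, deposition accumulates into each grid entry in turn
    instead of scattering randomly across the grid array."""
    grid = [0.0] * n_cells
    for cell, parts in bins.items():
        grid[cell] += len(parts)   # unit charge per particle
    return grid

bins = bin_particles([0.1, 2.7, 0.9, 2.2, 1.5], cell_width=1.0)
grid = deposit_charge(bins, n_cells=3)
# cells 0 and 2 each receive two particles, cell 1 receives one
assert grid == [2.0, 1.0, 2.0]
```

Binning also makes the per-cell work explicit, which is what enables the load-balancing and decomposition strategies the abstracts describe.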


High Performance Distributed Computing | 2016

Scaling Spark on HPC Systems

Nicholas Chaimov; Allen D. Malony; Shane Canon; Costin Iancu; Khaled Z. Ibrahim; Jay Srinivasan

We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it makes single-node performance up to 4x slower than on a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.
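The principle behind the file pooling layer can be shown with a toy handle cache: repeated opens of the same path hit the pool instead of issuing another metadata operation to the Lustre metadata server. The class and its API below are assumptions for illustration; the paper's actual layer is not described in the abstract.

```python
# Minimal sketch of a file-handle pool that avoids repeated metadata
# operations. io.StringIO stands in for a real file handle; the class
# name and interface are hypothetical.

import io

class FilePool:
    def __init__(self):
        self._pool = {}
        self.opens = 0            # actual (metadata-touching) opens

    def open(self, path):
        """Return a cached handle, opening the file only on first use."""
        if path not in self._pool:
            self.opens += 1
            self._pool[path] = io.StringIO()
        return self._pool[path]

    def close_all(self):
        self._pool.clear()

pool = FilePool()
for _ in range(100):              # e.g. tasks re-reading one partition
    pool.open("part-00000")
assert pool.opens == 1            # 99 metadata round-trips avoided
```

On a workstation the saved syscalls are cheap; on Lustre each avoided open skips a round-trip to a shared metadata server, which is why the paper reports up to 2.8x per-node improvement.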


International Conference on Supercomputing | 2012

Congestion avoidance on manycore high performance computing systems

Miao Luo; Dhabaleswar K. Panda; Khaled Z. Ibrahim; Costin Iancu

Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo, which employs a reactive approach, e.g., congestion control mechanisms are activated only when resources have been exhausted. We present a core stateless optimization approach based on open-loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and the Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvements, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI respectively. We evaluate inline (each task makes independent decisions) and proxy (server) congestion avoidance designs. Our runtime provides both performance and performance portability. We improve all-to-all collective performance by up to 4X and provide better performance than vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe that their networks will be underprovisioned. In this situation, proactive congestion avoidance might become mandatory for performance improvement and portability.
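Open-loop end-point throttling, as the abstract describes it, amounts to capping the number of messages in flight per core and queueing the overflow locally. The sketch below shows that mechanism in isolation; the cap, class name, and queueing policy are illustrative assumptions, not the runtimes' actual implementation.

```python
# Toy per-core throttle: inject up to max_in_flight messages, queue the
# rest, and drain one queued message per completion. Names are hypothetical.

from collections import deque

class Throttle:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.pending = deque()

    def send(self, msg):
        """Inject msg if under the cap, otherwise queue it locally."""
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1          # would post to the NIC here
        else:
            self.pending.append(msg)

    def on_complete(self):
        """A completion frees a slot; drain one queued message if any."""
        self.in_flight -= 1
        if self.pending:
            self.pending.popleft()
            self.in_flight += 1

t = Throttle(max_in_flight=4)
for i in range(10):
    t.send(i)
assert (t.in_flight, len(t.pending)) == (4, 6)
t.on_complete()
assert (t.in_flight, len(t.pending)) == (4, 5)
```

The scheme is "core stateless" in the paper's sense because each end point enforces its own cap without any global coordination or feedback from the network.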


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Kinetic turbulence simulations at extreme scale on leadership-class systems

Bei Wang; Stephane Ethier; William Tang; Timothy J. Williams; Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Leonid Oliker

Reliable predictive simulation capability addressing confinement properties in magnetically confined fusion plasmas is critically important for ITER, a 20 billion dollar international burning plasma device under construction in France. The complex study of kinetic turbulence, which can severely limit the energy confinement and impact the economic viability of fusion systems, requires simulations at extreme scale for such an unprecedented device size. Our newly optimized, global, ab initio particle-in-cell code solving the nonlinear equations underlying gyrokinetic theory achieves excellent performance with respect to "time to solution" at the full capacity of the IBM Blue Gene/Q, on the 786,432 cores of Mira at ALCF and, recently, the 1,572,864 cores of Sequoia at LLNL. Recent multithreading and domain decomposition optimizations in the new GTC-P code represent critically important software advances for modern, low-memory-per-core systems by enabling routine simulations at unprecedented size (130 million grid points, ITER scale) and resolution (65 billion particles).


Lawrence Berkeley National Laboratory | 2010

TORCH Computational Reference Kernels - A Testbed for Computer Science Research

Alex Kaiser; Samuel Williams; Kamesh Madduri; Khaled Z. Ibrahim; David H. Bailey; James Demmel; Erich Strohmaier

For decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models in order to improve application performance, efficiency, and productivity. Unfortunately, without overarching advice about future directions in these areas, individual guidance is inferred from the existing software/hardware ecosystem, and each discipline often conducts its research independently, assuming all other technologies remain fixed. In today's rapidly evolving world of on-chip parallelism, isolated and iterative improvements to performance may miss superior solutions in the same way gradient descent optimization techniques may get stuck in local minima. To combat this, we present TORCH: A Testbed for Optimization ResearCH. These computational reference kernels define the core problems of interest in scientific computing without mandating a specific language, algorithm, programming model, or implementation. To complement the kernel (problem) definitions, we provide a set of algorithmically-expressed verification tests that can be used to verify that a hardware/software co-designed solution produces an acceptable answer. Finally, to provide some illumination as to how researchers have implemented solutions to these problems in the past, we provide a set of reference implementations in C and MATLAB.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms

Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Bei Wang; Stephane Ethier; Leonid Oliker

The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This work presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality vs. synchronization tradeoffs on CPU and GPU-based architectures. Our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA, achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems, and scales efficiently to tens of thousands of cores.


International Parallel and Distributed Processing Symposium | 2014

An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

Khaled Z. Ibrahim; Paul Hargrove; Costin Iancu; Katherine A. Yelick

The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism, 2) message ordering constraints, 3) per node and per core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics respectively. Our results indicate that relaxing the message delivery order can improve performance up to 4.6x when compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, we argue that exposing out-of-order delivery at the application level is required for the next-generation programming models. Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.
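Why relaxed delivery order favors one-sided communication can be shown with a toy model: one-sided puts carry their own target offsets, so any arrival order yields the same final memory image, whereas two-sided receives must be matched in posted order. The code below is an illustration of that semantic point only, not of the Gemini hardware or the UPC/MPI runtimes.

```python
# Toy model of one-sided puts under out-of-order delivery: each put names
# its target offset, so arrival order is irrelevant to the final state.
# Function name and data layout are hypothetical.

def apply_one_sided(target, puts):
    """Apply (offset, value) puts in whatever order they arrive."""
    for offset, value in puts:
        target[offset] = value
    return target

puts = [(2, "c"), (0, "a"), (1, "b")]          # delivered out of order
assert apply_one_sided([None] * 3, puts) == ["a", "b", "c"]
assert apply_one_sided([None] * 3, reversed(puts)) == ["a", "b", "c"]
```

Two-sided message matching has no such freedom: the receiver pairs incoming messages with posted receives in order, which is the ordering constraint the paper measures a penalty for.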


International Conference on Parallel Processing | 2010

Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions

Khaled Z. Ibrahim; Erich Strohmaier

Characterizing a memory reference stream using its reuse distance distribution can enable predicting the performance on a given architecture. Benchmarks can subject an architecture to a limited set of reuse distance distributions, but they cannot exhaustively test it. In contrast, Apex-Map, a synthetic memory probe with parameterized locality, can provide better coverage of the machine use scenarios. Unfortunately, it requires considerable expertise to relate an application's memory behavior to an Apex-Map parameter set. In this work we present a mathematical formulation that describes the relation between Apex-Map and reuse distance distributions. We also introduce a process through which we can automate the estimation of Apex-Map locality parameters for a given application. This process finds the best parameters for Apex-Map probes that generate a reuse distance distribution similar to that of the original application. We tested this scheme on benchmarks from Scalable Synthetic Compact Applications and Unbalanced Tree Search, and we show that this scheme provides an accurate Apex-Map parameterization with a small percentage of mismatch in reuse distance distributions, about 3% on average and less than 8% in the worst case, on the tested applications.
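The reuse distance the paper builds on is, for each reference, the number of distinct addresses touched since the previous reference to the same address (infinite on first touch). A direct, quadratic-time sketch makes the definition concrete; production tools use tree-based O(n log n) algorithms instead.

```python
# Straightforward reuse-distance computation for a reference stream.
# O(n^2) for clarity only; the function name is illustrative.

def reuse_distances(stream):
    dists = []
    last_seen = {}
    for i, addr in enumerate(stream):
        if addr in last_seen:
            # distinct addresses between the two references to addr
            distinct = len(set(stream[last_seen[addr] + 1 : i]))
            dists.append(distinct)
        else:
            dists.append(float("inf"))   # cold miss: never seen before
        last_seen[addr] = i
    return dists

# a b c a b: 'a' re-referenced after distinct {b, c}; 'b' after {c, a}
assert reuse_distances(list("abcab")) == [float("inf")] * 3 + [2, 2]
```

The histogram of these distances is the distribution that the paper's fitting process matches against Apex-Map's parameterized locality.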

Collaboration


Dive into Khaled Z. Ibrahim's collaborations.

Top Co-Authors

Samuel Williams, Lawrence Berkeley National Laboratory
Kamesh Madduri, Pennsylvania State University
Costin Iancu, Lawrence Berkeley National Laboratory
Leonid Oliker, Lawrence Berkeley National Laboratory
Stephane Ethier, Princeton Plasma Physics Laboratory
Bei Wang, Princeton University
Erich Strohmaier, Lawrence Berkeley National Laboratory
Steven A. Hofmeyr, Lawrence Berkeley National Laboratory
John Shalf, Lawrence Berkeley National Laboratory