Publication


Featured research published by Abdel-Hameed A. Badawy.


International Conference on Supercomputing | 2001

Evaluating the impact of memory system performance on software prefetching and locality optimizations

Abdel-Hameed A. Badawy; Aneesh Aggarwal; Donald Yeung; Chau-Wen Tseng

Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we evaluate the impact of memory trends on the effectiveness of software prefetching and locality optimizations for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We find that for many applications, software prefetching outperforms locality optimizations when there is sufficient memory bandwidth, but locality optimizations outperform software prefetching under bandwidth-limited conditions. The break-even point (for 1 GHz processors) occurs at roughly 2.5 GBytes/sec on today's memory systems, and will increase on future memory systems. We also study the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the techniques provides robustness to changes in memory bandwidth and latency, but does not yield additional performance gains. We propose and evaluate several algorithms to better integrate software prefetching and locality optimizations, including a modified tiling algorithm, padding for prefetching, and index prefetching.
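
To make the trade-off concrete, the sketch below (not from the paper) contrasts the two techniques on simple loops: a prefetching loop that issues requests ahead of use to hide latency, and a tiled transpose that reduces bandwidth demand by keeping each block cache-resident. The prefetch distance and tile size are assumed tuning parameters, not values from the study; __builtin_prefetch is the GCC/Clang builtin.

```cpp
// Minimal sketch contrasting software prefetching with a loop-tiling
// locality optimization. PF_DIST and TILE are hypothetical parameters.
#include <cstddef>

constexpr std::size_t PF_DIST = 16;  // elements ahead to prefetch (assumed)
constexpr std::size_t TILE    = 64;  // tile edge in elements (assumed)

// Software prefetching: hide memory latency by requesting data early.
void scale_prefetch(double* a, std::size_t n, double s) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST]);  // GCC/Clang builtin
        a[i] *= s;
    }
}

// Locality optimization: tile a transpose so each block stays cache-resident,
// reducing bandwidth demand instead of hiding latency.
void transpose_tiled(const double* in, double* out, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```

In practice both parameters must be tuned to the target memory system, which is exactly the sensitivity the paper measures.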


IEEE Photonics Journal | 2015

The Case for Hybrid Photonic Plasmonic Interconnects (HyPPIs): Low-Latency Energy-and-Area-Efficient On-Chip Interconnects

Shuai Sun; Abdel-Hameed A. Badawy; Vikram K. Narayana; Tarek A. El-Ghazawi; Volker J. Sorger

Moore's law for traditional electronic integrated circuits is facing increasingly more challenges in both physics and economics. Among those challenges is the fact that the bandwidth per compute operation on the chip is dropping, whereas the energy needed for data movement keeps rising. We benchmark various interconnect technologies, including electrical, photonic, and plasmonic options. We contrast them with hybrid photonic-plasmonic interconnects (HyPPIs), in which plasmonics is used for active manipulation devices and photonics for passive propagation elements, and we further propose another novel hybrid link that utilizes an on-chip laser for intrinsic modulation, thus bypassing electro-optic modulation. Our analysis shows that such hybridization overcomes the shortcomings of both pure photonic and pure plasmonic links. It also shows superiority in a variety of performance parameters such as point-to-point latency, energy efficiency, throughput, energy-delay product, crosstalk coupling length, and bit flow density, a new metric that we define to reveal the trade-off between footprint and performance. Our proposed HyPPIs show significantly superior performance compared with the other links.
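
As a rough illustration of how such links can be compared, the sketch below computes an energy-delay product and a throughput-per-area figure in the spirit of the paper's bit flow density metric. All numbers, link names, and the exact formulas are illustrative assumptions, not values or definitions taken from the paper.

```cpp
// Minimal sketch of per-link figures of merit under assumed numbers.
#include <cstdio>

struct Link {
    const char* name;
    double energy_pj_per_bit;   // energy per bit (pJ/bit), assumed
    double latency_ps;          // point-to-point latency (ps), assumed
    double throughput_gbps;     // sustained throughput (Gb/s), assumed
    double footprint_um2;       // on-chip area (um^2), assumed
};

int main() {
    Link links[] = {
        {"electrical", 1.0, 200.0, 10.0,  50.0},
        {"photonic",   0.3, 120.0, 40.0, 400.0},
        {"hybrid",     0.2,  80.0, 40.0, 150.0},
    };
    for (const Link& l : links) {
        double edp = l.energy_pj_per_bit * l.latency_ps;   // pJ*ps per bit
        double bfd = l.throughput_gbps / l.footprint_um2;  // Gb/s per um^2
        std::printf("%-10s EDP=%8.1f pJ*ps/bit  flow density=%6.3f Gb/s/um^2\n",
                    l.name, edp, bfd);
    }
    return 0;
}
```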


Microprocessors and Microsystems | 2017

MorphoNoC: Exploring the design space of a configurable hybrid NoC using nanophotonics

Vikram K. Narayana; Shuai Sun; Abdel-Hameed A. Badawy; Volker J. Sorger; Tarek A. El-Ghazawi

As diminishing feature sizes drive down the energy for computations, the power budget for on-chip communication is steadily rising. Furthermore, the increasing number of cores is placing a huge performance burden on the network-on-chip (NoC) infrastructure. While NoCs are designed as regular architectures that allow scaling to hundreds of cores, the lack of a flexible topology gives rise to higher latencies, lower throughput, and increased energy costs. In this paper, we explore MorphoNoCs: scalable, configurable, hybrid NoCs obtained by extending regular electrical networks with configurable nanophotonic links. To design MorphoNoCs, we first carry out a detailed study of the design space for Multi-Write Multi-Read (MWMR) nanophotonic links. After identifying optimal design points, we discuss the router architecture for deploying them in hybrid electronic-photonic NoCs. We then study the design space at the network level by varying the waveguide lengths and the number of hybrid routers, which allows us to carry out energy-latency trade-offs. For our evaluations, we adopt traces from synthetic benchmarks as well as the NAS Parallel Benchmark suite. Our results indicate that MorphoNoCs can achieve latency improvements of up to 3.0× or energy improvements of up to 1.37× over the base electronic network.
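
The following sketch illustrates the shape of such an exploration: sweeping waveguide length and the number of hybrid routers and recording an energy-latency point for each configuration. The cost models here are invented placeholders; the paper derives them from detailed MWMR link and router models and from benchmark traces.

```cpp
// Minimal sketch of a network-level design-space sweep with placeholder
// cost models (not the paper's models).
#include <cstdio>

int main() {
    for (double wg_len_mm = 1.0; wg_len_mm <= 8.0; wg_len_mm *= 2.0) {
        for (int hybrid_routers = 4; hybrid_routers <= 64; hybrid_routers *= 2) {
            // Placeholder trends: latency falls with more photonic shortcuts,
            // energy rises with waveguide length and router count.
            double latency_cycles = 40.0 / hybrid_routers + 0.5 * wg_len_mm;
            double energy_pj      = 0.8 * wg_len_mm + 0.05 * hybrid_routers;
            std::printf("wg=%4.1f mm routers=%2d -> latency=%5.2f cyc energy=%5.2f pJ\n",
                        wg_len_mm, hybrid_routers, latency_cycles, energy_pj);
        }
    }
    return 0;
}
```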


ACM Transactions on Architecture and Code Optimization | 2016

Exploiting Hierarchical Locality in Deep Parallel Architectures

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.


International Performance Computing and Communications Conference | 2016

LMStr: Local Memory Store, the case for hardware-controlled scratchpad memory for general-purpose processors

Nafiul Alam Siddique; Abdel-Hameed A. Badawy; Jeanine Cook; David Resnick

In this paper, we present a hardware-controlled on-chip memory called Local Memory Store (LMStr) that can be used either solely as a scratchpad or as a combination of scratchpad and cache, storing any variable specified by the programmer or extracted by the compiler. LMStr differs from a traditional scratchpad in that it is hardware-controlled and it stores the same type of variables in a block that is allocated based on availability and demand. In this initial work, we focus on identifying the potential of LMStr, namely, the advantages of storing temporary and program variables in blocks in LMStr, and we compare its performance against a regular cache. To the best of our knowledge, this is the first work in which a scratchpad is used in a generalized way with a focus on storing temporary and programmer-specified variables in blocks. We evaluate LMStr on a micro-benchmark and a set of mini-applications from the Mantevo suite. We simulate LMStr in the Structural Simulation Toolkit (SST) simulator. LMStr provides a 10% reduction in average data movement between on-chip and off-chip memory compared to a traditional cache hierarchy.
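
As a toy illustration of why variable-granularity storage can reduce traffic, the sketch below (assumptions mine, not the LMStr design) counts off-chip bytes moved for a sparse, strided access pattern under a conventional 64-byte-line cache versus a scratchpad-style store that transfers only the 8-byte variables the program actually names.

```cpp
// Minimal sketch: worst-case off-chip traffic for cache-line-granularity
// transfers versus variable-granularity transfers. All sizes are assumed.
#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t n = 1 << 20;      // number of 8-byte elements in the array
    const std::size_t stride = 8;       // touch every 8th element (sparse use)
    const std::size_t line_bytes = 64;  // cache line size
    const std::size_t var_bytes = 8;    // one double per access

    std::size_t accesses = n / stride;
    // Worst case for the cache: each strided access lands in a new line,
    // so a full line is moved per touched element.
    std::size_t cache_traffic = accesses * line_bytes;
    // Scratchpad-style store: only the named variables are transferred.
    std::size_t scratchpad_traffic = accesses * var_bytes;

    std::printf("cache traffic:      %zu bytes\n", cache_traffic);
    std::printf("scratchpad traffic: %zu bytes\n", scratchpad_traffic);
    return 0;
}
```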


International Conference on Computational Science | 2016

Cache Utilization as a Locality Metric - A Case Study on the Mantevo Suite

Nafiul Alam Siddique; Patricia Grubel; Abdel-Hameed A. Badawy; Jeanine Cook

Cache hierarchies have long been utilized to minimize the latency of main memory accesses by caching frequently used data closer to the processor. Significant research has been done to identify the most crucial metrics of cache performance. Though the majority of research focuses on measuring cache hit rates and data movement as the major cache performance metrics, cache utilization can be equally important. In this work, we present cache utilization performance metrics that provide insight into application behavior. We define cache utilization in two forms: 1) the fraction of data bytes in a cache line that are actually accessed at least once before eviction from cache, and 2) the access frequency of data bytes in a cache line. We discuss the relationship between the utilization measurement and two important application properties: 1) spatial locality, the use of data located near data that has already been accessed, and 2) temporal locality, the reuse of data over time. In addition to measuring cache line utilization, we present conventional performance metrics as well to provide a holistic understanding of cache behavior. To facilitate this work, we build a memory simulator incorporated into the Structural Simulation Toolkit (SST). We measure and analyze the performance of several scientific mini-applications from the Mantevo suite [1]. This work demonstrates that caches are not necessarily the best on-chip solution for all types of applications due to the fixed cache line size.
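
The first metric is straightforward to compute in a trace-driven model. The sketch below (a simplified stand-in for the SST-based simulator used in the paper) tracks, for a tiny direct-mapped cache, which bytes of each resident line are touched before eviction and reports the average line utilization; the cache geometry and the access trace are assumed.

```cpp
// Minimal sketch of the byte-level cache line utilization metric.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <bitset>

constexpr int LINE_BYTES = 64;
constexpr int NUM_SETS   = 256;   // direct-mapped, 16 KiB total (assumed)

struct Line {
    bool valid = false;
    std::uint64_t tag = 0;
    std::bitset<LINE_BYTES> touched;  // which bytes were accessed
};

int main() {
    std::vector<Line> cache(NUM_SETS);
    std::uint64_t lines_evicted = 0, bytes_touched = 0;

    auto access = [&](std::uint64_t addr, int size) {
        std::uint64_t line_addr = addr / LINE_BYTES;
        Line& l = cache[line_addr % NUM_SETS];
        if (!l.valid || l.tag != line_addr) {          // miss: evict old line
            if (l.valid) {
                ++lines_evicted;
                bytes_touched += l.touched.count();
            }
            l.valid = true;
            l.tag = line_addr;
            l.touched.reset();
        }
        for (int b = 0; b < size; ++b)                 // mark accessed bytes
            l.touched.set((addr % LINE_BYTES + b) % LINE_BYTES);
    };

    // Illustrative trace: 8-byte loads striding by 128 bytes (poor utilization).
    for (std::uint64_t a = 0; a < (1u << 20); a += 128) access(a, 8);

    // Lines still resident at the end of the trace are ignored for brevity.
    if (lines_evicted)
        std::printf("avg line utilization: %.1f%%\n",
                    100.0 * bytes_touched / (lines_evicted * LINE_BYTES));
    return 0;
}
```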


Journal of Microbiology & Biology Education | 2013

A Survey Tool for Assessing Student Expectations Early in a Semester

Karl R. B. Schmitt; Elise A. Larsen; Matthew W. Miller; Abdel-Hameed A. Badawy; Mara Dougherty; Artesha Taylor Sharma; Katie M. Hrapczynski; Andrea A. Andrew; Breanne Robertson; Alexis Y. Williams; Sabrina Kramer; Spencer Benson

Quality learning is fostered when faculty members are aware of and address student expectations for course learning activities and assessments. However, faculty often have difficulty identifying and addressing student expectations given variations in students’ backgrounds, experiences, and beliefs about education. Prior research has described significant discrepancies between student and faculty expectations that result from cultural backgrounds (1), technological expertise (2), and ‘teaching dimensions’ as described by Trudeau and Barnes (4). Such studies illustrate the need for tools to identify and index student expectations, which can be used to facilitate a dialogue between instructor and students. Here we present the results of our work to develop, refine, and deploy such a tool.


IEEE Annual Computing and Communication Workshop and Conference | 2017

StAdHyTM: A Statically Adaptive Hybrid Transactional Memory: A scalability study on large parallel graphs

Mohammad Abdul Qayum; Abdel-Hameed A. Badawy; Jeanine Cook

In this paper, we present a Statically Adaptive Hybrid Transactional Memory (StAdHyTM) that outperforms not only existing Hardware TM (HTM) and Software TM (STM) implementations but also common synchronization schemes such as locks. StAdHyTM is statically tuned to adapt to application behavior to improve performance. We focus in particular on large parallel graph applications. Our StAdHyTM implementation outperforms coarse-grain locks by up to 8.1× and STM by up to 2.6× in total execution time for the computation kernel of the SSCA-2 benchmark. It also outperforms HTM by up to 2.1× on a 28-core, 64 GB machine. We tested large graphs of up to 268 million vertices and 2.147 billion edges on a 64-core, 128 GB machine. To the best of our knowledge, this work is the first scalability study of synchronization involving all the TM implementations: HTM, STM, HyTM, and Adaptive HyTM.
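
For context, the sketch below shows the generic hybrid-TM pattern that such systems build on: attempt a hardware transaction first (Intel RTM intrinsics) and fall back to a global lock after repeated aborts. This is not StAdHyTM itself, whose static per-application tuning is the paper's contribution; the retry threshold is an assumed knob, and building it requires RTM-capable hardware and a compiler flag such as -mrtm.

```cpp
// Minimal sketch of HTM with a lock-based software fallback.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_locked{false};
constexpr int MAX_HTM_RETRIES = 4;   // assumed tuning knob

void lock_fallback()   { while (fallback_locked.exchange(true)) { /* spin */ } }
void unlock_fallback() { fallback_locked.store(false); }

template <typename F>
void atomic_region(F&& critical_section) {
    for (int attempt = 0; attempt < MAX_HTM_RETRIES; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Subscribe to the fallback lock: if a lock holder appears,
            // this transaction aborts, preserving mutual exclusion.
            if (fallback_locked.load())
                _xabort(0xff);
            critical_section();
            _xend();
            return;
        }
        // Transaction aborted in hardware; retry or fall through to the lock.
    }
    lock_fallback();                  // software fallback path
    critical_section();
    unlock_fallback();
}
```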


9th International Conference on Partitioned Global Address Space Programming Models | 2015

PHLAME: Hierarchical Locality Exploitation Using the PGAS Model

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this work, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime system that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This paper presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale to 1024 cores and achieve performance gains of up to 88%.


International Conference on Parallel and Distributed Systems | 2014

Where should the threads go? Leveraging hierarchical data locality to solve the thread affinity dilemma

Ahmad Anbar; Abdel-Hameed A. Badawy; Olivier Serres; Tarek A. El-Ghazawi

We propose a novel framework that enhances locality-aware parallel programming models by defining a hierarchical data locality model extension. We also propose two hierarchical thread partitioning algorithms. These algorithms synthesize hierarchical thread placement layouts that aim to minimize the program's overall communication costs. We demonstrate the effectiveness of our approach using the NAS Parallel Benchmarks implemented in Unified Parallel C (UPC) on a modified Berkeley UPC compiler and runtime system. We achieved performance gains of up to 88% by applying the placement layouts our algorithms suggest.
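
To illustrate the objective such placement algorithms optimize, the sketch below (assumptions mine, not the paper's algorithms) scores two thread-to-core layouts against a two-level hierarchy in which intra-socket traffic is cheaper than cross-socket traffic, using a hypothetical thread-to-thread communication matrix.

```cpp
// Minimal sketch: communication cost of a thread placement on a
// two-socket machine. Volumes, weights, and topology are assumed.
#include <cstdio>
#include <vector>

constexpr int THREADS = 4;
constexpr int CORES_PER_SOCKET = 2;   // assumed two-socket machine

// Hypothetical communication volumes (e.g., bytes) between thread pairs.
const double comm[THREADS][THREADS] = {
    {0, 9, 1, 1},
    {9, 0, 1, 1},
    {1, 1, 0, 9},
    {1, 1, 9, 0},
};

double placement_cost(const std::vector<int>& core_of) {
    double cost = 0;
    for (int i = 0; i < THREADS; ++i)
        for (int j = i + 1; j < THREADS; ++j) {
            bool same_socket =
                core_of[i] / CORES_PER_SOCKET == core_of[j] / CORES_PER_SOCKET;
            cost += comm[i][j] * (same_socket ? 1.0 : 4.0);  // assumed weights
        }
    return cost;
}

int main() {
    // Round-robin splits the chatty pairs (0,1) and (2,3) across sockets;
    // the hierarchy-aware layout keeps each chatty pair on one socket.
    std::vector<int> round_robin     = {0, 2, 1, 3};
    std::vector<int> hierarchy_aware = {0, 1, 2, 3};
    std::printf("round-robin cost:     %.1f\n", placement_cost(round_robin));
    std::printf("hierarchy-aware cost: %.1f\n", placement_cost(hierarchy_aware));
    return 0;
}
```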

Collaboration


Dive into Abdel-Hameed A. Badawy's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi, George Washington University
Jeanine Cook, Sandia National Laboratories
Vikram K. Narayana, George Washington University
Shuai Sun, George Washington University
Volker J. Sorger, George Washington University
Olivier Serres, George Washington University
Ahmad Anbar, George Washington University
Mohammad Qayum, New Mexico State University