
Publication


Featured research published by Ahmad Anbar.


IEEE Aerospace Conference | 2011

Experiences with UPC on TILE-64 processor

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Tarek A. El-Ghazawi

The partitioned global address space (PGAS) programming model presents programmers with a globally shared address space featuring locality awareness and one-sided communication constructs. The shared address space and the one-sided communication constructs enhance the ease of use of PGAS-based languages, while the locality awareness enables programmers and runtime systems to achieve higher performance. Thus, the PGAS programming model may help address the escalating software complexity resulting from the proliferation of many-core processor architectures in aerospace and computing systems in general. This paper presents our experiences with Unified Parallel C (UPC), a PGAS language, on the Tile64™ processor, a 64-core processor from Tilera Corporation. We ported the Berkeley UPC compiler and runtime system to the Tilera architecture and evaluated two separate runtime implementation conduits of the underlying GASNet communication library: a pthreads-based conduit and an MPI-based conduit. Each conduit uses different on-chip, inter-core communication networks, providing different latencies and bandwidths for inter-process communication. The paper presents the implementation details and empirical analyses of both approaches by comparing and evaluating results from the NAS Parallel Benchmark suite. The analyses reveal various optimization opportunities based on specific many-core architectural features, which are also discussed in the paper.
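The locality awareness the abstract describes comes from the fact that every element of a UPC shared array has a fixed affinity to one thread. As a minimal sketch (in Python rather than UPC, with an assumed thread count of 4), the owner of element i in a default-layout `shared int a[N]` follows a cyclic distribution:

```python
THREADS = 4  # assumed thread count, for illustration only

def owner(i, threads=THREADS):
    """Default UPC 'shared int a[N]' layout is cyclic:
    element i has affinity to thread i % THREADS."""
    return i % threads

# Elements 0..7 distribute round-robin across the 4 threads:
print([owner(i) for i in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

A thread accessing elements it owns pays only local-memory latency; accessing other elements goes through the runtime's communication layer, which is why conduit latency matters.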


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Abdullah Kayi; Tarek A. El-Ghazawi

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism when accessing shared memory to convert shared addresses to physical addresses. This mechanism is already expensive in distributed memory environments, but it becomes a major bottleneck on machines with shared memory support, where access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead can easily become the dominant factor in the overall data access time on such architectures. To alleviate the address translation overhead, this paper introduces a new mechanism targeting the multi-dimensional arrays used in most scientific and image processing applications. Relative costs and implementation details for UPC are evaluated with different workloads (matrix multiplication, the Random Access benchmark, and Sobel edge detection) on two different platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from the programmer. Accordingly, this improves UPC productivity, as it reduces the manual optimization effort required to minimize the address translation overhead.
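The translation this abstract targets can be illustrated with UPC's one-dimensional blocked-cyclic layout `shared [B] T a[N]`: on every shared access, a flat index must be converted into an owning thread and a local offset. A minimal Python sketch of that arithmetic (illustrative only; a real runtime also folds in base addresses and pointer phase, which is part of why the overhead is significant):

```python
def translate(i, block, threads):
    """Map a flat shared index to (owning thread, local offset)
    for a UPC blocked-cyclic layout 'shared [block] T a[N]'."""
    thread = (i // block) % threads          # blocks cycle over threads
    local = (i // (block * threads)) * block + i % block
    return thread, local

# With block=2 and 4 threads, index 9 sits in block 4 (elements 8-9);
# blocks cycle over threads, so block 4 lands back on thread 0 as its
# second local block, giving local offset 3.
print(translate(9, 2, 4))  # (0, 3)
```

Performing this divide/modulo chain on every array reference is exactly the per-access cost that compile-time specialization for multi-dimensional arrays can remove.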


International Parallel and Distributed Processing Symposium | 2016

PGAS Access Overhead Characterization in Chapel

Engin Kayraklioglu; Olivier Serres; Ahmad Anbar; Hashem Elezabi; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) model increases programmer productivity by presenting a flat memory space with locality awareness. However, the abstract representation of memory incurs overheads, especially when global data is accessed. As a PGAS programming language, Chapel provides language constructs to alleviate such overheads. In this work, we examine such optimizations on a set of benchmarks using multiple locales and quantitatively analyze their impact on programmer productivity. The optimization methods that we study achieved improvements over non-optimized versions ranging from 1.1 to 68.1 times, depending on benchmark characteristics.


ACM Transactions on Architecture and Code Optimization | 2016

Exploiting Hierarchical Locality in Deep Parallel Architectures

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers beyond the first level burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model that abstracts and exploits a machine's hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.


International Green Computing Conference Proceedings | 2013

Granular CPU power measurement for SMP clusters

David K. Newsom; Sardar F. Azari; Ahmad Anbar; Tarek A. El-Ghazawi

One of the key challenges in optimizing CPU power consumption at the program level is the difficulty of precisely measuring the power consumption of the CPU (as distinct from other system components) during the various phases of a program's execution. This paper presents a scalable CPU power measurement framework with its associated reporting and data acquisition components that can be used to accurately measure program-related CPU power consumption on symmetric multiprocessor (SMP) cluster systems.


9th International Conference on Partitioned Global Address Space Programming Models | 2015

PHLAME: Hierarchical Locality Exploitation Using the PGAS Model

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers beyond the first level burden the programmer and hinder productivity. In this work, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model that abstracts and exploits a machine's hierarchical properties through locality-aware programming and a runtime system that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This paper presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale to 1024 cores and achieve performance gains of up to 88%.


International Conference on Parallel and Distributed Systems | 2014

Where should the threads go? Leveraging hierarchical data locality to solve the thread affinity dilemma

Ahmad Anbar; Abdel-Hameed A. Badawy; Olivier Serres; Tarek A. El-Ghazawi

We propose a novel framework that improves locality-aware parallel programming models by defining a hierarchical data locality model extension. We also propose two hierarchical thread partitioning algorithms, which synthesize hierarchical thread placement layouts that minimize a program's overall communication costs. We demonstrate the effectiveness of our approach using the NAS Parallel Benchmarks implemented in Unified Parallel C (UPC) with a modified Berkeley UPC compiler and runtime system. We achieved performance gains of up to 88% by applying the placement layouts our algorithms suggest.
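To illustrate the flavor of such placement algorithms (not the paper's actual algorithms), a greedy heuristic can co-locate the most heavily communicating thread pairs on the same node, given a measured inter-thread communication matrix. A Python sketch, assuming an even thread count and nodes of size two:

```python
import itertools

def greedy_pairing(comm):
    """Greedily place the pair of unplaced threads with the heaviest
    mutual traffic onto the same node (node size 2, even thread count).
    'comm' is a symmetric matrix of inter-thread communication volume."""
    unplaced = set(range(len(comm)))
    nodes = []
    while unplaced:
        a, b = max(itertools.combinations(sorted(unplaced), 2),
                   key=lambda p: comm[p[0]][p[1]])
        nodes.append((a, b))
        unplaced -= {a, b}
    return nodes

# Threads 0 and 2 exchange the most data, then 1 and 3:
comm = [[0, 1, 9, 1],
        [1, 0, 1, 8],
        [9, 1, 0, 1],
        [1, 8, 1, 0]]
print(greedy_pairing(comm))  # [(0, 2), (1, 3)]
```

A hierarchical variant would apply the same idea recursively at each machine level (socket, node, rack), which is the general shape of hierarchy-aware placement.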


High Performance Computing and Communications | 2014

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study

Olivier Serres; Abdullah Kayi; Ahmad Anbar; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) programming model strikes a balance between the locality-aware, but explicit, message-passing model (e.g. MPI) and the easy-to-use, but locality-agnostic, shared memory model (e.g. OpenMP). However, the rich PGAS memory model comes at a performance cost that can hinder its potential for scalability and performance. To contain this overhead and achieve full performance, compiler optimizations may not be sufficient, and manual optimizations are typically added. This, however, can severely limit the productivity advantage. Such optimizations are usually targeted at reducing address translation overheads for shared data structures. This paper proposes hardware architectural support for PGAS, which allows the processor to handle shared addresses efficiently. This eliminates the need for such hand-tuning while maintaining the performance and productivity of PGAS languages. We expose this hardware support to compilers by introducing new instructions to efficiently access and traverse the PGAS memory space. A prototype compiler is realized by extending the Berkeley Unified Parallel C (UPC) compiler. It allows unmodified code to use the new instructions without user intervention, thereby creating a truly productive programming environment. Two different implementations of the system are realized: the first uses the full-system simulator Gem5, which allows evaluation of the performance gain; the second uses the Leon3 soft-core processor on an FPGA to verify implementability and to parameterize the cost of the new hardware and its instructions. The new instructions show promising results for the NAS Parallel Benchmarks implemented in UPC. A speedup of up to 5.5x is demonstrated for unmodified codes. Unmodified code using this hardware was also shown to surpass the performance of manually optimized code by up to 10%.
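The addresses such hardware must handle are PGAS "fat" pointers-to-shared, which carry a thread identifier, a phase, and a local address. A hypothetical Python sketch of packing these fields into one 64-bit word (the field widths here are assumptions for illustration; actual Berkeley UPC pointer representations vary by build configuration):

```python
# Assumed field widths (10 + 22 + 32 = 64 bits); illustrative only.
THREAD_BITS, PHASE_BITS, ADDR_BITS = 10, 22, 32

def pack(thread, phase, addr):
    """Pack a PGAS pointer-to-shared (thread, phase, address)
    into a single 64-bit word."""
    return (thread << (PHASE_BITS + ADDR_BITS)) | (phase << ADDR_BITS) | addr

def unpack(p):
    """Recover the (thread, phase, address) fields from the packed word."""
    return ((p >> (PHASE_BITS + ADDR_BITS)) & ((1 << THREAD_BITS) - 1),
            (p >> ADDR_BITS) & ((1 << PHASE_BITS) - 1),
            p & ((1 << ADDR_BITS) - 1))

print(unpack(pack(3, 5, 0xBEEF)))  # (3, 5, 48879)
```

Extracting and updating these fields in software on every access is the overhead the paper's proposed instructions move into hardware.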


Computing Frontiers | 2014

Hardware support for address mapping in PGAS languages: a UPC case study

Olivier Serres; Abdullah Kayi; Ahmad Anbar; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) programming model strikes a balance between the explicit, locality-aware message-passing model and the locality-agnostic, but easy-to-use, shared memory model (e.g. OpenMP). However, the PGAS memory model comes at a performance cost that limits both scalability and performance. Compiler optimizations are often not sufficient, and manual optimizations are needed, which considerably limits the productivity advantage. This paper proposes hardware architectural support for PGAS, which allows the processor to efficiently handle shared addresses through new instructions. A prototype compiler is realized that allows unmodified code to use the support, preserving the PGAS productivity advantage. Speedups of up to 5.5x are demonstrated on the unmodified NAS Parallel Benchmarks using the Gem5 full-system simulator.


ACS International Conference on Computer Systems and Applications (AICCSA) | 2013

Predictive energy management techniques for PGAS programming

David K. Newsom; Sardar F. Azari; Ahmad Anbar; Tarek A. El-Ghazawi

Power consumption increasingly presents an upper bound on sustainable large-scale computing performance and reliability. The Partitioned Global Address Space (PGAS) programming model is a family of parallel programming paradigms that provide a global address space for ease of use while offering locality awareness for efficient execution. Very little exploration has been done to determine the potential of PGAS programming models to improve scalable, energy-efficient computation for high performance computing (HPC) clusters. This paper examines features of the PGAS programming model that may support predictively reducing power consumption in distributed clusters via dynamic voltage and frequency scaling (DVFS). These concepts are tested with Unified Parallel C (UPC) codes running on a cluster of commodity PCs instrumented to measure power at the CPU socket level. We have also explored approaches to automating these power optimization techniques at compile time. Benchmarking results show a tangible reduction in power consumption without impacting the overall execution time of the program.
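The intuition behind applying DVFS in communication-bound phases can be sketched with the simplified CMOS dynamic power model P ≈ C·V²·f: lowering frequency and voltage during a phase bounded by the network reduces power with little effect on that phase's wall time. A Python sketch with assumed, purely illustrative operating points:

```python
def dynamic_power(freq_ghz, volt, c=1.0):
    """Simplified CMOS dynamic power model: P ~ C * V^2 * f.
    'c' is an arbitrary effective-capacitance constant."""
    return c * volt ** 2 * freq_ghz

# During a communication-bound phase, scaling an assumed 2.4 GHz / 1.2 V
# operating point down to 1.6 GHz / 1.0 V cuts dynamic power by roughly
# half, while the phase's duration (bounded by the network, not the CPU)
# stays approximately constant.
p_high = dynamic_power(2.4, 1.2)
p_low = dynamic_power(1.6, 1.0)
print(round(1 - p_low / p_high, 2))  # 0.54 -> about 54% dynamic power saved
```

This is why predicting communication phases, as the paper explores, is the key enabler: scaling down during compute-bound phases would instead stretch execution time.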

Collaboration


Dive into Ahmad Anbar's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi (George Washington University)
Olivier Serres (George Washington University)
David K. Newsom (George Washington University)
Engin Kayraklioglu (George Washington University)
Sardar F. Azari (George Washington University)
Abdullah Kayi (George Washington University)
Edmond J. Golden (National Institute of Standards and Technology)