
Publications


Featured research published by Abdullah Kayi.


International Parallel and Distributed Processing Symposium | 2007

Experimental Evaluation of Emerging Multi-core Architectures

Abdullah Kayi; Yiyi Yao; Tarek A. El-Ghazawi; Gregory B. Newby

The trend of increasing speed and complexity in single-core processors, as described by Moore's law, is facing practical challenges. As a result, the multi-core processor has emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that must be addressed to achieve the best performance, so a new set of benchmarking techniques for studying the impact of multi-core technologies is necessary. In this paper, multi-core-specific performance metrics for cache coherency and memory bandwidth/latency/contention are investigated. The study also proposes a new benchmarking suite that extends cases from the High-Performance Computing Challenge (HPCC) benchmark suite. Performance results are measured on a Sun Fire T1000 server with six cores and a dual-core AMD Opteron system. The experimental analysis and observations in this paper provide a better understanding of emerging multi-core architectures.


Simulation Modelling Practice and Theory | 2009

Performance issues in emerging homogeneous multi-core architectures

Abdullah Kayi; Tarek A. El-Ghazawi; Gregory B. Newby

Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that must be addressed to achieve the best performance, so benchmarking of these processors is necessary to identify possible performance issues. In this paper, a broad range of homogeneous multi-core architectures is investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from the High-Performance Computing Challenge (HPCC) suite, the NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from the UltraSPARC and x86 architectures, including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. The effect of multi-core architectures on cluster performance is also examined using a Clovertown-based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. The experimental analysis and observations in this study provide a better understanding of emerging homogeneous multi-core systems.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Comparing Runtime Systems with Exascale Ambitions Using the Parallel Research Kernels

Rob F. Van der Wijngaart; Abdullah Kayi; Jeff R. Hammond; Gabriele Jost; Tom St. John; Srinivas Sridharan; Timothy G. Mattson; John Abercrombie; Jacob Nelson

We use three Parallel Research Kernels to compare the performance of a set of programming models. (We employ the term programming model as it is commonly used in the application community. A more accurate term is programming environment: the collective of the abstract programming model, its embodiment in an Application Programmer Interface (API), and the runtime that implements it.) The models are: MPI1 (MPI two-sided communication), MPIOPENMP (MPI+OpenMP), MPISHM (MPI1 with MPI-3 interprocess shared memory), MPIRMA (MPI one-sided communication), SHMEM, UPC, Charm++, and Grappa. The kernels in our study, Stencil, Synch_p2p, and Transpose, underlie a wide range of computational science applications. They enable direct probing of properties of programming models, especially communication and synchronization. In contrast to mini- or proxy applications, the PRKs allow for rapid implementation, measurement, and verification. Our experimental results show MPISHM to be the overall winner, with MPI1, MPIOPENMP, and SHMEM also performing well. MPISHM and MPIOPENMP outperform the other models in the strong-scaling limit due to their effective use of shared memory and good granularity control. The non-evolutionary models Grappa and Charm++ are not competitive with the traditional models (MPI and PGAS) for two of the kernels; these models favor irregular algorithms, while the PRKs considered here are regular.


Computational Science and Engineering | 2008

Application Performance Tuning for Clusters with ccNUMA Nodes

Abdullah Kayi; Edward Kornkven; Tarek A. El-Ghazawi; Gregory B. Newby

With the increasing trend of putting more cores inside a single chip, more clusters are adopting multicore multiprocessor nodes for high-performance computing (HPC). Cache-coherent non-uniform memory access (ccNUMA) architectures are becoming an increasingly popular choice for such systems. In this paper, application performance analysis is provided using a 2312-core Opteron system based on Sun Fire servers. Performance bottlenecks are identified and some potential solutions are proposed. With the proposed performance tuning, up to 30% application performance improvement was observed. In addition, the experimental analysis provided can be used by HPC application developers to better understand clusters with ccNUMA nodes, and as a guideline for using such architectures for scientific computing.


High Performance Computing and Communications | 2008

Performance Evaluation of Clusters with ccNUMA Nodes - A Case Study

Abdullah Kayi; Edward Kornkven; Tarek A. El-Ghazawi; Samy Al-Bahra; Gregory B. Newby

In the quest for higher performance, and with the increasing availability of multicore chips, many systems are currently packing more processors per node. Adopting a ccNUMA node architecture in these cases promises a balance between cost and performance. In this paper, a 2312-core Opteron system based on Sun Fire servers is considered as a case study to examine the performance issues associated with such architectures. We characterize the performance behavior of the system, focusing on the node level under different configurations. It is shown that the benefits of larger nodes can be severely limited for many reasons; these were isolated and the associated performance losses assessed. The results revealed that the problems were mainly caused by topological imbalances, limitations of the cache coherence protocol used, the distribution of operating system services, and the lack of intelligent management of memory affinity.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Abdullah Kayi; Tarek A. El-Ghazawi

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism when accessing shared memory to convert shared addresses into physical addresses. This mechanism is already expensive in distributed memory environments, but it becomes a major bottleneck on machines with shared memory support, where access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead can easily become the dominant factor in overall data access time on such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting the multi-dimensional arrays used in most scientific and image processing applications. Relative costs and implementation details for UPC are evaluated with different workloads (matrix multiplication, the Random Access benchmark, and Sobel edge detection) on two platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from the programmer. Accordingly, this improves UPC productivity, as it reduces the manual optimization effort required to minimize address translation overhead.


Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies | 2010

An adaptive cache coherence protocol for chip multiprocessors

Abdullah Kayi; Tarek A. El-Ghazawi

Multi-core architectures, also referred to as chip multiprocessors (CMPs), have emerged as the dominant architecture for both desktop and high-performance systems. CMPs introduce many challenges that must be addressed to achieve the best performance. One major challenge of the shared-memory model in such architectures is cache coherence overhead. Contemporary architectures employ write-invalidate-based protocols, which are known to generate coherence misses that lead to latency issues. Write-update-based protocols, on the other hand, avoid those coherence misses but tend to generate excessive network traffic, which is especially undesirable for CMPs. Previous studies have shown that a single-protocol approach is not sufficient for many sharing patterns. As a solution, this paper evaluates an adaptive protocol with write-update optimizations targeting producer-consumer sharing patterns. This work takes a minimalistic hardware extension approach to test the benefits of such adaptive protocols in a practical environment. The experimental study is conducted on a 16-core CMP using a full-system simulator with selected scientific applications from the SPLASH-2 and NAS parallel benchmark suites. Results show up to a 40% reduction in coherence misses, which corresponds to a 15% application speedup.


IEEE Transactions on Computers | 2015

Adaptive Cache Coherence Mechanisms with Producer–Consumer Sharing Optimization for Chip Multiprocessors

Abdullah Kayi; Olivier Serres; Tarek A. El-Ghazawi

In chip multiprocessors (CMPs), maintaining cache coherence can account for a major performance overhead. The write-invalidate protocols adopted by most CMPs generate high cache-to-cache miss rates under producer-consumer sharing patterns. Accordingly, this paper presents three cache coherence mechanisms optimized for CMPs. First, to reduce the coherence misses observed in write-invalidate-based protocols, we propose a dynamic write-update mechanism augmented on top of a write-invalidate protocol; it is triggered specifically on detection of a producer-consumer sharing pattern. Second, we extend this adaptive protocol with a bandwidth-adaptive mechanism that eliminates the performance degradation caused by write-updates under limited bandwidth. Finally, a proximity-aware mechanism extends the base adaptive protocol with latency-based optimizations. Experimental analysis is conducted on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. The proposed mechanisms were shown to reduce coherence misses by up to 48% and, in return, speed up application performance by up to 30%. The bandwidth-adaptive mechanism was shown to perform well under varying levels of available bandwidth. Our proximity-aware extension demonstrated up to a 6% performance gain over the base adaptive protocol for 64-core tiled CMP runs. In addition, the analytical model provided good estimates of the performance gains from our adaptive protocols.


High Performance Computing and Communications | 2014

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study

Olivier Serres; Abdullah Kayi; Ahmad Anbar; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) programming model strikes a balance between the locality-aware but explicit message-passing model (e.g., MPI) and the easy-to-use but locality-agnostic shared memory model (e.g., OpenMP). However, the rich PGAS memory model comes at a performance cost that can hinder its potential for scalability and performance. To contain this overhead and achieve full performance, compiler optimizations may not be sufficient, and manual optimizations are typically added; this, however, can severely limit the productivity advantage. Such optimizations usually target the address translation overhead of shared data structures. This paper proposes hardware architectural support for PGAS, which allows the processor to handle shared addresses efficiently. This eliminates the need for such hand-tuning while maintaining the performance and productivity of PGAS languages. We propose to expose this hardware support to compilers by introducing new instructions for efficiently accessing and traversing the PGAS memory space. A prototype compiler is realized by extending the Berkeley Unified Parallel C (UPC) compiler; it allows unmodified code to use the new instructions without user intervention, thereby creating a truly productive programming environment. Two implementations of the system are realized: the first uses the full-system simulator Gem5, which allows evaluation of the performance gain; the second uses the Leon3 soft-core processor on an FPGA to verify implementability and to parameterize the cost of the new hardware and its instructions. The new instructions show promising results for the NAS Parallel Benchmarks implemented in UPC. A speedup of up to 5.5x is demonstrated for unmodified code, whose performance with this hardware was also shown to surpass that of manually optimized code by up to 10%.


Computing Frontiers | 2014

Hardware support for address mapping in PGAS languages: a UPC case study

Olivier Serres; Abdullah Kayi; Ahmad Anbar; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) programming model strikes a balance between the explicit, locality-aware message-passing model and the locality-agnostic but easy-to-use shared memory model (e.g., OpenMP). However, the PGAS memory model comes at a performance cost that limits both scalability and performance. Compiler optimizations are often not sufficient, and the manual optimizations then needed considerably limit the productivity advantage. This paper proposes hardware architectural support for PGAS, which allows the processor to efficiently handle shared addresses through new instructions. A prototype compiler is realized that allows unmodified code to use this support, preserving the PGAS productivity advantage. Speedups of up to 5.5x are demonstrated on the unmodified NAS Parallel Benchmarks using the Gem5 full-system simulator.

Collaboration


Top co-authors of Abdullah Kayi:

Tarek A. El-Ghazawi (George Washington University)
Olivier Serres (George Washington University)
Gregory B. Newby (University of Alaska Fairbanks)
Ahmad Anbar (George Washington University)
Samy Al-Bahra (George Washington University)
Jacob Nelson (University of Washington)