
Publications


Featured research published by Engin Kayraklioglu.


International Parallel and Distributed Processing Symposium | 2016

PGAS Access Overhead Characterization in Chapel

Engin Kayraklioglu; Olivier Serres; Ahmad Anbar; Hashem Elezabi; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) model increases programmer productivity by presenting a flat memory space with locality awareness. However, the abstract representation of memory incurs overheads, especially when global data is accessed. As a PGAS programming language, Chapel provides language structures to alleviate such overheads. In this work, we examine such optimizations on a set of benchmarks using multiple locales and quantitatively analyze their impact on programmer productivity. The optimization methods that we study achieved improvements over non-optimized versions ranging from 1.1 to 68.1 times, depending on the benchmark characteristics.
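
The access localization the abstract describes can be sketched as a toy cost model (illustrative Python only, not the paper's code; the latency constants are assumed, and real PGAS costs depend on the interconnect):

```python
# Toy model of why fine-grained remote accesses in a PGAS flat memory
# space are costly, and how localizing (bulk-copying) remote data
# amortizes the overhead. Constants are assumed for illustration.

REMOTE_LATENCY = 100   # cost units per remote access (assumption)
LOCAL_LATENCY = 1      # cost units per local access (assumption)

def naive_sum(remote_block):
    """Access each remote element individually: n remote round trips."""
    cost = len(remote_block) * REMOTE_LATENCY
    return sum(remote_block), cost

def localized_sum(remote_block):
    """Copy the block once, then read locally: one bulk transfer + n local reads."""
    local_copy = list(remote_block)  # one aggregated transfer
    cost = REMOTE_LATENCY + len(local_copy) * LOCAL_LATENCY
    return sum(local_copy), cost

data = list(range(1000))
total_naive, cost_naive = naive_sum(data)
total_local, cost_local = localized_sum(data)
assert total_naive == total_local          # same answer, different cost
print(f"modeled speedup: {cost_naive / cost_local:.1f}x")
```

Under this model the localized version pays one bulk-transfer latency instead of one latency per element, which is the intuition behind the language-level optimizations the paper evaluates.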


ACM Transactions on Architecture and Code Optimization | 2016

Exploiting Hierarchical Locality in Deep Parallel Architectures

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics, as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Assessing Memory Access Performance of Chapel through Synthetic Benchmarks

Engin Kayraklioglu; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling the details of data movement in a distributed memory environment by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system virtual address space, and as such, this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyze the extent of this overhead by implementing a micro-benchmark that tests the different types of memory accesses that can be observed in Chapel. We show that, as locality is exploited, speedup gains of up to 35x can be achieved. This was demonstrated through hand tuning, however; more productive means should be provided to deliver such performance improvements without excessively burdening programmers. Therefore, we also discuss possibilities for increasing Chapel's performance through standard libraries, compiler, runtime, and/or hardware support to handle different types of memory accesses more efficiently.


International Parallel and Distributed Processing Symposium | 2017

Comparative Performance and Optimization of Chapel in Modern Manycore Architectures

Engin Kayraklioglu; Wo Chang; Tarek A. El-Ghazawi

Chapel is an emerging scalable, productive parallel programming language. In this work, we analyze Chapel's performance using the Parallel Research Kernels (PRK) on two different manycore architectures, including a state-of-the-art Intel Knights Landing processor. We discuss implementation techniques in Chapel and their relation to the OpenMP implementations of the PRK. We also suggest and prototype several optimizations in different layers of the software stack, including the Chapel compiler. In our experiments, we observed that the base performance of Chapel ranges from 41% to 184% that of OpenMP. The optimization techniques we discuss show performance improvements ranging from 1.4x to 2x in Chapel.


9th International Conference on Partitioned Global Address Space Programming Models | 2015

PHLAME: Hierarchical Locality Exploitation Using the PGAS Model

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this work, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime system that takes into account machine characteristics, as well as the data sharing and communication profile of the underlying application. This paper presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale to 1024 cores and achieve performance gains of up to 88%.


Dependable, Autonomic and Secure Computing | 2015

Accelerating Brain Simulations on Graphical Processing Units

Engin Kayraklioglu; Tarek A. El-Ghazawi; Zeki Bozkus

The NEural Simulation Tool (NEST) is a large-scale spiking neuronal network simulator of the brain. In this work, we present a CUDA® implementation of NEST. We were able to gain a speedup factor of 20 for the computational parts of NEST execution by using a different data structure than NEST's default. Our partial implementation shows the potential gains and limitations of such a port. We discuss possible novel approaches for adapting generic spiking neural network simulators such as NEST to run on commodity or high-end GPGPUs.
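
The abstract does not say which data structure was substituted; a common GPU-oriented change of this kind is moving from an array-of-structs to a struct-of-arrays layout, so threads read contiguous, coalescable memory. The following is a hypothetical Python illustration of that layout change only, not NEST's internals:

```python
# Array-of-structs: one record per neuron. A GPU thread sweeping the
# membrane potential `v_m` across neurons would touch strided memory.
neurons_aos = [{"v_m": -70.0 + i, "refractory": 0} for i in range(4)]

# Struct-of-arrays: one contiguous array per field. Same data,
# different layout; field names here are illustrative assumptions.
neurons_soa = {
    "v_m": [n["v_m"] for n in neurons_aos],
    "refractory": [n["refractory"] for n in neurons_aos],
}

# Updating all membrane potentials now sweeps one contiguous array,
# which maps naturally onto GPU warps.
neurons_soa["v_m"] = [v + 0.5 for v in neurons_soa["v_m"]]
print(neurons_soa["v_m"])  # [-69.5, -68.5, -67.5, -66.5]
```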


High Performance Computing and Communications | 2014

Leveraging Hierarchical Data Locality in Parallel Programming Models

Ahmad Anbar; Engin Kayraklioglu; Olivier Serres; Tarek El Ghazawi

We propose a novel framework that improves locality-aware parallel programming models by defining a hierarchical data locality model extension. We also propose a hierarchical thread partitioning algorithm that synthesizes hierarchical thread placement layouts targeting a minimization of the program's overall communication costs. We demonstrate the effectiveness of our approach using the NAS Parallel Benchmarks implemented in the Unified Parallel C (UPC) language with a modified Berkeley UPC compiler and runtime system, achieving up to an 85% improvement in performance by applying the placement layout suggested by our algorithm.
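
The idea of communication-minimizing thread placement can be sketched with a simple greedy co-location heuristic (illustrative Python only, not the paper's algorithm; it assumes a node capacity of at least two threads):

```python
def greedy_placement(comm, node_capacity):
    """Greedily co-locate heavily communicating threads.

    comm[i][j] is the message volume between threads i and j.
    Returns a list of nodes, each a list of thread ids.
    """
    n = len(comm)
    # Consider thread pairs in order of descending communication volume.
    pairs = sorted(((comm[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    node_of, nodes = {}, []

    def new_node(threads):
        nodes.append(threads)
        for t in threads:
            node_of[t] = len(nodes) - 1

    for _, i, j in pairs:
        if i not in node_of and j not in node_of:
            new_node([i, j])                       # start a node together
        elif i in node_of and j not in node_of:
            if len(nodes[node_of[i]]) < node_capacity:
                nodes[node_of[i]].append(j)
                node_of[j] = node_of[i]
        elif j in node_of and i not in node_of:
            if len(nodes[node_of[j]]) < node_capacity:
                nodes[node_of[j]].append(i)
                node_of[i] = node_of[j]

    for t in range(n):                             # place any leftovers
        if t not in node_of:
            new_node([t])
    return nodes

# Threads 0-1 and 2-3 communicate heavily, so they land on shared nodes.
comm = [[0, 9, 1, 1],
        [9, 0, 1, 1],
        [1, 1, 0, 8],
        [1, 1, 8, 0]]
print(greedy_placement(comm, 2))  # [[0, 1], [2, 3]]
```

A real layout synthesizer would apply this kind of grouping recursively at each level of the machine hierarchy; the sketch shows only one level.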


Computing Frontiers | 2018

APAT: An Access Pattern Analysis Tool for Distributed Arrays

Engin Kayraklioglu; Tarek A. El-Ghazawi

Distributed arrays reduce programming effort through implicit communication. However, relying solely on this abstraction causes fine-grained communication and performance overhead. A variety of optimization techniques can be used to mitigate such overheads, but these techniques require a thorough understanding of how distributed arrays are accessed, which can be very challenging in realistic use cases. We present the Access Pattern Analysis Tool (APAT) for distributed arrays. APAT is a framework that can be integrated into a language's software stack to efficiently collect access logs and analyze them. We show that APAT can help discover optimization opportunities that lead to up to 35% improvement.
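
The collect-then-analyze idea can be sketched in miniature (toy Python; APAT itself hooks into the language software stack, and the class and function names below are hypothetical):

```python
class LoggedArray:
    """Records every index read so access patterns can be analyzed offline."""

    def __init__(self, data):
        self.data = list(data)
        self.log = []          # the collected access log

    def __getitem__(self, idx):
        self.log.append(idx)   # instrument each access
        return self.data[idx]

def detect_stride(log):
    """Return the constant stride if the logged accesses follow one, else None."""
    strides = {log[k + 1] - log[k] for k in range(len(log) - 1)}
    return strides.pop() if len(strides) == 1 else None

a = LoggedArray(range(100))
total = sum(a[i] for i in range(0, 100, 4))   # strided traversal
print(detect_stride(a.log))  # 4
```

A detected regular stride is exactly the kind of finding that suggests an aggregation or prefetching optimization for a distributed array.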


ACM Transactions on Architecture and Code Optimization | 2018

LAPPS: Locality-Aware Productive Prefetching Support for PGAS

Engin Kayraklioglu; Michael P. Ferguson; Tarek A. El-Ghazawi

Prefetching is a well-known technique for mitigating scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Leveraging PGAS locality awareness, we define a hybrid tradeoff between the two. Specifically, we introduce locality-aware productive prefetching support for PGAS. Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
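
The user-driven tradeoff can be modeled with a small sketch (Python for illustration; LAPPS itself is a Chapel-level feature, and the `prefetch` method and cost constants here are assumptions):

```python
class RemoteArray:
    """Toy remote array with a user-driven bulk prefetch hint."""

    def __init__(self, data, latency=100):
        self.data = list(data)
        self.latency = latency   # per-round-trip remote cost (assumed units)
        self.cache = {}          # locally prefetched elements
        self.cost = 0

    def prefetch(self, lo, hi):
        """One bulk transfer covering [lo, hi): the programmer's locality hint."""
        self.cost += self.latency        # single aggregated round trip
        for i in range(lo, hi):
            self.cache[i] = self.data[i]

    def __getitem__(self, i):
        if i in self.cache:
            self.cost += 1               # local read
            return self.cache[i]
        self.cost += self.latency        # fine-grained remote read
        return self.data[i]

a = RemoteArray(range(1000))
a.prefetch(0, 1000)                      # one-line, user-supplied hint
total = sum(a[i] for i in range(1000))
print(a.cost)  # 1100, versus 100000 without the prefetch
```

The point of the sketch is the division of labor: the programmer states only *what* will be accessed, and the runtime handles the bulk movement, which is the middle ground between fully automatic and fully manual prefetching that the paper targets.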


International Conference on Cluster Computing | 2017

HPC-Oriented Toolchain for Hardware Simulators

Olivier Serres; Engin Kayraklioglu; Tarek A. El-Ghazawi

Hardware design is an essential part of research in high-performance computing. Initial efforts in hardware research consist of analyzing design ideas in a software simulator, which allows chip designers to minimize costly chip manufacturing and to avoid FPGA designs, which are even more time consuming. Simulating a hardware design involves running many tests that try different configurations. Moreover, hardware simulators generally do not support multi-threaded simulation, which causes major scalability issues as simulated HPC architectures have increasing numbers of cores. In this paper, we present a front-end framework for hardware simulators that allows chip designers to create simulation recipes and run them in parallel, so that a cluster can easily be used to parallelize hardware simulations. Our framework is implemented in Python 3 and supports running unlimited configurations, cooperating with job managers such as Slurm and SGE, and collecting and parsing results.
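
The recipe-expansion and parallel-dispatch idea can be sketched as follows (illustrative Python; the function names are hypothetical, not the framework's API, and a real driver would submit jobs to Slurm or SGE rather than spawn processes locally):

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor

# A "recipe": one list of candidate values per simulator parameter.
recipe = {"cores": [1, 2, 4], "l1_kb": [32, 64]}

def expand(recipe):
    """Yield one configuration dict per point in the cross-product."""
    keys = sorted(recipe)
    for values in itertools.product(*(recipe[k] for k in keys)):
        yield dict(zip(keys, values))

def run(cfg):
    """Launch one simulation; `echo` stands in for the simulator binary."""
    args = [f"--{k}={v}" for k, v in sorted(cfg.items())]
    proc = subprocess.run(["echo", "sim"] + args,
                          capture_output=True, text=True)
    return cfg, proc.stdout.strip()

# Run the whole sweep in parallel, as a cluster front-end would.
with ThreadPoolExecutor(max_workers=4) as pool:
    for cfg, out in pool.map(run, expand(recipe)):
        print(cfg, "->", out)
```

The two-parameter recipe above expands to six configurations; in the cluster setting each `run` call would become one batch-scheduler job, with results collected and parsed afterward.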

Collaboration


Dive into Engin Kayraklioglu's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi | George Washington University
Olivier Serres | George Washington University
Ahmad Anbar | George Washington University
Jeff Anderson | George Washington University
Volker J. Sorger | George Washington University
Hashem Elezabi | George Washington University
Tarek El Ghazawi | George Washington University
Vikram K. Narayana | George Washington University