Engin Kayraklioglu
George Washington University
Publications
Featured research published by Engin Kayraklioglu.
International Parallel and Distributed Processing Symposium | 2016
Engin Kayraklioglu; Olivier Serres; Ahmad Anbar; Hashem Elezabi; Tarek A. El-Ghazawi
The Partitioned Global Address Space (PGAS) model increases programmer productivity by presenting a flat memory space with locality awareness. However, the abstract representation of memory incurs overheads, especially when global data is accessed. As a PGAS programming language, Chapel provides language constructs to alleviate such overheads. In this work, we examine such optimizations on a set of benchmarks using multiple locales and quantitatively analyze their impact on programmer productivity. The optimization methods that we study achieved improvements over non-optimized versions ranging from 1.1 to 68.1 times, depending on benchmark characteristics.
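A hypothetical sketch (not the paper's code) of the kind of optimization being quantified: replacing many fine-grained "remote" element accesses with a one-time bulk localization. The `RemoteArray` class and its access counter are illustrative assumptions.

```python
# Model the communication cost of element-wise remote access vs. bulk localization.
class RemoteArray:
    """Counts conceptually 'remote' operations on a distributed array."""
    def __init__(self, data):
        self.data = list(data)
        self.remote_ops = 0

    def __getitem__(self, i):
        self.remote_ops += 1  # each element access pays a communication cost
        return self.data[i]

    def localize(self):
        self.remote_ops += 1  # one bulk transfer instead of many small ones
        return list(self.data)

arr = RemoteArray(range(1000))

# Naive: one remote operation per element.
naive_sum = sum(arr[i] for i in range(1000))
naive_cost = arr.remote_ops

# Optimized: localize once, then compute on the local copy.
arr.remote_ops = 0
local = arr.localize()
opt_sum = sum(local)
opt_cost = arr.remote_ops

print(naive_cost, opt_cost)  # 1000 1
```

The paper's point is that such rewrites pay off in performance but cost programmer effort, which it measures quantitatively.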
ACM Transactions on Architecture and Code Optimization | 2016
Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi
Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers beyond the first level burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model that abstracts and exploits a machine's hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015
Engin Kayraklioglu; Tarek A. El-Ghazawi
The Partitioned Global Address Space (PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers from handling the details of data movement in a distributed memory environment by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires address mapping to the system's virtual address space, and this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyze the extent of this overhead by implementing a microbenchmark that tests the different types of memory accesses that can be observed in Chapel. We show that, as locality gets exploited, speedup gains of up to 35x can be achieved. This was demonstrated through hand tuning, however; more productive means should be provided to deliver such performance improvements without excessively burdening programmers. Therefore, we also discuss possibilities for increasing Chapel's performance through standard libraries, compiler, runtime, and/or hardware support to handle different types of memory accesses more efficiently.
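To illustrate the address-mapping step the abstract refers to, here is a minimal sketch (an assumption for illustration, not Chapel's implementation) of resolving a flat global index to an owning locale and local offset under a block distribution. Every logical access in such a space must first answer this ownership question.

```python
# Map a global index to (locale, local_index) under a block distribution.
def block_owner(i, n, num_locales):
    """Return (locale, local_index) for global index i when n elements
    are block-distributed over num_locales locales."""
    block = -(-n // num_locales)  # ceiling division: elements per locale
    return i // block, i % block

n, locales = 100, 4
print(block_owner(0, n, locales))   # (0, 0): first element lives on locale 0
print(block_owner(99, n, locales))  # (3, 24): last element lives on locale 3
```

The per-access cost of this translation, multiplied across a benchmark's inner loops, is the overhead the paper's microbenchmark isolates.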
International Parallel and Distributed Processing Symposium | 2017
Engin Kayraklioglu; Wo Chang; Tarek A. El-Ghazawi
Chapel is an emerging, scalable, productive parallel programming language. In this work, we analyze Chapel's performance using the Parallel Research Kernels (PRK) on two different manycore architectures, including a state-of-the-art Intel Knights Landing processor. We discuss implementation techniques in Chapel and their relation to the OpenMP implementations of the PRK. We also suggest and prototype several optimizations in different layers of the software stack, including the Chapel compiler. In our experiments, we observed that the base performance of Chapel ranges from 41% to 184% of that of OpenMP. The optimization techniques we discuss show performance improvements ranging from 1.4x to 2x in Chapel.
2015 9th International Conference on Partitioned Global Address Space Programming Models | 2015
Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi
Parallel computers are becoming deeply hierarchical. Locality aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this work, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime system that takes into account machine characteristics, data sharing and communication profile of the underlying application. This paper presents and experiments with concepts and techniques that can drive such runtime system in support of PHLAME. Our experiments show that our techniques scale to 1024 cores and achieve performance gains of up to 88%.
Dependable, Autonomic and Secure Computing | 2015
Engin Kayraklioglu; Tarek A. El-Ghazawi; Zeki Bozkus
The NEural Simulation Tool (NEST) is a large-scale spiking neuronal network simulator of the brain. In this work, we present a CUDA implementation of NEST. We were able to gain a speedup factor of 20 for the computational parts of NEST's execution by using a different data structure than NEST's default. Our partial implementation shows the potential gains and limitations of such a port. We discuss possible novel approaches for adapting generic spiking neural network simulators such as NEST to run on commodity or high-end GPGPUs.
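A hedged illustration (not NEST's code; the field names are made up) of the general kind of data structure change that favors GPUs: converting an array-of-structs neuron container into a struct-of-arrays layout, so that each per-field pass touches contiguous memory.

```python
# Array-of-structs: one dict per neuron (convenient, but scattered per-field access).
neurons_aos = [
    {"v": -65.0, "u": -13.0, "i": 0.0},
    {"v": -64.0, "u": -13.2, "i": 1.5},
]

def to_soa(aos):
    """Struct-of-arrays: one contiguous list per field."""
    return {k: [n[k] for n in aos] for k in aos[0]}

neurons_soa = to_soa(neurons_aos)
print(neurons_soa["v"])  # [-65.0, -64.0]
```

On a GPU, the struct-of-arrays form lets adjacent threads read adjacent elements of the same field (coalesced access), which is the usual motivation for such a restructuring.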
High Performance Computing and Communications | 2014
Ahmad Anbar; Engin Kayraklioglu; Olivier Serres; Tarek El Ghazawi
We propose a novel framework that improves locality-aware parallel programming models by defining a hierarchical data locality model extension. We also propose a hierarchical thread partitioning algorithm, which synthesizes hierarchical thread placement layouts that target minimizing the program's overall communication costs. We demonstrate the effectiveness of our approach using the NAS Parallel Benchmarks implemented in the Unified Parallel C (UPC) language, with a modified Berkeley UPC compiler and runtime system. Applying the placement layouts suggested by our algorithm improved performance by up to 85%.
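A much-simplified sketch (an assumption for illustration, not the paper's algorithm) of why communication-aware placement helps: given a thread-to-thread communication matrix, co-locating the heavily communicating pairs reduces off-node traffic relative to a naive round-robin layout.

```python
# Symmetric communication volume between threads 0..3.
comm = {
    (0, 1): 100, (0, 2): 5, (0, 3): 5,
    (1, 2): 5, (1, 3): 5, (2, 3): 100,
}

def off_node_cost(placement):
    """Sum of volume exchanged between threads placed on different nodes."""
    return sum(v for (a, b), v in comm.items()
               if placement[a] != placement[b])

naive = {0: 0, 1: 1, 2: 0, 3: 1}           # round-robin over 2 nodes
locality_aware = {0: 0, 1: 0, 2: 1, 3: 1}  # heavy pairs co-located

print(off_node_cost(naive), off_node_cost(locality_aware))  # 210 20
```

The paper's algorithm does this hierarchically (nodes, sockets, cores) rather than for a single two-node level as sketched here.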
Computing Frontiers | 2018
Engin Kayraklioglu; Tarek A. El-Ghazawi
Distributed arrays reduce programming effort through implicit communication. However, relying solely on this abstraction causes fine-grained communication and performance overhead. A variety of optimization techniques can be used to mitigate such overheads, but these techniques require a thorough understanding of how distributed arrays are accessed, which can be very challenging in realistic use cases. We present the Access Pattern Analysis Tool (APAT) for distributed arrays. APAT is a framework that can be integrated into a language's software stack to efficiently collect access logs and analyze them. We show that APAT can help discover optimization opportunities that lead to up to 35% improvement.
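A hypothetical sketch in the spirit of APAT (the class and function names are assumptions, not the tool's API): instrument an array to log the indices it serves, then inspect the log for a regular pattern, such as a constant stride, that an optimizer could exploit for communication aggregation.

```python
# Log every index accessed on an array, then infer a constant stride.
class LoggedArray:
    def __init__(self, data):
        self.data = data
        self.log = []

    def __getitem__(self, i):
        self.log.append(i)  # record the access for later analysis
        return self.data[i]

def constant_stride(log):
    """Return the stride if the logged accesses form an arithmetic
    sequence, else None."""
    diffs = {b - a for a, b in zip(log, log[1:])}
    return diffs.pop() if len(diffs) == 1 else None

arr = LoggedArray(list(range(64)))
total = sum(arr[i] for i in range(0, 64, 4))
print(constant_stride(arr.log))  # 4
```

A real implementation sits inside the language runtime and must keep logging overhead low, which is the "efficiently collect" part of the abstract.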
ACM Transactions on Architecture and Code Optimization | 2018
Engin Kayraklioglu; Michael P. Ferguson; Tarek A. El-Ghazawi
Prefetching is a well-known technique for mitigating scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Leveraging PGAS locality awareness, we define a hybrid of the two: locality-aware, productive prefetching support for PGAS. Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
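A sketch of the user-driven prefetching idea (the interface is hypothetical, not Chapel's actual API): the programmer marks a range to prefetch, the runtime fetches it in one bulk transfer, and subsequent element accesses are served locally instead of paying a fine-grained remote read each time.

```python
# Cache a prefetched range so later element accesses avoid remote reads.
class PrefetchingArray:
    def __init__(self, data):
        self.data = list(data)
        self.cache = {}
        self.remote_ops = 0

    def prefetch(self, start, stop):
        self.remote_ops += 1  # one bulk transfer for the whole range
        for i in range(start, stop):
            self.cache[i] = self.data[i]

    def __getitem__(self, i):
        if i in self.cache:
            return self.cache[i]  # served locally, no communication
        self.remote_ops += 1      # fine-grained remote read
        return self.data[i]

arr = PrefetchingArray(range(100))
arr.prefetch(0, 100)              # the single user-inserted hint
s = sum(arr[i] for i in range(100))
print(arr.remote_ops)  # 1
```

The "productive" claim in the abstract is that one such hint replaces the many manual transformations a hand-tuned version would need.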
International Conference on Cluster Computing | 2017
Olivier Serres; Engin Kayraklioglu; Tarek A. El-Ghazawi
Hardware design is an essential part of research in high performance computing. Initial efforts in hardware research consist of analyzing design ideas in a software simulator. This allows chip designers to avoid costly manufacturing and even more time-consuming FPGA designs. Simulating a hardware design involves running many tests that try different configurations. Moreover, hardware simulators generally do not support multi-threaded simulation, which causes major scalability issues as simulated HPC architectures have increasing numbers of cores. In this paper, we present a front-end framework for hardware simulators that allows chip designers to create simulation recipes and run them in parallel; this way, a cluster can easily be used to parallelize hardware simulations. Our framework is implemented in Python 3 and offers features such as running an unlimited number of configurations, cooperating with job managers such as Slurm and SGE, and collecting and parsing results.