Dounia Khaldi
University of Houston
Publications
Featured research published by Dounia Khaldi.
Parallel Computing | 2015
Dounia Khaldi; Pierre Jouvelot; Corinne Ancourt
BDSC schedules parallel programs in the presence of resource constraints. BDSC-based parallelization relies on static program analyses for cost modeling. BDSC-based parallelization yields significant speedups on parallel architectures. We introduce a new parallelization framework for scientific computing based on BDSC, an efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size. BDSC extends Yang and Gerasoulis's Dominant Sequence Clustering (DSC) algorithm; it uses sophisticated cost models and addresses both shared- and distributed-memory parallel architectures. We describe BDSC, its integration within the PIPS compiler infrastructure and its application to the parallelization of four well-known scientific applications: Harris, ABF, equake and IS. Our experiments suggest that BDSC's focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these four applications running on both OpenMP and MPI frameworks.
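The BDSC implementation itself lives inside PIPS and is not reproduced here. As a rough, hypothetical C sketch of the underlying idea only — priority-driven list scheduling bounded by processor count and local memory, not the authors' algorithm — consider:

```c
/* Hypothetical sketch only: priority-driven list scheduling bounded by
 * processor count and per-processor memory, in the spirit of BDSC.
 * This is NOT the PIPS/BDSC implementation. */
#include <stdio.h>

#define NTASKS 4
#define NPROCS 2
#define MEM_CAP 100                      /* local memory bound per processor */

int cost[NTASKS] = {10, 20, 15, 5};      /* estimated task run times */
int mem[NTASKS]  = {40, 60, 30, 20};     /* estimated memory footprints */
int dep[NTASKS][NTASKS] = {              /* dep[i][j] = 1: i precedes j */
    {0,1,1,0}, {0,0,0,1}, {0,0,0,1}, {0,0,0,0}
};

int blevel(int t) {                      /* longest path from t to exit */
    int best = 0;
    for (int j = 0; j < NTASKS; j++)
        if (dep[t][j]) { int b = blevel(j); if (b > best) best = b; }
    return cost[t] + best;
}

int main(void) {
    int done[NTASKS] = {0}, finish[NTASKS] = {0};
    int ready_at[NPROCS] = {0}, used[NPROCS] = {0};

    for (int n = 0; n < NTASKS; n++) {
        int t = -1;                      /* ready task with highest priority */
        for (int i = 0; i < NTASKS; i++) {
            if (done[i]) continue;
            int ready = 1;
            for (int j = 0; j < NTASKS; j++)
                if (dep[j][i] && !done[j]) ready = 0;
            if (ready && (t < 0 || blevel(i) > blevel(t))) t = i;
        }
        int p = -1;                      /* feasible processor, earliest free */
        for (int q = 0; q < NPROCS; q++) {
            if (used[q] + mem[t] > MEM_CAP) continue;   /* resource bound */
            if (p < 0 || ready_at[q] < ready_at[p]) p = q;
        }
        /* A real scheduler must handle the infeasible case (p == -1). */
        int start = ready_at[p];
        for (int j = 0; j < NTASKS; j++)                /* honor dependences */
            if (dep[j][t] && finish[j] > start) start = finish[j];
        finish[t] = start + cost[t];
        ready_at[p] = finish[t];
        used[p] += mem[t];
        done[t] = 1;
        printf("task %d -> proc %d [%d, %d)\n", t, p, start, finish[t]);
    }
    return 0;
}
```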
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014
Naveen Namashivayam; Sayan Ghosh; Dounia Khaldi; Deepak Eachempati; Barbara M. Chapman
OpenSHMEM is a PGAS library that aims to deliver high performance while retaining portability. Communication operations are a major obstacle to scalable parallel performance and are highly dependent on the target architecture. However, to date there has been no work on how to efficiently support OpenSHMEM running natively on Intel Xeon Phi, a highly parallel, power-efficient and widely used many-core architecture. Given the importance of communication in parallel architectures, this paper describes a novel methodology for optimizing remote-memory accesses for execution of OpenSHMEM programs on Intel Xeon Phi processors. In native mode, we can exploit the Xeon Phi shared memory and convert OpenSHMEM one-sided communication calls into local load/store statements using the shmem_ptr routine. This approach makes it possible for the compiler to perform essential optimizations for Xeon Phi such as vectorization. To the best of our knowledge, this is the first time the impact of shmem_ptr has been analyzed thoroughly on a many-core system. We show the benefits of this approach on the PGAS-Microbenchmarks we specifically developed for this research. Our results exhibit a decrease in latency of one-sided communication operations of up to 60% and an increase in bandwidth of up to 12x. Moreover, we study different reduction algorithms and exploit local load/store to optimize data transfers in these algorithms for Xeon Phi, yielding improvements of up to 22% compared to MVAPICH and up to 60% compared to Intel MPI. Beyond microbenchmarks, experimental results on the NAS IS and SP benchmarks show that performance gains of up to 20x are possible.
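A minimal sketch of the shmem_ptr technique the abstract describes, assuming a shared-memory (native) execution where the remote PE's symmetric heap is directly addressable; the buffer size and target PE are illustrative:

```c
/* When the remote PE shares physical memory (e.g., native Xeon Phi
 * execution), a one-sided put can become a plain local store loop,
 * which the compiler can then vectorize. */
#include <shmem.h>

#define N 1024

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int target = (me + 1) % shmem_n_pes();

    int *buf = shmem_malloc(N * sizeof(int));   /* symmetric allocation */
    int src[N];
    for (int i = 0; i < N; i++) src[i] = me + i;

    int *remote = shmem_ptr(buf, target);       /* NULL if not reachable */
    if (remote != NULL) {
        /* Shared-memory path: ordinary stores, amenable to vectorization. */
        for (int i = 0; i < N; i++)
            remote[i] = src[i];
    } else {
        /* Fallback: regular one-sided put through the library. */
        shmem_int_put(buf, src, N, target);
    }

    shmem_barrier_all();
    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```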
2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) | 2016
Dounia Khaldi; Barbara M. Chapman
In this paper, we introduce a new LLVM analysis, called Bandwidth-Critical Data Analysis (BCDA), to decide when it is beneficial to allocate data in High-Bandwidth Memory (HBM) and then transform allocation calls into specific HBM allocation calls, for increased performance in parallel systems. HBM is a new memory technology that features 3D-stacked chips on processor dies. The well-known SSA-based compilation infrastructure for sequential and parallel languages, LLVM, is used to detect frequently used data and patterns of memory accesses in order to decide at which level to allocate the data: HBM or DDR. The core BCDA analysis counts the number of data uses and detects irregular and simultaneous accesses, generating a priority value for every variable. Using this priority information, LLVM generates memkind_alloc function calls to transform mallocs into HBM allocations if HBM is present and sufficient HBM capacity is available. As a use case for validating our approach, we show how the Conjugate Gradient (CG) benchmark from the NAS Parallel suite can be optimized to use MCDRAM, as the HBM on the Knights Landing Xeon Phi processors is called. We implement BCDA in the LLVM compiler and apply it to CG to detect when it is beneficial to allocate data in the HBM; we then allocate that data in the MCDRAM using hbwmalloc library calls. Using the priority generated by BCDA, we achieved a 2.29× performance improvement using the LLVM compiler and 2.33× using Intel's compiler, compared to the DDR version of CG.
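The following sketch illustrates the kind of transformation BCDA drives, using the hbwmalloc interface (hbw_check_available, hbw_malloc) mentioned in the abstract; the `priority` flag is a hypothetical stand-in for BCDA's computed priority, not the paper's actual code generation:

```c
/* Illustrative sketch: a plain malloc of bandwidth-critical data becomes
 * an HBM (MCDRAM) allocation when high-bandwidth memory is present, with
 * a DDR fallback otherwise. */
#include <stdlib.h>
#include <hbwmalloc.h>

double *alloc_vector(size_t n, int priority) {
    /* hbw_check_available() returns 0 when MCDRAM is usable. */
    if (priority > 0 && hbw_check_available() == 0) {
        double *p = hbw_malloc(n * sizeof(double));   /* MCDRAM */
        if (p) return p;
    }
    return malloc(n * sizeof(double));                /* DDR fallback */
}
/* Callers must release with hbw_free() or free() according to which
 * allocator actually produced the pointer. */
```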
Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC | 2015
Dounia Khaldi; Pierre Jouvelot; François Irigoin; Corinne Ancourt; Barbara M. Chapman
We extend the LLVM intermediate representation (IR) to make it a parallel IR (LLVM PIR), a necessary step for introducing simple and generic parallel code optimization into LLVM. LLVM is a modular compiler infrastructure that can be efficiently and easily used for static analysis, static and dynamic compilation, optimization, and code generation. Being increasingly used to target high-performance computing abstractions and hardware, LLVM will benefit from the ability to handle parallel constructs at the IR level. We use SPIRE, an incremental methodology for designing the intermediate representations of compilers that target parallel programming languages, to design LLVM PIR. Languages and libraries based on the Partitioned Global Address Space (PGAS) programming model have emerged in recent years with a focus on addressing the programming challenges for scalable parallel systems. Among these, OpenSHMEM is a library that is the culmination of a standardization effort among many implementers and users of SHMEM; it provides a means to develop lightweight, portable, scalable applications based on the PGAS programming model. As a use case for validating our LLVM PIR proposal, we show how OpenSHMEM one-sided communications can be optimized via our implementation of PIR in the LLVM compiler; we illustrate two important optimizations for such operations using loop tiling and communication vectorization.
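Communication vectorization, one of the two optimizations mentioned, can be pictured with a small hedged example: a loop of element-wise one-sided puts collapsed into one bulk transfer (the size and target PE are illustrative, not taken from the paper):

```c
/* Before/after view of communication vectorization on OpenSHMEM puts. */
#include <shmem.h>

#define N 4096

int main(void) {
    shmem_init();
    int pe = (shmem_my_pe() + 1) % shmem_n_pes();
    int *dst = shmem_malloc(N * sizeof(int));   /* symmetric destination */
    int src[N];
    for (int i = 0; i < N; i++) src[i] = i;

    /* Before the optimization: one one-sided put per element. */
    for (int i = 0; i < N; i++)
        shmem_int_put(&dst[i], &src[i], 1, pe);

    /* After communication vectorization: one bulk transfer for the
     * whole contiguous region. */
    shmem_int_put(dst, src, N, pe);

    shmem_barrier_all();
    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```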
International Conference on Cluster Computing | 2015
Naveen Namashivayam; Deepak Eachempati; Dounia Khaldi; Barbara M. Chapman
Languages and libraries based on the Partitioned Global Address Space (PGAS) programming model have emerged in recent years with a focus on addressing the programming challenges for scalable parallel systems. Among these, Coarray Fortran (CAF) is unique in that it has been incorporated into an existing standard (Fortran 2008); it is therefore of particular importance that implementations supporting it are both portable and deliver sufficient levels of performance. OpenSHMEM is a library that is the culmination of a standardization effort among many implementers and users of SHMEM, and it provides a means to develop lightweight, portable, scalable applications based on the PGAS programming model. As such, we propose that OpenSHMEM is well situated to serve as a runtime substrate for CAF implementations. In this paper, we demonstrate how OpenSHMEM can be exploited as a runtime layer upon which CAF may be implemented. Specifically, we re-targeted the CAF implementation provided in the OpenUH compiler to OpenSHMEM, and show how parallel language features provided by CAF may be directly mapped to OpenSHMEM, including allocation of remotely accessible objects, one-sided communication, and various types of synchronization. Moreover, we present and evaluate algorithms we developed for implementing remote access of non-contiguous array sections and acquisition and release of remote locks using the OpenSHMEM interface.
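A hypothetical C illustration of the CAF-to-OpenSHMEM mapping, written against the public OpenSHMEM API rather than the OpenUH runtime itself; the corresponding Fortran constructs appear in the comments:

```c
/* Sketch: how CAF language features could map onto OpenSHMEM calls. */
#include <shmem.h>

int main(void) {
    shmem_init();                        /* program start: images launch */
    int me = shmem_my_pe();              /* this_image() - 1 */

    /* integer :: a(100)[*]  ->  remotely accessible (symmetric) object */
    int *a = shmem_malloc(100 * sizeof(int));

    int x[100];
    for (int i = 0; i < 100; i++) x[i] = me;

    /* a(:)[2] = x  ->  one-sided put to image 2 (PE 1) */
    if (shmem_n_pes() > 1)
        shmem_int_put(a, x, 100, 1);

    /* sync all  ->  global barrier */
    shmem_barrier_all();

    shmem_free(a);
    shmem_finalize();
    return 0;
}
```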
International Conference on Cluster Computing | 2015
Dounia Khaldi; Deepak Eachempati; Shiyao Ge; Pierre Jouvelot; Barbara M. Chapman
We describe how two-level memory hierarchies can be exploited to optimize the implementation of teams in the parallel facet of the upcoming Fortran 2015 standard. We focus on reducing the cost associated with moving data within a computing node and between nodes, finding that this distinction is of key importance when looking at performance issues. We introduce a new hardware-aware approach for PGAS, to be used within a runtime system, to optimize communications in the virtual topologies and clusters that bind different teams together. We have applied this methodology, implemented in the OpenUH CAF compiler, to three important collective operations, namely barrier, all-to-all reduction and one-to-all broadcast; OpenUH is thus the first Fortran compiler that both provides teams and handles such a memory hierarchy methodology within teams.
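A sketch of the two-level idea using OpenSHMEM active-set barriers (the paper's implementation lives inside the OpenUH CAF runtime and is not shown here); it assumes PES_PER_NODE consecutive PEs per node and a PE count divisible by that:

```c
/* Hardware-aware, two-level barrier sketch: PEs first synchronize within
 * their node, then one leader per node synchronizes across nodes. */
#include <shmem.h>

#define PES_PER_NODE 4        /* assumed placement; stride 2^2 below */

long psync_node[SHMEM_BARRIER_SYNC_SIZE];
long psync_lead[SHMEM_BARRIER_SYNC_SIZE];

void two_level_barrier(void) {
    int me = shmem_my_pe(), npes = shmem_n_pes();
    int node_base = me - (me % PES_PER_NODE);

    /* Step 1: cheap intra-node barrier over consecutive PEs. */
    shmem_barrier(node_base, 0, PES_PER_NODE, psync_node);

    /* Step 2: leaders (one PE per node) synchronize across nodes. */
    if (me % PES_PER_NODE == 0)
        shmem_barrier(0, 2 /* stride 2^2 = PES_PER_NODE */,
                      npes / PES_PER_NODE, psync_lead);

    /* Step 3: leaders release their node-local PEs. */
    shmem_barrier(node_base, 0, PES_PER_NODE, psync_node);
}

int main(void) {
    shmem_init();
    for (int i = 0; i < SHMEM_BARRIER_SYNC_SIZE; i++)
        psync_node[i] = psync_lead[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all();       /* make pSync initialization visible */
    two_level_barrier();
    shmem_finalize();
    return 0;
}
```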
2015 9th International Conference on Partitioned Global Address Space Programming Models | 2015
Shiyao Ge; Deepak Eachempati; Dounia Khaldi; Barbara M. Chapman
A set of parallel features, broadly referred to as Fortran coarrays, was added to the Fortran 2008 standard. It is expected that several new parallel features, designed to complement or augment this feature set, will be added to the next revision of the standard. This includes statements for forming and changing between image teams, as well as statements for performing communication and synchronization with respect to image teams. In this paper, we describe an early implementation and evaluation of these anticipated features within the OpenUH compiler and its CAF runtime system. We demonstrate the utility of team-based barriers in comparison to the existing sync images statement for performing synchronization among a team of images. We describe techniques for hiding synchronization and incorporating locality awareness into a collectives implementation based on one-sided communication, and we present the impact of these optimizations on allreduce operations across message lengths. Our results show better performance for medium to large messages compared to the corresponding allreduce in Cray's Fortran coarrays implementation. Using the new teams and collectives features, we obtained a 6.2% performance improvement over an original Fortran 2008 version of the CG benchmark from the NAS Parallel Benchmark Suite for class D, when using 1024 images.
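A hedged sketch of a locality-aware allreduce in this spirit, built from standard OpenSHMEM active-set collectives rather than the anticipated Fortran teams features; the node size and PE placement are assumptions:

```c
/* Two-phase, locality-aware allreduce sketch: reduce within each node,
 * combine across node leaders, then broadcast back inside each node.
 * Assumes PES_PER_NODE consecutive PEs per node and a divisible PE count. */
#include <shmem.h>

#define PES_PER_NODE 4                   /* assumed placement */

long src, node_sum, total, result;       /* symmetric (global) data */
long pwrk1[SHMEM_REDUCE_MIN_WRKDATA_SIZE], pwrk2[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
long ps1[SHMEM_REDUCE_SYNC_SIZE], ps2[SHMEM_REDUCE_SYNC_SIZE];
long psb[SHMEM_BCAST_SYNC_SIZE];

int main(void) {
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();
    int base = me - me % PES_PER_NODE;   /* first PE on this node */

    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        ps1[i] = ps2[i] = SHMEM_SYNC_VALUE;
    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        psb[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all();                 /* make pSync initialization visible */

    src = me;

    /* Phase 1: reduce within the node (consecutive PEs, stride 2^0). */
    shmem_long_sum_to_all(&node_sum, &src, 1, base, 0, PES_PER_NODE,
                          pwrk1, ps1);

    /* Phase 2: node leaders reduce across nodes (stride 2^2 = PES_PER_NODE). */
    if (me % PES_PER_NODE == 0)
        shmem_long_sum_to_all(&total, &node_sum, 1, 0, 2,
                              npes / PES_PER_NODE, pwrk2, ps2);

    /* Phase 3: each leader broadcasts the global sum within its node;
     * the broadcast does not write the root's dest, so copy it there. */
    shmem_broadcast64(&result, &total, 1, 0, base, 0, PES_PER_NODE, psb);
    if (me % PES_PER_NODE == 0) result = total;

    shmem_finalize();
    return 0;
}
```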
Programming Models and Applications for Multicores and Manycores | 2017
Chen Shen; Xiaonan Tian; Dounia Khaldi; Barbara M. Chapman
The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to considerable effort to provide suitable programming models for these accelerators, especially within the OpenMP community. While OpenMP 4.5 offers a rich set of directives, clauses and runtime calls to fully utilize accelerators, an efficient implementation of OpenMP 4.5 for GPUs remains a non-trivial task, given their multiple levels of thread parallelism. In this paper, we describe a new implementation of the corresponding features of OpenMP 4.5 for GPUs based on a one-to-one mapping of its loop hierarchy parallelism to the GPU thread hierarchy. We assess the impact of this mapping, in particular the use of GPU warps to handle innermost loop execution, on the performance of GPU execution via a set of benchmarks that include a version of the NAS Parallel Benchmarks specifically developed for this research; we also evaluate matrix-matrix multiplication, Jacobi, Gauss and Laplacian kernels.
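The mapping can be pictured with a small OpenMP 4.5 example: the outer loop is distributed across teams (thread blocks) and the inner loop across the threads, and thus warps, within a block; the matrix-multiply kernel and sizes are illustrative, not the paper's benchmark code:

```c
/* Minimal OpenMP 4.5 offload example: loop hierarchy mapped one-to-one
 * onto the GPU thread hierarchy. */
#include <stdio.h>

#define N 512

int main(void) {
    static float a[N][N], b[N][N], c[N][N];
    /* ... initialize a and b ... */

    #pragma omp target teams distribute map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++) {        /* distributed across teams */
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {    /* threads/warps within a team */
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}
```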
International Conference on Parallel Processing | 2016
Xiaonan Tian; Dounia Khaldi; Deepak Eachempati; Rengan Xu; Barbara M. Chapman
Using compiler directives to program accelerator-based systems through APIs such as OpenACC or OpenMP has increasingly gained popularity due to the portability and productivity advantages it offers. However, when comparing the performance typically achieved to what lower-level programming interfaces such as CUDA or OpenCL provide, directive-based approaches may entail a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the most likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for effectively improving the utilization of register files on accelerator devices and thereby substantially reducing the cost of memory operations. However, the aggressive application of scalar replacement may require a large number of registers, limiting the application of this technique unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize register usage within computations offloaded using OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. Moreover, we extend the OpenACC interface with new clauses, namely dim and small, that reduce the number of scalars to replace. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and memory access latency. It also uses a static feedback strategy to retrieve low-level register information in order to guide the compiler in carrying out the scalar replacement transformation. The new clauses we propose then greatly reduce the number of scalars, lowering the demand for registers. We evaluate SAFARA and the new clauses using SPEC and NAS OpenACC benchmarks; our results suggest that these approaches are effective for improving the overall performance of code executing on GPUs. We achieved speedups of up to 2.5× on the NAS benchmarks and 2.08× on the SPEC benchmarks.
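A minimal before/after sketch of the classical scalar replacement that SAFARA generalizes; the dim and small clauses proposed in the paper are extensions and are deliberately not shown, since they are not standard OpenACC:

```c
/* Before/after scalar replacement in an offloaded OpenACC kernel. */

void scale_rows(float *restrict a, const float *restrict w, int n)
{
    /* Before: w[i] is re-read from memory on every inner iteration. */
    #pragma acc parallel loop copy(a[0:n*n]) copyin(w[0:n])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] *= w[i];
}

void scale_rows_sr(float *restrict a, const float *restrict w, int n)
{
    /* After scalar replacement: the loop-invariant load is hoisted into
     * a scalar, which the back end can keep in a register. */
    #pragma acc parallel loop copy(a[0:n*n]) copyin(w[0:n])
    for (int i = 0; i < n; i++) {
        float wi = w[i];
        for (int j = 0; j < n; j++)
            a[i * n + j] *= wi;
    }
}
```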
Workshop on OpenSHMEM and Related Technologies | 2016
Siddhartha Jana; Tony Curtis; Dounia Khaldi; Barbara M. Chapman
Recent reports on the challenges of programming models at extreme scale suggest a shift from traditional bulk-synchronous execution models to those that support more asynchronous behavior. The OpenSHMEM programming model enables HPC programmers to exploit underlying network capabilities while designing asynchronous communication patterns. The strength of its communication model is fully realized when these patterns are characterized by small, low-latency data transfers. However, for cases with large data payloads coupled with insufficient computation overlap, OpenSHMEM programs suffer from underutilized CPU cycles.
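A minimal sketch of the kind of overlap this line of work motivates, using the standard OpenSHMEM 1.3 non-blocking put interface; the payload size and the stand-in computation are illustrative:

```c
/* Overlap a large transfer with computation via a non-blocking put,
 * completing it with shmem_quiet() only when the data must be visible. */
#include <shmem.h>

#define N (1 << 20)

static double compute_step(void) {       /* stand-in for independent work */
    double s = 0.0;
    for (int i = 0; i < 100000; i++) s += i * 0.5;
    return s;
}

static double src[N];                    /* source buffer */

int main(void) {
    shmem_init();
    int pe = (shmem_my_pe() + 1) % shmem_n_pes();
    double *dst = shmem_malloc(N * sizeof(double));  /* symmetric dest */

    shmem_putmem_nbi(dst, src, N * sizeof(double), pe);  /* start transfer */

    double v = compute_step();           /* overlap: keep the CPU busy */

    shmem_quiet();                       /* complete the outstanding put */
    shmem_barrier_all();
    shmem_free(dst);
    shmem_finalize();
    return v > 0.0 ? 0 : 1;
}
```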