Ahmad Abdelfattah | Researchain

Publication

Featured researches published by Ahmad Abdelfattah.

ieee international conference on high performance computing data and analytics | 2016

Performance, Design, and Autotuning of Batched GEMM for GPUs

Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general.
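As a point of reference for what a batched GEMM computes, the following is a minimal plain-Python sketch of the operation's semantics only. It is not the GPU kernel described in the abstract, and the function name `gemm_batched` and its argument order are assumptions chosen for illustration.

```python
def gemm_batched(alpha, A_batch, B_batch, beta, C_batch):
    """Reference semantics: C[i] = alpha * A[i] @ B[i] + beta * C[i] for each i."""
    out = []
    for A, B, C in zip(A_batch, B_batch, C_batch):
        m, k, n = len(A), len(B), len(B[0])
        out.append([
            [alpha * sum(A[r][p] * B[p][c] for p in range(k)) + beta * C[r][c]
             for c in range(n)]
            for r in range(m)
        ])
    return out

# Two independent 2x2 problems handled by one call.
A_batch = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
B_batch = [[[5, 6], [7, 8]], [[2, 3], [4, 5]]]
C_batch = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
result = gemm_batched(1, A_batch, B_batch, 1, C_batch)
```

Each problem in the batch is independent of the others, which is what lets a GPU implementation process many small matrices concurrently instead of launching one kernel per matrix.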

international conference on conceptual structures | 2016

High-performance Tensor Contractions for GPUs

Ahmad Abdelfattah; Marc Baboulin; Veselin Dobrev; Jack J. Dongarra; Christopher Earl; Joel Falcou; Azzam Haidar; Ian Karlin; Tzanio V. Kolev; Ian Masliah; Stanimire Tomov

We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
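The idea of expressing a contraction as an index reordering followed by a GEMM can be sketched in plain Python. The contraction below, C[i][j][k] = sum over l of A[i][l] * B[j][l][k], and all function names are hypothetical examples chosen for illustration. They are not taken from the paper's framework.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def contract_direct(A, B):
    """C[i][j][k] = sum_l A[i][l] * B[j][l][k], written as explicit loops."""
    I, L = len(A), len(A[0])
    J, K = len(B), len(B[0][0])
    return [[[sum(A[i][l] * B[j][l][k] for l in range(L))
              for k in range(K)] for j in range(J)] for i in range(I)]

def contract_via_gemm(A, B):
    """Same contraction as an index reordering followed by one GEMM."""
    L = len(A[0])
    J, K = len(B), len(B[0][0])
    # Reorder B[j][l][k] -> B_mat[l][j*K + k], so l becomes the row index.
    B_mat = [[B[j][l][k] for j in range(J) for k in range(K)] for l in range(L)]
    C_mat = matmul(A, B_mat)  # shape I x (J*K)
    # Unflatten the fused (j, k) column index.
    return [[row[j * K:(j + 1) * K] for j in range(J)] for row in C_mat]

A = [[1, 2], [3, 4], [5, 6]]                          # I=3, L=2
B = [[[1, 0, 2], [0, 1, 3]], [[2, 1, 0], [1, 1, 1]]]  # J=2, L=2, K=3
same = contract_direct(A, B) == contract_via_gemm(A, B)
```

Once the contracted index is the shared inner dimension of two matrices, the heavy lifting is a single GEMM, which is where data loaded into fast memory can be reused.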

european conference on parallel processing | 2016

High-Performance Matrix-Matrix Multiplications of Very Small Matrices

Ian Masliah; Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Marc Baboulin; Joel Falcou; Jack J. Dongarra

The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices of sizes less than 32, however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.

ACM Transactions on Mathematical Software | 2016

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

Ahmad Abdelfattah; David E. Keyes; Hatem Ltaief

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS efficiently runs on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API. Another optimization technique ensures coalesced memory access when dealing with submatrices, especially for high-level dense linear algebra algorithms. All KBLAS kernels have been extended to a multi-GPU environment, which requires the introduction of new APIs. Considering general matrices, KBLAS is very competitive with existing state-of-the-art kernels and provides a smoother performance across a wide range of matrix dimensions. Considering symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations on all matrix sizes and achieves asymptotically up to 50% and 60% speedup against the best competitor on single-GPU and multi-GPU systems, respectively. Performance results also validate our performance model. A subset of KBLAS high-performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS) for larger dissemination, starting from version 6.0.

ieee international conference on high performance computing data and analytics | 2014

Pipelining computational stages of the tomographic reconstructor for multi-object adaptive optics on a multi-GPU system

Ali Charara; Hatem Ltaief; Damien Gratadour; David E. Keyes; A. Sevin; Ahmad Abdelfattah; Eric Gendron; Carine Morel; Fabrice Vidal

The European Extremely Large Telescope project (E-ELT) is one of Europe's highest priorities in ground-based astronomy. ELTs are built on top of a variety of highly sensitive and critical astronomical instruments. In particular, a new instrument called MOSAIC has been proposed to perform multi-object spectroscopy using the Multi-Object Adaptive Optics (MOAO) technique. The core implementation of the simulation lies in the intensive computation of a tomographic reconstructor (TR), which is used to drive the deformable mirror in real time from the measurements. A new numerical algorithm is proposed (1) to capture the actual experimental noise and (2) to substantially speed up previous implementations by exposing more concurrency, while reducing the number of floating-point operations. Based on the Matrices Over Runtime System at Exascale numerical library (MORSE), a dynamic scheduler drives all computational stages of the tomographic reconstructor simulation and allows tasks to be pipelined and run out of order across different stages on heterogeneous systems, while ensuring data coherency and dependencies. The proposed TR simulation asymptotically outperforms previous state-of-the-art implementations, with up to a 13-fold speedup. At more than 50000 unknowns, this appears to be the largest-scale AO problem submitted to computation, to date, and opens new research directions for extreme scale AO simulations.

international conference on conceptual structures | 2016

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs

Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.
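To make the notion of a variable-size batched factorization concrete, here is a minimal plain-Python sketch using the textbook Cholesky algorithm. It illustrates the interface only, one call factoring many small matrices of possibly different sizes. It is not the tuned GPU design from the paper, and the names `cholesky` and `cholesky_batched` are assumptions.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L * L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def cholesky_batched(batch):
    """Factor every matrix in the batch; sizes may differ (variable-size batch)."""
    return [cholesky(A) for A in batch]

batch = [
    [[4.0, 2.0], [2.0, 5.0]],                              # 2x2
    [[9.0, 3.0, 0.0], [3.0, 5.0, 2.0], [0.0, 2.0, 17.0]],  # 3x3
]
factors = cholesky_batched(batch)
```

In a fixed-size batch every matrix has the same dimension, so all problems do identical work. In a variable-size batch the work per matrix differs, which is one reason the two cases call for different kernel designs.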

european conference on parallel processing | 2015

High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications

Ahmad Abdelfattah; Hatem Ltaief; David E. Keyes

Leveraging optimization techniques (e.g., register blocking and double buffering) introduced in the context of KBLAS, a Level 2 BLAS high performance library on GPUs, the authors implement dense matrix-vector multiplications within a sparse-block structure. While these optimizations are important for high performance dense kernel executions, they are even more critical when dealing with sparse linear algebra operations. The most time-consuming phase of many multicomponent applications, such as models of reacting flows or petroleum reservoirs, is the solution at each implicit time step of large, sparse spatially structured or unstructured linear systems. The standard method is a preconditioned Krylov solver. The Sparse Matrix-Vector multiplication (SpMV) is, in turn, one of the most time-consuming operations in such solvers. Because there is no data reuse of the elements of the matrix within a single SpMV, kernel performance is limited by the speed at which data can be transferred from memory to registers, making the bus bandwidth the major bottleneck. On the other hand, in case of a multi-species model, the resulting Jacobian has a dense block structure. For contemporary petroleum reservoir simulations, the block size typically ranges from three to a few dozen among different models, and still larger blocks are relevant within adaptively model-refined regions of the domain, though generally the size of the blocks, related to the number of conserved species, is constant over large regions within a given model. This structure can be exploited beyond the convenience of a block compressed row data format, because it offers opportunities to hide the data motion with useful computations. The new SpMV kernel outperforms existing state-of-the-art implementations on single and multi-GPUs using matrices with dense block structure representative of porous media applications with both structured and unstructured multi-component grids.
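The dense block structure mentioned above is commonly stored in a block compressed sparse row (BSR) layout. The following plain-Python sketch shows an SpMV over that layout, assuming square blocks of a single constant size. The function name and array names (`row_ptr`, `col_idx`, `blocks`) are illustrative assumptions, and the sketch does not reproduce the paper's GPU kernel or its latency-hiding techniques.

```python
def bsr_spmv(block_size, row_ptr, col_idx, blocks, x):
    """y = A @ x for A stored in block compressed sparse row (BSR) format.

    row_ptr[i]..row_ptr[i+1] indexes the nonzero blocks of block row i;
    col_idx[p] is the block column of blocks[p], a dense b x b matrix.
    """
    b = block_size
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * b)
    for i in range(n_block_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            block = blocks[p]
            # Dense b x b matrix-vector product on the matching slice of x.
            for r in range(b):
                y[i * b + r] += sum(block[r][c] * x[j * b + c] for c in range(b))
    return y

# 4x4 matrix with 2x2 blocks: block row 0 has blocks at columns 0 and 1,
# block row 1 has one block at column 1.
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
blocks = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
x = [1.0, 1.0, 2.0, 3.0]
y = bsr_spmv(2, row_ptr, col_idx, blocks, x)
```

Each stored block contributes a small dense matrix-vector product, which is the point where dense GEMV optimizations can be reused inside the sparse kernel.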

Proceedings of SPIE | 2014

A novel fast and accurate pseudo-analytical simulation approach for MOAO

Eric Gendron; Ali Charara; Ahmad Abdelfattah; Damien Gratadour; David E. Keyes; Hatem Ltaief; Carine Morel; Fabrice Vidal; A. Sevin; Gerard Rousset

Multi-object adaptive optics (MOAO) is a novel adaptive optics (AO) technique for wide-field multi-object spectrographs (MOS). MOAO aims at applying dedicated wavefront corrections to numerous separated tiny patches spread over a large field of view (FOV), limited only by that of the telescope. The control of each deformable mirror (DM) is done individually using a tomographic reconstruction of the phase based on measurements from a number of wavefront sensors (WFS) pointing at natural and artificial guide stars in the field. We have developed a novel hybrid, pseudo-analytical simulation scheme, somewhere in between the end-to-end and purely analytical approaches, that allows us to simulate in detail the tomographic problem as well as noise and aliasing with a high fidelity, and including fitting and bandwidth errors thanks to a Fourier-based code. Our tomographic approach is based on the computation of the minimum mean square error (MMSE) reconstructor, from which we derive numerically the covariance matrix of the tomographic error, including aliasing and propagated noise. We are then able to simulate the point-spread function (PSF) associated with this covariance matrix of the residuals, as in PSF reconstruction algorithms. The advantage of our approach is that we compute the same tomographic reconstructor that would be computed when operating the real instrument, so that our developments open the way for a future on-sky implementation of the tomographic control, plus the joint PSF and performance estimation. The main challenge resides in the computation of the tomographic reconstructor, which involves the inversion of a large matrix (typically 40 000 × 40 000 elements). To perform this computation efficiently, we chose an optimized approach based on the use of GPUs as accelerators and an optimized linear algebra library, MORSE, which provides a significant speedup against standard CPU-oriented libraries such as Intel MKL.
Because the covariance matrix is symmetric, several optimization schemes can be envisioned to speed up the computation even further. Optimizing the speed of the reconstructor computation is of major interest not only for the design study of MOAO instruments, but also for future routine operations of the system, as the reconstructor has to be updated regularly to cope with atmospheric variability.

international conference on parallel processing | 2012

Systematic approach in optimizing numerical memory-bound kernels on GPU

Ahmad Abdelfattah; David E. Keyes; Hatem Ltaief

The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high performance numerical libraries like CUBLAS, MAGMA, and CULA provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually score better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs, by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology proved to outperform existing state-of-the-art GPU implementations for the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et al., VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. Considering small and medium size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance yields an average 30% (SP) and 15% (DP) improvement in Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement for the bidiagonal reduction over the implementation provided by CUBLAS 5.0.

ieee international conference on high performance computing data and analytics | 2012

Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators

Ahmad Abdelfattah; Jack J. Dongarra; David E. Keyes; Hatem Ltaief

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improve productivity, while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on NVIDIA Fermi GPUs. Due to its inherent memory-bound nature, this kernel is very critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% improvement over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library in single and double precision arithmetic, respectively.
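For readers unfamiliar with SYMV, the following plain-Python sketch shows the reference operation: a symmetric matrix-vector product that reads only one triangle of the matrix and uses each off-diagonal element twice. It is a serial illustration of the semantics under the assumption of lower-triangular storage, not the Fermi kernel described above. The name `symv_lower` is hypothetical.

```python
def symv_lower(A, x):
    """y = A @ x for symmetric A, reading only the lower triangle (i >= j)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] += A[i][i] * x[i]
        for j in range(i):
            a = A[i][j]          # one load of a_ij ...
            y[i] += a * x[j]     # ... serves row i
            y[j] += a * x[i]     # ... and, by symmetry, row j
    return y

# Upper triangle is filled with None to show it is never read.
A = [[2.0, None, None],
     [1.0, 3.0, None],
     [0.0, 4.0, 5.0]]
y = symv_lower(A, [1.0, 2.0, 3.0])
```

The second update, `y[j] += a * x[i]`, writes to a different output row than the one being traversed, which is the source of the irregular access pattern the abstract refers to.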
