Khairul Kabir | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Khairul Kabir is active.

Explore More

Publication

Featured researches published by Khairul Kabir.

international parallel and distributed processing symposium | 2014

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment

Azzam Haidar; Chongxiao Cao; Asim YarKhan; Piotr Luszczek; Stanimire Tomov; Khairul Kabir; Jack J. Dongarra

Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.

international conference on parallel processing | 2013

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

Jack J. Dongarra; Mark Gates; Azzam Haidar; Yulu Jia; Khairul Kabir; Piotr Luszczek; Stanimire Tomov

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.

Scientific Programming | 2015

HPC programming on Intel many-integrated-core hardware with MAGMA port to Xeon Phi

Jack J. Dongarra; Mark Gates; Azzam Haidar; Yulu Jia; Khairul Kabir; Piotr Luszczek; Stanimire Tomov

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library, that incorporates the developments presented here and, more broadly, provides the DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors.The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer fromthe specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.

euromicro conference on real-time systems | 2011

Cache-Aware Utilization Control for Energy Efficiency in Multi-Core Real-Time Systems

Xing Fu; Khairul Kabir; Xiaorui Wang

Multi-core processors are anticipated to become a major development platform for real-time systems. However, existing power management algorithms are not designed to sufficiently utilize the features available in many multi-core processors, such as shared L2 caches and per-core DVFS, to effectively minimize processor energy consumption while providing real-time guarantees. In this paper, we propose a two-level utilization control solution for energy efficiency in multi-core real-time systems. At the core level, our solution addresses two optimization objectives: controlling the CPU utilization of each core to its desired schedulable bound and minimizing the core energy consumption by adopting per-core DVFS and dynamic L2 cache partitioning to adapt both the CPU frequency-dependent and independent portions of the task execution times of the core. Since traditional control theory cannot handle multiple optimization objectives, a novel utilization controller is designed based on advanced multi-objective model predictive control theory. At the processor level, a cache demand arbitrator is proposed to coordinate the cache size demand from each core and conduct dynamic cache resizing to minimize the leakage power consumption of the shared L2 caches. The energy and time overheads of the proposed control solution are analyzed and addressed in the experiments with well-known benchmarks. Our extensive results demonstrate that our solution outperforms two state-of-the-art power management algorithms that do not consider L2 caches or per-core DVFS, by having more accurate utilization control and less energy consumption.

ieee international conference on high performance computing, data, and analytics | 2015

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors

Khairul Kabir; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

The manycore paradigm shift, and the resulting change in modern computer architectures, has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting with our algorithm design, performance analysis and programing model and moving to kernel optimization. Our goal, by targeting low-level and easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for use on manycore coprocessors, and to show significant performance improvements compared to existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications.

international supercomputing conference | 2017

A Framework for Out of Memory SVD Algorithms

Khairul Kabir; Azzam Haidar; Stanimire Tomov; Aurelien Bouteiller; Jack J. Dongarra

Many important applications – from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction – require the solution of large dense singular value decompositions (SVD). In many cases the problems are too large to fit into the computer’s main memory, and thus require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the SVD communications, as related to hierarchical memories, and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but can also be applied between other consecutive levels of the memory hierarchy, e.g., GPU SVD using the CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communications for both classical one-stage blocked SVD and two-stage tiled SVD. We present the theoretical analysis and strategies to design, as well as implement, these communication avoiding OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models.

ieee high performance extreme computing conference | 2017

Out of memory SVD solver for big data

Azzam Haidar; Khairul Kabir; Diana Fayad; Stanimire Tomov; Jack J. Dongarra

Many applications — from data compression to numerical weather prediction and information retrieval — need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computers main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is when trying to analyze a large data set through tools like MATLAB or Octave, but the data is just too large to be loaded. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce, as well as overlap communication with computation. Of particular interest is OOM algorithms for matrices of size m × n, where m >> n or m << n, e.g., corresponding to cases of too many variables, or too many observations. To design OOM SVDs, we first study the communications cost for the SVD techniques as well as for the QR/LQ factorization followed by SVD. We present the theoretical analysis about the data movement cost and strategies to design OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.

international conference on computational science | 2015

Performance Analysis and Optimisation of Two-Sided Factorization Algorithms for Heterogeneous Platform

Khairul Kabir; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

Many applications, ranging from big data analytics to nanostructure designs, require the solution of large dense singular value decomposition (SVD) or eigenvalue problems. A first step in the solution methodology for these problems is the reduction of the matrix at hand to condensed form by two-sided orthogonal transformations. This step is standardly used to significantly accelerate the solution process. We present a performance analysis of the main two-sided factorizations used in these reductions: the bidiagonalization, tridiagonalization, and the upper Hessenberg factorizations on heterogeneous systems of multicore CPUs and Xeon Phi coprocessors. We derive a performance model and use it to guide the analysis and to evaluate performance. We develop optimized implementations for these methods that get up to 80% of the optimal performance bounds. Finally, we describe the heterogeneous multicore and coprocessor development considerations and the techniques that enable us to achieve these high-performance results. The work here presents the first highly optimized implementation of these main factorizations for Xeon Phi coprocessors. Compared to the LAPACK versions optmized by Intel for Xeon Phi (in MKL), we achieve up to 50% speedup.

international parallel and distributed processing symposium | 2016

Heterogeneous Streaming

Chris J. Newburn; Gaurav Bansal; Michael Wood; Luis Crivelli; Judit Planas; Alejandro Duran; Paulo Souza; Leonardo Borges; Piotr Luszczek; Stanimire Tomov; Jack J. Dongarra; Hartwig Anzt; Mark Gates; Azzam Haidar; Yulu Jia; Khairul Kabir; Ichitaro Yamazaki; Jesús Labarta

ieee international conference on high performance computing data and analytics | 2015