Abid Muslim Malik
University of Houston
Publications
Featured research published by Abid Muslim Malik.
international parallel and distributed processing symposium | 2014
Ahmad Qawasmeh; Abid Muslim Malik; Barbara M. Chapman
OpenMP tasks add a new dimension of concurrency for capturing irregular parallelism within applications. They allow programmers to express concurrency at a high level of abstraction and make the OpenMP runtime responsible for the burden of scheduling parallel execution. Observing the performance of OpenMP task scheduling strategies portably across shared memory platforms has been a challenge due to the lack of performance interface standards in the runtime layer. In this paper, we exploit our proposed tasking extensions to the OpenMP Runtime API (ORA), known as the Collector APIs, for profiling task-level parallelism. We describe the integration of these Collector APIs, implemented in the OpenUH compiler, into the TAU performance system. Our proposed task extensions are in line with the new interface specification called OMPT, which is currently under evaluation by the OpenMP community. We use this integration to analyze various OpenMP task scheduling strategies implemented in OpenUH. The capabilities of these scheduling strategies are evaluated with respect to exploiting data locality, maintaining load balance, and minimizing overhead costs. We present a comprehensive performance study of diverse OpenMP benchmarks from the Barcelona OpenMP Test Suite, comparing different task pools (DEFAULT, SIMPLE, SIMPLE_2LEVEL, PUBLIC_PRIVATE), task queues (DEQUE, FIFO, CFIFO, LIFO, INV_DEQUE), and task queue storages (ARRAY, DYN_ARRAY, LIST, LOCKLESS) on an AMD Opteron multicore system (48 cores total). Our results show that benchmarks with similar characteristics exhibit the same behavior in terms of the performance of the applied scheduling strategies. Moreover, the task pool configuration, which controls the organization of task queues, was found to have the highest impact on performance.
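The FIFO vs. LIFO queue disciplines compared in this study can be illustrated with a toy task queue. This is only a sketch of the two orderings; the class and names below are invented for illustration and are not OpenUH's actual task pool implementation:

```python
from collections import deque

class TaskQueue:
    """Toy task queue supporting the FIFO and LIFO disciplines
    compared in the study (illustrative only, not OpenUH code)."""
    def __init__(self, policy="FIFO"):
        self.policy = policy
        self.q = deque()

    def push(self, task):
        self.q.append(task)

    def pop(self):
        # FIFO dequeues the oldest task; LIFO dequeues the newest,
        # which tends to favor data locality for recursively
        # generated tasks.
        return self.q.popleft() if self.policy == "FIFO" else self.q.pop()

fifo, lifo = TaskQueue("FIFO"), TaskQueue("LIFO")
for t in ("t0", "t1", "t2"):
    fifo.push(t)
    lifo.push(t)
fifo_order = [fifo.pop() for _ in range(3)]
lifo_order = [lifo.pop() for _ in range(3)]
```

The same enqueue sequence yields opposite service orders, which is exactly the behavioral difference the scheduling study measures at scale.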
international workshop on openmp | 2013
Ahmad Qawasmeh; Abid Muslim Malik; Barbara M. Chapman; Kevin A. Huck; Allen D. Malony
The introduction of tasks in the OpenMP programming model brings a new level of parallelism. It also creates new challenges for defining and applying event-based performance profiling. The OpenMP Architecture Review Board (ARB) has approved an interface specification known as the “OpenMP Runtime API for Profiling” to enable performance tools to collect performance data for OpenMP programs. In this paper, we propose new extensions to the OpenMP Runtime API for profiling task-level parallelism. We present an efficient method to distinguish individual task instances in order to track their associated events at the micro level. We implement the proposed extensions in the OpenUH compiler, an open-source OpenMP compiler. With negligible overhead, we are able to capture important events such as task creation, execution, suspension, and exit. These events help identify overheads associated with the OpenMP tasking model, e.g., the time a task waits before it starts execution or the cost of task cleanup. They also help construct the parent-child relationships that define tasks’ call paths. The proposed extensions are in line with the specifications recently proposed by the OpenMP tools committee for task profiling.
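The event capture and parent-child reconstruction described above can be sketched as a set of profiling callbacks that record events and link each task instance to its parent. The callback names and task IDs below are hypothetical, invented to illustrate the idea rather than to reproduce the actual Collector API:

```python
# Hypothetical event-based task profiler: the runtime invokes
# callbacks, and the tool reconstructs parent-child call paths.
events = []
task_parent = {}

def on_task_create(task_id, parent_id):
    events.append(("create", task_id))
    task_parent[task_id] = parent_id

def on_task_begin(task_id):
    events.append(("begin", task_id))

def on_task_end(task_id):
    events.append(("end", task_id))

# Simulated runtime trace: implicit task 0 spawns task 1,
# which spawns task 2.
on_task_create(1, 0)
on_task_begin(1)
on_task_create(2, 1)
on_task_end(1)
on_task_begin(2)
on_task_end(2)

def call_path(task_id):
    """Walk parent links back to the root to build the task's call path."""
    path = []
    while task_id in task_parent:
        path.append(task_id)
        task_id = task_parent[task_id]
    return list(reversed(path))
```

Note that the gap between a task's "create" and "begin" events is precisely the kind of waiting overhead the paper's extensions expose.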
international conference on machine learning and applications | 2010
Abid Muslim Malik
Modern compilers provide optimization options to obtain better performance for a given program. Effective selection of these options is a challenging task. Recent work has shown that machine learning can be used to select the best compiler optimization options for a given program. Machine learning techniques rely on selecting features that represent a program well, and the quality of these features is critical to their performance. Previous work on feature selection for program representation is based on code size, the most frequently executed parts, parallelism, and memory access patterns within a program. Spatial information, that is, how instructions are distributed within a program, has never been studied as a source of features for selecting the best compiler options with machine learning techniques. In this paper, we present a framework that addresses how to capture the spatial information within a program and transform it into features for machine learning techniques. Extensive experimentation is done using the SPEC2006 and MiBench benchmark applications. We compare our work with the IBM Milepost-gcc framework, which provides a comprehensive set of features for applying machine learning to the compiler option selection problem. Results show that machine learning techniques using spatial features outperform those using the Milepost features. With 66 available compiler options, we are also able to achieve 70% of the potential speedup obtained through iterative compilation.
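One simple way to picture "spatial" features is to split a program's instruction stream into fixed segments and record how each instruction kind is distributed across them. The feature construction below is an illustrative guess at the flavor of such features, not the paper's actual feature set:

```python
# Illustrative spatial feature extraction: for each instruction kind,
# count its occurrences per program segment, so the feature vector
# captures *where* instructions occur, not just how many there are.
def spatial_features(instrs, kinds, segments=4):
    n = len(instrs)
    feats = []
    for kind in kinds:
        counts = [0] * segments
        for i, ins in enumerate(instrs):
            if ins == kind:
                counts[i * segments // n] += 1  # map position to segment
        feats.extend(counts)
    return feats

# Toy "program" as a sequence of instruction kinds.
prog = ["load", "add", "load", "store", "mul", "load", "add", "store"]
fv = spatial_features(prog, ["load", "add"], segments=2)
```

Two programs with identical instruction counts but different layouts now get different vectors, which is the extra signal a purely count-based feature set (like a simple instruction histogram) misses.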
international conference on cluster computing | 2016
Abdullah Shahneous Bari; Nicholas Chaimov; Abid Muslim Malik; Kevin A. Huck; Barbara M. Chapman; Allen D. Malony; Osman Sarood
Power is the most critical resource for exascale high performance computing. In the future, system administrators may have to pay attention to the power consumption of the machine under different workloads, and each application may have to run within an allocated power budget. Achieving the best performance on future machines therefore means optimizing performance subject to a power constraint. This additional requirement should not fall on HPC (High Performance Computing) application developers; optimizing performance for a given power budget should be the responsibility of the high-performance system software stack. Modern machines allow power capping of CPU and memory to implement power budgeting strategies, so finding the best runtime environment for a node at a given power level is important for obtaining the best performance. This paper presents the ARCS (Adaptive Runtime Configuration Selection) framework, which automatically selects the best runtime configuration for each OpenMP parallel region at a given power level. The framework uses the OMPT (OpenMP Tools) API, APEX (Autonomic Performance Environment for eXascale), and Active Harmony to explore the configuration search space and select the best number of threads, scheduling policy, and chunk size for a given power level at run-time. We test ARCS using the NAS Parallel Benchmarks and the proxy application LULESH on Intel Sandy Bridge and IBM Power multi-core architectures. We show that, for a given power level, efficient OpenMP runtime parameter selection can improve the execution time and energy consumption of an application by up to 40% and 42%, respectively.
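The core of the search ARCS performs can be sketched as minimizing measured execution time over a (threads, schedule, chunk) space at a fixed power cap. The cost model below is entirely made up to make the search well-defined; ARCS itself measures real runs through OMPT/APEX rather than using a formula:

```python
import itertools

# Hypothetical stand-in for a measured runtime at a given power cap.
# The numbers model two plausible effects: lower caps throttle the
# CPU, and dynamic scheduling with small chunks adds overhead.
def measure(threads, schedule, chunk, power_cap):
    base = 100.0 / threads
    overhead = {"static": 0.0, "dynamic": 2.0 / chunk}[schedule]
    throttle = 50.0 / power_cap
    return base * throttle + overhead

def best_config(power_cap, threads_opts, sched_opts, chunk_opts):
    """Exhaustively pick the fastest configuration at this power cap."""
    space = itertools.product(threads_opts, sched_opts, chunk_opts)
    return min(space, key=lambda c: measure(*c, power_cap))

cfg = best_config(80, [4, 8, 16], ["static", "dynamic"], [1, 8])
```

A real framework would replace exhaustive search with Active Harmony's guided search and re-evaluate per parallel region, but the objective (best configuration under a power constraint) is the same.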
international workshop on energy efficient supercomputing | 2014
Anilkumar Nandamuri; Abid Muslim Malik; Ahmad Qawasmeh; Barbara M. Chapman
Power and energy have become dominant aspects of hardware and software design in high performance computing (HPC). Recently, the Department of Defense (DOD) has set a requirement that applications and architectures attain 75 GFLOPS/Watt in order to support future missions. This requires a significant research effort towards power and energy optimization. The OpenMP programming model is an integral part of HPC, and comprehensive analysis of OpenMP programs for power and execution performance is an active research area. Work has been done to characterize OpenMP programs with respect to power performance at the kernel level; however, no work has been done at the OpenMP event level. The OpenMP Runtime API (ORA), proposed by the OpenMP standard committee, allows a performance tool to collect information at the OpenMP event level. In this paper, we present a comprehensive analysis of OpenMP programs using the ORA for power and execution performance. Using hardware counters on the Intel SandyBridge x86-64 and its Running Average Power Limit (RAPL) energy sensors, we measure the power and energy characteristics of OpenMP benchmarks. Our results show that the best execution performance does not always give the best energy usage. We also find that waiting time at barriers and in queues is the main factor behind high power consumption for a given OpenMP program. Our results also show that there are unique fine-grained patterns that a dynamic power management system can use to enhance power performance, and that energy usage varies substantially with the runtime environment.
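The observation that the fastest run is not always the most energy-efficient follows directly from energy being average power times time. The numbers below are invented purely to illustrate the trade-off, not taken from the paper's measurements:

```python
# Toy illustration: a faster, higher-power run can consume more
# total energy than a slower, lower-power run.
def energy_joules(avg_power_w, runtime_s):
    return avg_power_w * runtime_s  # E = P_avg * t

runs = {
    "16 threads": {"power": 120.0, "time": 10.0},  # fastest run
    "8 threads":  {"power": 70.0,  "time": 14.0},  # slower but lower power
}
fastest = min(runs, key=lambda r: runs[r]["time"])
greenest = min(runs, key=lambda r: energy_joules(runs[r]["power"],
                                                 runs[r]["time"]))
```

Here the 16-thread run finishes first but burns 1200 J against 980 J for the 8-thread run, which is the shape of trade-off RAPL-based measurement exposes.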
international conference on machine learning and applications | 2012
Abid Muslim Malik
One of the key features of modern architectures is deep memory hierarchies. In order to exploit this feature, one has to expose data locality within a program. Loop tiling is an optimization phase in modern compilers used to transform a loop to expose data locality. Selecting the best tile size for a given architecture and compiler is known as the optimal tile size selection problem, which is NP-hard. Researchers have built cost models for this problem that characterize the performance of a program as a function of tile sizes; the best tile size for a given loop is then determined directly from these models. Hand-crafting an accurate tile size selection cost model is hard, which raises an important question: can we automatically learn a tile size selection model? In this paper, we show that a fairly accurate model can be learned from simple dynamic program features with standard machine learning techniques. We evaluate our approach on different architecture and compiler combinations. Our model consistently shows near-optimal performance (within 4% of the optimal) across all architecture and compiler combinations.
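Loop tiling itself is a mechanical transformation: a 2D iteration space is traversed in T x T blocks so each block's data stays cache-resident. The minimal sketch below shows the transformation on a reduction; the tile size of 2 is arbitrary, and choosing it well is exactly the selection problem the paper learns a model for:

```python
# Tiled traversal of an N x N iteration space. The result is
# identical to a plain row-major loop; only the visit order (and
# therefore cache behavior) changes.
def sum_tiled(a, tile):
    n = len(a)
    total = 0
    for ii in range(0, n, tile):          # tile row start
        for jj in range(0, n, tile):      # tile column start
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += a[i][j]
    return total

a = [[i * 4 + j for j in range(4)] for i in range(4)]
```

Any valid tile size gives the same answer, which is why tile size can be tuned freely for performance without affecting correctness.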
ieee international conference on high performance computing data and analytics | 2009
Thorsten Matthias Riechers; Shyh-hao Kuo; Rick Siow Mong Goh; Harold Soh; Terence Hung; Abid Muslim Malik
We present in this paper a case study of the benefits of refactoring a large-scale agent-based infectious disease simulator into components. This approach to software development has the potential to increase developer productivity as well as to offer better application performance through a component-based programming paradigm. In our approach, we have designed and developed a generalized component-based optimizer for the compositional adaptation of the compute-intensive kernels. This optimizer provides an efficient mechanism for creating multiple implementations of kernels. It possesses a dynamic adaptation feature that selects the most appropriate kernel implementation based on runtime conditions, achieving performance that is otherwise unattainable with a single implementation. Our experimental results demonstrate that the component-based simulator with dynamic adaptation enabled achieves a performance gain of 2 to 4 times compared to the original non-component-based code.
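The compositional adaptation idea, that is, registering several implementations of one kernel and dispatching among them at run-time, can be sketched with a small registry. The kernel names, predicates, and the size-based condition below are all hypothetical, chosen only to show the mechanism:

```python
# Illustrative compositional adaptation: multiple implementations of
# one kernel are registered, and a dispatcher picks one based on a
# runtime condition (here, a made-up problem-size threshold).
KERNELS = {}

def register(name, predicate):
    def wrap(fn):
        KERNELS[name] = (predicate, fn)
        return fn
    return wrap

@register("small", lambda n: n < 1000)
def kernel_small(n):
    return f"scalar kernel on {n} agents"

@register("large", lambda n: n >= 1000)
def kernel_large(n):
    return f"vectorized kernel on {n} agents"

def dispatch(n):
    """Select the first registered implementation whose predicate holds."""
    for predicate, fn in KERNELS.values():
        if predicate(n):
            return fn(n)
```

A production optimizer would base the predicate on measured runtime conditions rather than a fixed threshold, but the registry-plus-dispatcher structure is the same.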
international workshop on openmp | 2015
Millad Ghane; Abid Muslim Malik; Barbara M. Chapman; Ahmad Qawasmeh
Writing a parallel shared memory application that scales well on future multi-core processors is a challenging task. Contention among shared resources increases with the number of threads, and this can cause false sharing, which degrades the performance of an application. OpenMP Tools (OMPT) [2], a performance tool API for OpenMP, enables performance tools to gather useful performance-related information from OpenMP applications with low overhead. In this paper, we propose a light-weight false sharing detection technique for the OpenMP programming model using OMPT. We show that the OMPT framework can capture unique patterns that can be used to build a quality detection model for false sharing in OpenMP programs. In this work, we treat false sharing detection as a binary classification problem. We develop a set of OpenMP programs in which false sharing can be turned on and off, run them both with and without false sharing, and collect a set of hardware performance event counts using OMPT. We use the collected data to train a binary classifier and test it on the NAS Parallel Benchmark applications. Our experiments show that the trained classifier can detect false sharing cases with an average accuracy of around 90%.
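The binary-classification setup can be sketched with a hand-rolled nearest-centroid classifier over two counter-derived features. The feature meanings (say, cache-line invalidation and miss rates) and all numbers below are fabricated for illustration; the paper's actual events and model may differ:

```python
# Nearest-centroid binary classifier over two synthetic hardware
# counter features: label 1 = false sharing present, 0 = clean run.
def centroid(samples):
    dims = len(samples[0])
    return [sum(s[d] for s in samples) / len(samples) for d in range(dims)]

def classify(x, pos_centroid, neg_centroid):
    # Assign the label of the nearer centroid (squared Euclidean).
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist(pos_centroid) < dist(neg_centroid) else 0

# Synthetic training runs of the same program with false sharing
# toggled on (high counter rates) and off (low rates).
false_sharing = [[9.0, 8.5], [8.0, 9.0], [9.5, 7.5]]  # label 1
clean         = [[1.0, 2.0], [2.0, 1.5], [1.5, 1.0]]  # label 0
pos, neg = centroid(false_sharing), centroid(clean)
pred = classify([8.7, 8.0], pos, neg)
```

The train-on-toggled-programs, test-on-unseen-benchmarks methodology is what lets the paper report accuracy on NAS applications the classifier never saw.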
international conference on machine learning and applications | 2015
Ahmad Qawasmeh; Abid Muslim Malik; Barbara M. Chapman
Task-based programming models adopt different scheduling strategies to exploit parallelism in irregular applications. These strategies differ in how they exploit data locality, maintain load balance, and minimize overhead. OpenMP tasks allow programmers to express unstructured parallelism at a high level of abstraction and make the runtime responsible for the burden of scheduling parallel execution. For irregular applications, the performance of task scheduling often cannot be predicted due to the nature of the application, the compiler used, and platform/architecture dependencies. In this work, we introduce an automatic, portable, and adaptive runtime feedback-driven framework (APARF) that combines standard low-level tasking runtime APIs, a profiling tool we developed, and a hybrid machine learning model. We employ APARF to select the optimum task scheduling scheme for any given application using similarity analysis, based on correlations between the captured runtime APIs, at low profiling cost. Our hybrid model predicts the best scheduling strategy for a variety of unseen applications with an average accuracy of 93%, while maintaining 100% training accuracy. An average performance enhancement of 25% was obtained over the default configuration when APARF was applied to different unseen programs. APARF was also examined on a real application (Molecular Dynamics), where we achieved up to 31% performance improvement. Compared to the Intel, PGI, and GNU compilers, our predicted scheme achieved better performance in most cases.
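Similarity-based prediction of this kind can be sketched as: profile a new application, correlate its profile vector against those of known applications, and adopt the scheduling scheme of the best match. The application names, profile values, and scheme labels below are invented; the real framework uses richer profiles and a hybrid model:

```python
import math

# Pearson correlation between two profile vectors, computed by hand.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical training set: (profile vector, best-known scheme).
training = {
    "fib":  ([100, 5, 2], "LIFO"),
    "sort": ([10, 80, 40], "FIFO"),
}

def predict(profile):
    """Adopt the scheme of the most similar (best-correlated) training app."""
    best = max(training, key=lambda app: pearson(profile, training[app][0]))
    return training[best][1]

scheme = predict([90, 8, 3])
```

The new profile correlates strongly with the recursive "fib"-like workload, so its scheme is recommended, which mirrors the paper's similarity-analysis step in miniature.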
ieee international conference on high performance computing data and analytics | 2012
Yonghong Yan; Jeremy Kemp; Xiaonan Tian; Abid Muslim Malik; Barbara M. Chapman
For many scientific applications, dense matrix multiplication is one of the most important and computation-intensive linear algebra operations. Efficient matrix multiplication on high performance and parallel computers requires optimizing how matrices are decomposed and exchanged between computational nodes, to reduce communication and synchronization overhead, as well as efficiently exploiting the memory hierarchy within a node to improve both spatial and temporal data locality. In this paper, we present our studies of the performance, cache behavior, and energy efficiency of multiple parallel matrix multiplication algorithms on a multicore desktop computer and a medium-size shared memory machine, both considered reference node sizes for building medium- and large-scale computational clusters for high performance computing in industry and national laboratories. Our results highlight both performance and energy efficiency, and also shed light on the memory and resource pressure of these algorithms. We hope this helps users choose the appropriate implementation for their specific data sets when composing larger-scale scientific applications that use parallel matrix multiplication kernels on a node.
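The cache-conscious variants such a study compares are typically blocked (tiled) multiplications, which reuse a block of each operand while it is cache-resident. Below is a minimal sequential sketch of the blocked scheme; the block size of 2 is illustrative, and the paper's parallel algorithms layer thread-level decomposition on top of this idea:

```python
# Blocked dense matrix multiply: iterate over block triples, then do
# a small multiply inside each block so A and B blocks stay hot in
# cache. Result is identical to the naive triple loop.
def matmul_blocked(A, B, block=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]  # hoist the reused operand
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += aik * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_blocked(A, B)
```

Because the accumulation into C is order-independent, the block size can be tuned per cache level (and per node size) without changing the result, which is what makes the performance/energy comparison across algorithms meaningful.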