Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Apan Qasem is active.

Publication


Featured research published by Apan Qasem.


computing frontiers | 2011

Understanding stencil code performance on multicore architectures

Shah Mohammad Faizur Rahman; Qing Yi; Apan Qasem

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can be used to significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations is difficult to predict due to the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations by applying them with a wide variety of different configurations, using hardware counters to monitor the efficiency of architectural components, and then developing a set of formulas via regression analysis to model their overall performance impact in terms of the affected hardware counter numbers. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
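The rectangular blocking studied in the paper can be sketched in a few lines; the following pure-Python illustration is hypothetical (not the paper's implementation), and the grid and tile sizes are arbitrary:

```python
def jacobi7_naive(a):
    """One sweep of a 7-point Jacobi stencil over the interior of a cubic 3-D grid."""
    n = len(a)
    b = [[[a[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                              a[i][j-1][k] + a[i][j+1][k] +
                              a[i][j][k-1] + a[i][j][k+1] +
                              a[i][j][k]) / 7.0
    return b

def jacobi7_blocked(a, tile=2):
    """The same sweep with rectangular blocking: the interior is traversed tile
    by tile so each block of the grid stays cache-resident while it is updated."""
    n = len(a)
    b = [[[a[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for i0 in range(1, n - 1, tile):
        for j0 in range(1, n - 1, tile):
            for k0 in range(1, n - 1, tile):
                for i in range(i0, min(i0 + tile, n - 1)):
                    for j in range(j0, min(j0 + tile, n - 1)):
                        for k in range(k0, min(k0 + tile, n - 1)):
                            b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                                          a[i][j-1][k] + a[i][j+1][k] +
                                          a[i][j][k-1] + a[i][j][k+1] +
                                          a[i][j][k]) / 7.0
    return b
```

Because Jacobi reads only the old grid, any traversal order produces identical results; the payoff of blocking is cache locality, which this sketch illustrates only structurally.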


compiler construction | 2012

Automatic restructuring of GPU kernels for exploiting inter-thread data locality

Swapneela Unkule; Christopher Shaltz; Apan Qasem

Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. For many applications, however, achieving a high fraction of peak performance on current GPUs still requires significant programmer effort. A key consideration for optimizing GPU code is determining a suitable amount of work to be performed by each thread. Thread granularity not only has a direct impact on occupancy but can also influence data locality at the register and shared-memory levels. This paper describes a software framework to analyze dependencies in parallel GPU threads and perform source-level restructuring to obtain GPU kernels with varying thread granularity. The framework supports specification of coarsening factors through source-code annotation and also implements a heuristic based on estimated register pressure that automatically recommends coarsening factors for improved memory performance. We present preliminary experimental results on a select set of CUDA kernels. The results show that the proposed strategy is generally able to select profitable coarsening factors. More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level for achieving higher performance.
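The thread-coarsening transformation itself can be illustrated with a small host-side sketch (all names here are hypothetical; a real CUDA kernel would derive its index from blockIdx/threadIdx rather than an emulated thread id):

```python
def run_kernel(kernel, n_threads, *args):
    # Emulate a GPU launch: invoke the kernel body once per logical thread id.
    for tid in range(n_threads):
        kernel(tid, *args)

def saxpy_fine(tid, a, x, y, out):
    # Baseline granularity: each thread produces exactly one output element.
    out[tid] = a * x[tid] + y[tid]

def make_saxpy_coarsened(c):
    def kernel(tid, a, x, y, out):
        # Coarsened granularity: each thread produces c consecutive elements,
        # so values held in "registers" (locals like a) are reused across them,
        # at the cost of launching fewer threads (lower occupancy).
        base = tid * c
        for i in range(base, base + c):
            out[i] = a * x[i] + y[i]
    return kernel
```

A coarsening factor of 4 means a quarter as many threads, each doing four elements' worth of work; the framework described above chooses such factors automatically from estimated register pressure.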


languages and compilers for parallel computing | 2008

Exploring the Optimization Space of Dense Linear Algebra Kernels

Qing Yi; Apan Qasem

Dense linear algebra kernels such as matrix multiplication have been used as benchmarks to evaluate the effectiveness of many automated compiler optimizations. However, few studies have looked at collectively applying the transformations and parameterizing them for external search. In this paper, we take a detailed look at the optimization space of three dense linear algebra kernels. We use a transformation scripting language (POET) to implement each kernel-level optimization as applied by ATLAS. We then extensively parameterize these optimizations from the perspective of a general-purpose compiler and use a stand-alone empirical search engine to explore the optimization space using several different search strategies. Our exploration of the search space reveals key interactions among several transformations that must be considered by compilers to approach the level of efficiency obtained through manual tuning of kernels.
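A stand-alone empirical search engine of the kind described can be sketched generically; the parameter names and the synthetic cost function in the usage note are illustrative stand-ins for real kernel timings:

```python
import itertools
import random

def exhaustive_search(space, cost):
    """Try every configuration in the Cartesian product of parameter values.
    space: dict mapping parameter name -> list of candidate values.
    cost: function taking a tuple of values (in space's key order)."""
    best = min(itertools.product(*space.values()), key=cost)
    return dict(zip(space.keys(), best))

def random_search(space, cost, trials=50, seed=0):
    """Sample configurations at random; often competitive when the space is large."""
    rng = random.Random(seed)
    names = list(space.keys())
    best, best_c = None, float('inf')
    for _ in range(trials):
        point = tuple(rng.choice(space[n]) for n in names)
        c = cost(point)
        if c < best_c:
            best, best_c = point, c
    return dict(zip(names, best))
```

For example, a toy cost such as `lambda p: abs(p[0] * p[1] - 64)` over tile sizes and unroll factors mimics the non-orthogonality the paper observes: neither parameter has a best value in isolation, only in combination.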


high performance computing and communications | 2009

Balancing Locality and Parallelism on Shared-cache Multi-core Systems

Michael Jason Cade; Apan Qasem

The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance-enhancing software is far from adequate with regard to the exploitation of hardware features on this complex new architecture. As a result, much of the performance capability of multi-core systems is yet to be realized. This research addresses one facet of this problem by exploring the relationship between data locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. A model is presented for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer relationship. Experimental results suggest that consideration of the synchronization window, or the amount of work individual threads can be allowed to do between synchronizations, allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last level of cache that is shared among processing units. By considering these factors, the calculation of the optimum synchronization window incorporates parallelism and data locality issues for maximum performance.
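The synchronization-window trade-off can be sketched with a deterministic stand-in (a sequential simulation of one producer-consumer pair; the function names and the toy "work" are hypothetical):

```python
def run_pipeline(items, window):
    """Simulate a producer-consumer pair that synchronize every `window` items.
    Returns the consumed output and the number of synchronization points.
    A larger window means fewer synchronizations (less overhead) but a larger
    working set that must stay resident in the shared cache between syncs."""
    out, syncs = [], 0
    for start in range(0, len(items), window):
        batch = [x * 2 for x in items[start:start + window]]  # producer's work
        out.extend(x + 1 for x in batch)                      # consumer's work
        syncs += 1
    return out, syncs
```

The paper's model picks `window` so that the producer's batch still fits in the shared last-level cache when the consumer reads it; this sketch only exposes the two quantities that trade off.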


Proceedings of the 2015 XSEDE Conference on Scientific Advancements Enabled by Enhanced Cyberinfrastructure | 2015

A SIMD tabu search implementation for solving the quadratic assignment problem with GPU acceleration

Clara Novoa; Apan Qasem; Abhilash Chaparala

In the Quadratic Assignment Problem (QAP), n units (usually departments, machines, or electronic components) must be assigned to n locations given the distance between the locations and the flow between the units. The goal is to find the assignment that minimizes the sum of the products of distance traveled and flow between units. The QAP is a combinatorial problem difficult to solve to optimality even for problems where n is relatively small (e.g., n = 30). In this paper, we develop a parallel tabu search algorithm to solve the QAP and leverage the compute capabilities of current GPUs. The single instruction multiple data (SIMD) algorithm is implemented on the Stampede cluster hosted by the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. We enhance our implementation by exploiting the dynamic parallelism made available in the Nvidia Kepler high performance computing architecture. On a series of experiments on the well-known QAPLIB data sets, our algorithm produces solutions that are as good as the best known ones posted in QAPLIB. The worst-case gap relative to the best known solutions was 0.83%. Given the applicability of QAP, our algorithm has very good potential to accelerate scholarly research in Engineering, particularly in the fields of Operations Research and the design of electronic devices. To the best of our knowledge, this work is the first to successfully parallelize the tabu search metaheuristic to solve the QAP with the recency-based feature, implemented serially in [10]. Our work is also the first to exploit GPU dynamic parallelism in a tabu search metaheuristic to solve the QAP.


network and parallel computing | 2010

Exposing tunable parameters in multi-threaded numerical code

Apan Qasem; Jichi Guo; Faizur Rahman; Qing Yi

Achieving high performance on today's architectures requires careful orchestration of many optimization parameters. In particular, the presence of shared caches on multicore architectures makes it necessary to consider, in concert, issues related to both parallelism and data locality. This paper presents a systematic and extensive exploration of the combined search space of transformation parameters that affect both parallelism and data locality in multi-threaded numerical applications. We characterize the nature of the complex interaction between blocking, problem decomposition and selection of loops for parallelism. We identify key parameters for tuning and provide an automatic mechanism for exposing these parameters to a search tool. A series of experiments on two scientific benchmarks illustrates the non-orthogonality of the transformation search space and reiterates the need for integrated transformation heuristics for achieving high performance on current multicore architectures.


high performance computing and communications | 2015

Maximizing Hardware Prefetch Effectiveness with Machine Learning

Saami Rahman; Martin Burtscher; Ziliang Zong; Apan Qasem

Modern processors are equipped with multiple hardware prefetchers, each of which targets a distinct level in the memory hierarchy and employs a separate prefetching algorithm. However, different programs require different subsets of these prefetchers to maximize their performance. Turning on all available prefetchers rarely yields the best performance and, in some cases, prefetching even hurts performance. This paper studies the effect of hardware prefetching on multithreaded code and presents a machine-learning technique to predict the optimal combination of prefetchers for a given application. This technique is based on program characterization and utilizes hardware performance events in conjunction with a pruning algorithm to obtain a concise and expressive feature set. The resulting feature set is used in three different learning models. All necessary steps are implemented in a framework that reaches, on average, 96% of the best possible prefetcher speedup. The framework is built from open-source tools, making it easy to extend and port to other architectures.
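As a toy illustration of the idea (entirely hypothetical counter features and labels, and a far simpler learner than the three models used in the paper), programs can be mapped to prefetcher settings by similarity of their performance-event signatures:

```python
def predict_prefetchers(train, query):
    """1-nearest-neighbor over normalized hardware-counter feature vectors.
    train: list of (features, prefetcher_mask) pairs, where each bit of the
    mask says whether one of the machine's hardware prefetchers should be
    enabled for programs with that signature."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda ex: sq_dist(ex[0], query))[1]
```

The paper's contribution is largely in choosing the feature set (pruning hardware events down to a concise, expressive subset) so that simple learners like this generalize across programs.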


extreme science and engineering discovery environment | 2014

A SIMD Solution for the Quadratic Assignment Problem with GPU Acceleration

Abhilash Chaparala; Clara Novoa; Apan Qasem

In the Quadratic Assignment Problem (QAP), n units (usually departments, machines, or electronic components) must be assigned to n locations given the distance between the locations and the flow between the units. The goal is to find the assignment that minimizes the sum of the products of distance traveled and flow between units. The QAP is a combinatorial problem difficult to solve to optimality even for problems where n is relatively small (e.g., n = 30). In this paper, we solve the QAP using a parallel algorithm that employs a 2-opt heuristic and leverages the compute capabilities of current GPUs. The algorithm is implemented on the Stampede cluster hosted by the Texas Advanced Computing Center (TACC) at the University of Texas at Austin and on a GPU-equipped server housed at Texas State University. We enhance our implementation by fine tuning the occupancy levels and by exploiting inter-thread data locality through improved shared memory allocation. On a series of experiments on the well-known QAPLIB data sets, our algorithm, on average, outperforms an OpenMP implementation by a factor of 16.31 and a Tabu search based GPU implementation by a factor of 58.61. Given the wide applicability of QAP, the algorithm we propose has very good potential to accelerate discovery in scholarly research in Engineering, particularly in the fields of Operations Research and the design of electronic devices.
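A key ingredient of 2-opt GPU implementations for the QAP is incremental move evaluation. Below is a sketch (with hypothetical names, assuming symmetric distance and flow matrices with zero diagonals) of the O(n) delta computation each thread would perform instead of recomputing the full O(n^2) objective:

```python
def full_cost(perm, dist, flow):
    """Full QAP objective, used here only to validate the incremental delta."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

def swap_delta(perm, dist, flow, r, s):
    """Cost change of exchanging the locations of units r and s.
    Only terms involving r or s change; the (r, s) term itself is invariant
    under the swap when dist is symmetric, leaving an O(n) sum over k."""
    return 2 * sum((flow[r][k] - flow[s][k]) *
                   (dist[perm[s]][perm[k]] - dist[perm[r]][perm[k]])
                   for k in range(len(perm)) if k not in (r, s))
```

With one thread per (r, s) pair, all n(n-1)/2 candidate moves of a 2-opt step can be scored in parallel, which is the part of the algorithm that maps naturally onto SIMD hardware.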


green technologies conference | 2012

Improved Energy Efficiency for Multithreaded Kernels through Model-Based Autotuning

Apan Qasem; Michael Jason Cade; Dan E. Tamir

In the last few years, the emergence of multicore architectures has revolutionized the landscape of high-performance computing. The multicore shift has not only increased the per-node performance potential of computer systems but also has made great strides in curbing power and heat dissipation. As we look to the future, however, the gains in performance and energy efficiency are not going to come from hardware alone. Software needs to play a key role in achieving a high fraction of peak and keeping the energy consumption within the desired envelope. To attain this goal, performance-enhancing and energy-conserving software needs to carefully orchestrate many architecture-sensitive parameters. In particular, the presence of shared caches on multicore architectures makes it necessary to consider, in concert, issues related to both parallelism and data locality to achieve the desired power-performance ratio. This paper studies the complex interaction among several code transformations that affect data locality, problem decomposition and selection of loops for parallelism. We characterize this interaction using static compiler analysis and generate a pruned search space suitable for efficient autotuning. We also extend a heuristic based on the number of threads, data reuse patterns, and the size and configuration of the shared cache, to estimate a good synchronization interval for conserving energy in parallel code. We validate our choice of tuning parameters and evaluate our heuristic with experiments on a set of scientific and engineering kernels on four different multicore platforms. Results of the experimental study reveal several interesting properties of the transformation search space and demonstrate the effectiveness of the heuristic in predicting good synchronization intervals that reduce energy consumption without a significant degradation in performance.


acm southeast regional conference | 2009

A case for compiler-driven superpage allocation

Joshua Magee; Apan Qasem

Most modern microprocessor-based systems provide support for superpages at both the hardware and software levels. Judicious use of superpages can significantly cut down the number of TLB misses and improve overall system performance. However, indiscriminate superpage allocation results in page fragmentation and increased application footprint, which often outweigh the benefits of reduced TLB misses. Previous research has explored policies for smart allocation of superpages from an operating systems perspective. This paper presents a compiler-based strategy for automatic and profitable memory allocation via superpages. A significant advantage of a compiler-based approach is the availability of data-reuse information within an application. Our strategy employs data-locality analysis to estimate the TLB demands of a program and uses this metric to determine if the program will benefit from superpage allocation. Apart from its obvious utility in improving TLB performance, this strategy can be used to improve the effectiveness of certain data-layout transformations and can be a useful tool in benchmarking and empirical tuning. We demonstrate the effectiveness of this strategy with experiments on an Intel Core 2 Duo with a two-level TLB.
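The profitability test the paper describes, estimating TLB demand and recommending superpages only when it pays off, can be caricatured as follows (the TLB capacity and page sizes are illustrative constants, not those of any particular processor, and the function names are hypothetical):

```python
def tlb_entries_needed(addresses, page_size):
    """Number of distinct pages (hence TLB entries) touched by an access stream."""
    return len({addr // page_size for addr in addresses})

def prefers_superpages(addresses, tlb_capacity=64,
                       base_page=4096, superpage=2 * 1024 * 1024):
    """Recommend superpages only when the base-page footprint overflows the TLB
    while the superpage footprint does not, roughly the trade-off the paper's
    data-locality analysis estimates at compile time."""
    return (tlb_entries_needed(addresses, base_page) > tlb_capacity and
            tlb_entries_needed(addresses, superpage) <= tlb_capacity)
```

The compiler's advantage over an OS-level policy is that the address stream here can be predicted from data-reuse analysis rather than observed after fragmentation has already occurred.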

Collaboration


Dive into Apan Qasem's collaborations.

Top Co-Authors

Clara Novoa
Texas State University

Qing Yi
University of Colorado Colorado Springs