Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Masab Ahmad is active.

Publication


Featured research published by Masab Ahmad.


IEEE International Symposium on Workload Characterization | 2015

CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

Masab Ahmad; Farrukh Hijaz; Qingchuan Shi; Omer Khan

Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
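
A minimal sketch (not CRONO code) of the kind of kernel the abstract characterizes: one level of a parallel BFS over a CSR graph, showing the irregular, data-dependent neighbor accesses and the fine-grain synchronization between threads. The graph layout, partitioning, and names are illustrative assumptions.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct CsrGraph {
    std::vector<uint32_t> row_offsets;  // size = num_vertices + 1
    std::vector<uint32_t> neighbors;    // concatenated adjacency lists
};

// Expand one BFS frontier in parallel; each thread takes a contiguous slice.
void expand_frontier(const CsrGraph& g,
                     const std::vector<uint32_t>& frontier,
                     std::vector<std::atomic<uint32_t>>& dist,
                     uint32_t level, unsigned num_threads) {
    auto work = [&](size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i) {
            uint32_t v = frontier[i];
            for (uint32_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e) {
                uint32_t n = g.neighbors[e];   // irregular, data-dependent access
                uint32_t unvisited = UINT32_MAX;
                // Fine-grain synchronization: claim the vertex atomically.
                dist[n].compare_exchange_strong(unvisited, level + 1);
            }
        }
    };
    std::vector<std::thread> pool;
    size_t chunk = (frontier.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        size_t b = std::min(frontier.size(), t * chunk);
        size_t e = std::min(frontier.size(), b + chunk);
        pool.emplace_back(work, b, e);
    }
    for (auto& th : pool) th.join();
}
```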


IEEE International Symposium on Workload Characterization | 2016

GPU concurrency choices in graph analytics

Masab Ahmad; Omer Khan

Graph analytics is becoming ever more ubiquitous in today's world. However, situational dynamic changes in input graphs, such as changes in traffic and weather patterns, lead to variations in concurrency. Moreover, graph algorithms are known to have data-dependent loops and fine-grain synchronization that make them hard to scale on parallel machines. Recent trends in computing indicate the rise of massively-threaded machines, such as Graphics Processing Units (GPUs). It is of paramount importance to adapt these graph algorithms efficiently to such GPU machines. However, concurrency variations are expected to play a formidable role in achieving good GPU performance. This paper performs an in-depth characterization of GPU architectural choices for graph benchmarks executing on a diverse set of input graphs. The analysis shows that performance improves by a geometric mean of 40% when the optimal number of threads is spawned on a GPU relative to a naive choice that maximizes total thread count. Moreover, an additional 41% performance is achieved when the number of threads per GPU work group is reduced to a setting that optimizes exploitable hardware concurrency. It is also shown that algorithmic auto-tuning coupled with the right architectural choices co-optimizes GPU performance.
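
A minimal host-side sketch of the tuning idea in the abstract, not the paper's method: rather than always launching the maximum number of threads, time a few candidate thread-count and work-group settings and keep the fastest. The structures and the launch_kernel callback are placeholder assumptions supplied by the caller.

```cpp
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

struct LaunchConfig {
    unsigned total_threads;
    unsigned threads_per_group;
};

LaunchConfig pick_best_config(
    const std::vector<LaunchConfig>& candidates,
    const std::function<void(const LaunchConfig&)>& launch_kernel) {
    LaunchConfig best{};
    double best_ms = std::numeric_limits<double>::max();
    for (const auto& cfg : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        launch_kernel(cfg);  // run the graph kernel once with this configuration
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best_ms) { best_ms = ms; best = cfg; }
    }
    return best;  // reuse this configuration for subsequent, similar inputs
}
```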


IEEE High Performance Extreme Computing Conference | 2015

Efficient parallelization of path planning workload on single-chip shared-memory multicores

Masab Ahmad; Kartik Lakshminarasimhan; Omer Khan

Path planning problems arise in many applications where the objective is to find the shortest path from a given source to a destination. In this paper, we explore the comparison of programming languages in the context of parallel workload analysis. We characterize parallel versions of path planning algorithms, such as Dijkstra's algorithm, in the C/C++ and Python languages. Programming language comparisons are done to analyze fine-grain scalability and efficiency using a single-socket shared memory multicore processor. Architectural studies, such as understanding cache effects, are also undertaken to analyze bottlenecks for each parallelization strategy. Our results show that the right parallelization strategy for path planning yields scalability on a commercial multicore processor. However, several shortcomings exist in parallel Python that must be accounted for by HPC researchers.
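
For reference, a minimal sketch of the sequential Dijkstra baseline that such studies parallelize; the adjacency-list layout and weight type are illustrative assumptions, not the paper's implementation. Parallel variants typically distribute the edge-relaxation loop across threads.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, float>;  // (destination, weight)

std::vector<float> dijkstra(const std::vector<std::vector<Edge>>& adj, uint32_t src) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(adj.size(), INF);
    using Item = std::pair<float, uint32_t>;  // (distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[src] = 0.0f;
    pq.push({0.0f, src});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d > dist[v]) continue;  // stale queue entry, skip
        // Parallel versions partition this relaxation loop across threads.
        for (auto [n, w] : adj[v]) {
            if (d + w < dist[n]) {
                dist[n] = d + w;
                pq.push({dist[n], n});
            }
        }
    }
    return dist;
}
```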


Hardware and Architectural Support for Security and Privacy | 2015

Exploring the performance implications of memory safety primitives in many-core processors executing multi-threaded workloads

Masab Ahmad; Syed Kamran Haider; Farrukh Hijaz; Marten van Dijk; Omer Khan

Security is a vital consideration for today's processor architectures, both at the software and hardware layers. However, security schemes are known to incur significant performance overheads. For example, buffer overflow protection schemes perform software checks for bounds on program data structures, and incur performance overheads of up to several orders of magnitude. To mitigate these overheads, prior works either change the security scheme itself, or selectively apply the security scheme to minimize program vulnerabilities. Most of these works also focus primarily on single-core processors, with no prior work done in the context of multicore processors. In this paper, we show how increasing thread counts can help hide the latency overheads of security schemes. We also analyze the architectural implications in the context of multicores, and the insights and challenges associated with applying these security schemes to multithreaded workloads.
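
A minimal sketch of the kind of software bounds check the abstract refers to, not any specific scheme's code: every access pays a comparison and a branch, which is the per-access overhead that additional threads can help hide.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct CheckedBuffer {
    uint8_t* base;
    size_t   size;

    uint8_t& at(size_t index) {
        if (index >= size)                  // bounds check on every access
            throw std::out_of_range("buffer overflow attempt");
        return base[index];
    }
};
```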


International Conference on Computer Design | 2017

GraphTuner: An Input Dependence Aware Loop Perforation Scheme for Efficient Execution of Approximated Graph Algorithms

Hamza Omar; Masab Ahmad; Omer Khan

Graph algorithms have gained popularity and are utilized in high performance and mobile computing paradigms. Input dependence due to input graph changes leads to performance variations in such algorithms. The impact of input dependence for graph algorithms is not well studied in the context of approximate computing. This paper conducts such analysis by applying loop perforation, which is a general approximation mechanism that transforms the program loops to drop a subset of their total iterations. The analysis identifies the need to adapt the inner and outer loop perforation as a function of input graph characteristics, such as the density or size of the graph. A predictive model is proposed to learn the near-optimal loop perforation rates using synthetic input graphs. When the input aware loop perforation model is applied to real world graphs, the evaluated graph algorithms systematically degrade accuracy to achieve performance and power benefits. Results show ~30% performance and ~19% power utilization improvements on average at a program accuracy loss threshold of 10% for an NVidia® GPU. The analysis is also conducted for two concurrent Intel® CPU architectures, an 8-core Xeon™ and a 61-core Xeon Phi machine.
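
A minimal sketch of loop perforation as the abstract describes it: a perforated loop skips iterations at a fixed stride, trading accuracy for time. In GraphTuner the rate comes from the input-aware predictive model; here it is simply a parameter, and the function name is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Accumulate a vertex update over only a fraction of its neighbor contributions.
double perforated_sum(const std::vector<double>& contributions,
                      unsigned perforation_stride) {
    double sum = 0.0;
    // stride == 1 is the exact loop; stride == 4 drops roughly 75% of iterations.
    for (size_t i = 0; i < contributions.size(); i += perforation_stride)
        sum += contributions[i];
    // Rescale so the perforated result approximates the full sum.
    return sum * static_cast<double>(perforation_stride);
}
```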


IEEE Micro | 2017

Efficient Situational Scheduling of Graph Workloads on Single-Chip Multicores and GPUs

Masab Ahmad; Chris J. Michael; Omer Khan

Situational dynamic changes in graph analytic algorithm implementations give rise to efficiency challenges on concurrent hardware, such as GPUs and large-scale multicores. These performance variations stem from input dependence, such as the density and degree of the graph being processed. Consequently, concurrency control becomes challenging, because the complex data-dependent behavior in these workloads exhibits a range of plausible algorithmic and architectural choices. This article addresses the question of how to efficiently harness the multidimensional search space of such choices for graph analytic workloads in a real-time execution environment. A key insight is that architectural choices alone are sufficient to yield a concurrency control setting that is comparable to the optimal setup that optimizes both algorithmic and architectural choices. The authors propose a situationally adaptive scheduler (SAS) that learns the architectural choices offline using synthetically generated graphs. SAS-assisted execution in a real-time setup provides geometric mean performance gains of 40 percent for a large-scale GPU (Nvidia GTX-970), 35 percent for a smaller GPU (Nvidia GTX-750Ti), and 30 percent for a large-scale multicore (Intel Xeon Phi).
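
A minimal sketch of the scheduling idea, not the published SAS model: map coarse input-graph features, learned offline on synthetic graphs, to a concurrency setting at run time. The feature buckets, fallback values, and class names are illustrative assumptions.

```cpp
#include <cstdint>
#include <map>
#include <utility>

struct GraphFeatures { uint64_t vertices; double avg_degree; };
struct ConcurrencySetting { unsigned threads; unsigned threads_per_group; };

class SituationalScheduler {
    // Keyed by (log2 vertex-count bucket, degree bucket), filled during offline training.
    std::map<std::pair<unsigned, unsigned>, ConcurrencySetting> table_;

    static std::pair<unsigned, unsigned> bucket(const GraphFeatures& f) {
        unsigned size_bucket = 0;
        for (uint64_t v = f.vertices; v > 1; v >>= 1) ++size_bucket;
        unsigned degree_bucket = f.avg_degree < 8 ? 0 : (f.avg_degree < 64 ? 1 : 2);
        return {size_bucket, degree_bucket};
    }

public:
    void train(const GraphFeatures& f, const ConcurrencySetting& best) {
        table_[bucket(f)] = best;     // offline: record the best-measured setting
    }
    ConcurrencySetting schedule(const GraphFeatures& f) const {
        auto it = table_.find(bucket(f));
        return it != table_.end() ? it->second
                                  : ConcurrencySetting{256, 64};  // fallback default
    }
};
```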


International Conference on Computer Design | 2015

M-MAP: Multi-factor memory authentication for secure embedded processors

Syed Kamran Haider; Masab Ahmad; Farrukh Hijaz; Astha Patni; Ethan Johnson; Matthew Seita; Omer Khan; Marten van Dijk

The challenges faced in securing embedded computing systems against multifaceted memory safety vulnerabilities have prompted great interest in the development of memory safety countermeasures. These countermeasures either provide protection only against their corresponding type of vulnerability, or incur substantial architectural modifications and overheads in order to provide complete safety, which makes them infeasible for embedded systems. In this paper, we propose M-MAP: a comprehensive system based on multi-factor memory authentication for complete memory safety. We examine certain crucial implications of composing memory integrity verification and bounds checking schemes in a comprehensive system. Based on these implications, we implement M-MAP with hardware-based memory integrity verification and software-based bounds checking to achieve a balance between hardware modifications and performance. We demonstrate that M-MAP implemented on top of a lightweight out-of-order processor delivers complete memory safety with only 32% performance overhead on average, while incurring minimal hardware modifications and area overhead.
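
A minimal sketch of the "multi-factor" composition described in the abstract: an access must pass both a bounds check and an integrity check before its data is used. The digest here is a placeholder hash and everything runs in software; M-MAP performs integrity verification in hardware, so this is only an analogy.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>

struct GuardedRegion {
    const uint8_t* data;
    size_t         size;
    size_t         stored_digest;   // recorded when the region was last written

    static size_t digest(const uint8_t* p, size_t n) {
        return std::hash<std::string_view>{}(
            std::string_view(reinterpret_cast<const char*>(p), n));
    }

    // Factor 1: bounds check. Factor 2: integrity check. Both must pass.
    bool read(size_t index, uint8_t& out) const {
        if (index >= size) return false;                        // spatial safety violated
        if (digest(data, size) != stored_digest) return false;  // tampering detected
        out = data[index];
        return true;
    }
};
```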


ACM Transactions on Embedded Computing Systems | 2018

Declarative Resilience: A Holistic Soft-Error Resilient Multicore Architecture that Trades off Program Accuracy for Efficiency

Hamza Omar; Qingchuan Shi; Masab Ahmad; Halit Dogan; Omer Khan

To protect multicores from soft-error perturbations, research has explored various resiliency schemes that provide high soft-error coverage. However, these schemes incur high performance and energy overheads. We observe that not all soft-error perturbations affect program correctness, and some soft errors only affect program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading off resiliency overheads against program accuracy. This article proposes the idea of declarative resilience, which selectively applies strong resiliency schemes to code regions that are crucial for program correctness (crucial code) and lightweight resiliency to code regions that are susceptible to program accuracy deviations as a result of soft errors (non-crucial code). At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. A cross-layer architecture enables efficient resilience along with holistic soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of machine-learning and graph analytic benchmarks, declarative resilience reduces the performance overhead of a state-of-the-art system that applies strong resiliency to all program code regions from ∼1.43× to ∼1.2×.
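
A minimal software sketch of the declarative split the abstract describes, not the proposed architecture: crucial regions run under a strong scheme (here naive redundant execution with a compare), while non-crucial regions run unprotected so a soft error only perturbs accuracy. The region marker and helper are illustrative assumptions.

```cpp
#include <functional>
#include <stdexcept>

enum class Region { Crucial, NonCrucial };

template <typename T>
T run_region(Region kind, const std::function<T()>& body) {
    if (kind == Region::NonCrucial)
        return body();                 // lightweight: errors may only degrade accuracy
    T first = body();                  // strong: execute twice and compare results
    T second = body();
    if (!(first == second))
        throw std::runtime_error("soft error detected in crucial region");
    return first;
}

// Usage sketch: run_region<int>(Region::Crucial, []{ return compute_index(); });
```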


International Parallel and Distributed Processing Symposium | 2017

Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging

Halit Dogan; Farrukh Hijaz; Masab Ahmad; Brian Kahne; Peter J. Wilson; Omer Khan

Shared memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged by emerging applications that communicate often and involve large amounts of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). A set of auxiliary communication models is proposed that utilizes explicit messages to accelerate synchronization primitives and efficiently move computation towards data. The results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4x and 42%, respectively, over traditional shared memory.
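
A minimal software analogy for the "move computation towards data" idea in the abstract: instead of every thread atomically updating a remote counter and bouncing the cache line around, threads send a small explicit message to the core that owns the data, which applies the update locally. The paper adds such messaging to the ISA in hardware; the queue-based emulation and names below are assumptions for illustration only.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>

struct UpdateMsg { uint64_t index; int64_t delta; };

class OwnerCoreQueue {
    std::mutex mu_;
    std::queue<UpdateMsg> inbox_;
public:
    void send(const UpdateMsg& m) {               // issued by a remote core
        std::lock_guard<std::mutex> g(mu_);
        inbox_.push(m);
    }
    void drain(int64_t* local_data) {             // run on the core that owns the data
        std::lock_guard<std::mutex> g(mu_);
        while (!inbox_.empty()) {
            UpdateMsg m = inbox_.front();
            inbox_.pop();
            local_data[m.index] += m.delta;       // update applied with local accesses only
        }
    }
};
```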


IEEE Transactions on Dependable and Secure Computing | 2017

Advancing the State-of-the-Art in Hardware Trojans Detection

Syed Kamran Haider; Chenglu Jin; Masab Ahmad; Devu Manikantan Shila; Omer Khan; Marten van Dijk

Collaboration


Dive into Masab Ahmad's collaborations.

Top Co-Authors

Omer Khan, University of Connecticut
Farrukh Hijaz, University of Connecticut
Marten van Dijk, University of Connecticut
Halit Dogan, University of Connecticut
Astha Patni, University of Connecticut
Chenglu Jin, University of Connecticut
Hamza Omar, University of Connecticut
Matthew Seita, University of Connecticut
Qingchuan Shi, University of Connecticut