
Publication


Featured research published by Naila Farooqui.


general purpose processing on graphics processing units | 2011

A framework for dynamically instrumenting GPU compute applications within GPU Ocelot

Naila Farooqui; Andrew Kerr; Gregory Frederick Diamos; Sudhakar Yalamanchili; Karsten Schwan

In this paper, we present the design and implementation of a dynamic instrumentation infrastructure for PTX programs that procedurally transforms kernels and manages related data structures. We show how performing instrumentation within the GPU Ocelot dynamic compiler infrastructure provides unique capabilities not available to other profiling and instrumentation toolchains for GPU computing. We demonstrate the utility of this instrumentation capability with three example scenarios: (1) performing workload characterization accelerated by a GPU, (2) providing load imbalance information for use by a resource allocator, and (3) providing compute utilization feedback to be used online by a simulated process scheduler such as might be found in a hypervisor. Additionally, we measure both (1) the compilation overheads of performing dynamic compilation and (2) the increases in runtimes when executing instrumented kernels. For the Parboil benchmark suite, compilation overheads due to instrumentation averaged 69% of the time needed to parse a kernel module. Slowdowns for instrumenting each basic block ranged from 1.5x to 5.5x, with the largest slowdowns attributed to kernels with large numbers of short, compute-bound blocks.
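The core idea of per-basic-block instrumentation can be illustrated with a minimal sketch. This is not GPU Ocelot's actual API: a "kernel" here is just a list of Python callables standing in for basic blocks, and the instrumenter wraps each one so that executing it also bumps a per-block counter, mirroring how a dynamic compiler splices counter updates into PTX before JIT compilation.

```python
# Hypothetical sketch of basic-block instrumentation (not Ocelot's real API).
# A "kernel" is a list of basic blocks; instrument() returns a new kernel
# whose blocks also update per-block execution counters.

def instrument(kernel_blocks, counters):
    """Wrap each basic block so it increments its execution counter."""
    def wrap(index, block):
        def instrumented(state):
            counters[index] += 1   # inserted instrumentation code
            return block(state)    # original basic-block body
        return instrumented
    return [wrap(i, b) for i, b in enumerate(kernel_blocks)]

def run(kernel_blocks, state):
    """Execute the blocks in order, threading the state through."""
    for block in kernel_blocks:
        state = block(state)
    return state

# Toy two-block kernel: scale, then offset.
original = [lambda x: x * 2, lambda x: x + 3]
counts = [0, 0]
instrumented = instrument(original, counts)

result = run(instrumented, 5)   # (5 * 2) + 3 = 13; counts becomes [1, 1]
```

The wrapping itself is cheap; as the paper's measurements suggest, the cost shows up at kernel runtime, and is worst when blocks are short, so the counter update is large relative to the block body.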


architectural support for programming languages and operating systems | 2014

Efficient Instrumentation of GPGPU Applications Using Information Flow Analysis and Symbolic Execution

Naila Farooqui; Karsten Schwan; Sudhakar Yalamanchili

Dynamic instrumentation of GPGPU binaries makes possible real-time introspection methods for performance debugging, correctness checks, workload characterization, and runtime optimization. Such instrumentation involves inserting code at the instruction level of an application while the application is running, making it possible to accurately profile data-dependent application behavior. Runtime overheads incurred by instrumentation, however, can negate its utility. This paper shows how a combination of information flow analysis and symbolic execution can be used to alleviate these overheads. The methods and their effectiveness are demonstrated for a variety of GPGPU codes written in OpenCL that run on AMD GPU target backends. Kernels that can be analyzed entirely via symbolic execution need not be instrumented, thus eliminating kernel runtime overheads altogether. For the remaining GPU kernels, our results show 5-38% improvements in kernel runtime overheads.
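The key idea can be sketched as follows. In this illustrative example (the kernel descriptions and metric are assumptions, not the paper's tooling), a per-thread instruction count that depends only on launch parameters can be derived symbolically once, so runtime instrumentation is skipped; only kernels with data-dependent control flow fall back to instrumented execution.

```python
# Hedged sketch: decide per kernel whether a metric is resolvable symbolically
# or requires runtime instrumentation. Kernel dicts are illustrative stand-ins
# for analysis results, not a real analysis framework.

def instruction_count(kernel, num_threads):
    """Return (instruction count, method used) for a kernel launch."""
    if kernel["trip_count"] is not None:
        # Data-independent loop bound: symbolic execution alone suffices.
        return num_threads * kernel["trip_count"] * kernel["body_insts"], "symbolic"
    # Data-dependent control flow: use (simulated) per-thread instrumentation,
    # where observed_trips holds the trip count each thread actually executed.
    executed = sum(kernel["observed_trips"]) * kernel["body_insts"]
    return executed, "instrumented"

static_kernel = {"trip_count": 8, "body_insts": 4, "observed_trips": None}
dynamic_kernel = {"trip_count": None, "body_insts": 4, "observed_trips": [3, 5, 2, 6]}

print(instruction_count(static_kernel, 64))    # (2048, 'symbolic')
print(instruction_count(dynamic_kernel, 4))    # (64, 'instrumented')
```

The payoff mirrors the paper's result: for the symbolic case there is no kernel-time overhead at all, since nothing is inserted into the running code.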


symposium on code generation and optimization | 2016

A black-box approach to energy-aware scheduling on integrated CPU-GPU systems

Rajkishore Barik; Naila Farooqui; Brian T. Lewis; Chunling Hu; Tatiana Shpeisman

Energy efficiency is now a top design goal for all computing systems, from fitness trackers and tablets, where it affects battery life, to cloud computing centers, where it directly impacts operational cost, maintainability, and environmental impact. Today's widespread integrated CPU-GPU processors combine a CPU and a GPU compute device with different power-performance characteristics. For these integrated processors, hardware vendors implement automatic power management policies that are typically not exposed to the end-user. Furthermore, these policies often vary between different processor generations and SKUs. As a result, it is challenging to design a generally-applicable energy-aware runtime to schedule work onto both the CPU and GPU of such integrated CPU-GPU processors to optimize energy consumption. We propose a new black-box scheduling technique to reduce energy use by effectively partitioning work across the CPU and GPU cores of integrated CPU-GPU processors. Our energy-aware scheduler combines a power model with information about the runtime behavior of a specific workload. This power model is computed once for each processor to characterize its power consumption for different kinds of workloads. On two widely different platforms, a high-end desktop system and a low-power tablet, our energy-aware runtime yields an energy-delay product that is 96% and 93%, respectively, of the near-ideal Oracle energy-delay product on a diverse set of workloads.
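To make the energy-delay product (EDP) objective concrete, here is an illustrative sketch of choosing a CPU/GPU work split with a per-processor power model, in the spirit of the paper's black-box scheduler. The throughput and wattage numbers and the exhaustive grid search are assumptions for the example, not the paper's actual model or platform measurements.

```python
# Toy CPU/GPU partitioning that minimizes energy-delay product (EDP).
# Devices run concurrently on their shares; delay is when the last finishes.

def energy_delay_product(split, total_items, cpu, gpu):
    """split = fraction of items offloaded to the GPU."""
    gpu_time = split * total_items / gpu["items_per_s"]
    cpu_time = (1 - split) * total_items / cpu["items_per_s"]
    delay = max(gpu_time, cpu_time)                       # both must finish
    energy = gpu_time * gpu["watts"] + cpu_time * cpu["watts"]
    return energy * delay

# Hypothetical devices: the GPU is 3x faster but draws 4x the power.
CPU = {"items_per_s": 100.0, "watts": 10.0}
GPU = {"items_per_s": 300.0, "watts": 40.0}

splits = [i / 100 for i in range(101)]
best = min(splits, key=lambda s: energy_delay_product(s, 1000, CPU, GPU))
# With these numbers the EDP minimum is the split where both devices finish
# together: split/300 = (1 - split)/100, i.e. split = 0.75.
```

A real black-box scheduler replaces the hard-coded device dicts with a power model measured once per processor, which is exactly what makes the approach portable across SKUs with hidden power-management policies.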


dependable systems and networks | 2014

HeteroCheckpoint: Efficient Checkpointing for Accelerator-Based Systems

Sudarsun Kannan; Naila Farooqui; Ada Gavrilovska; Karsten Schwan

Moving toward exascale, the number of GPUs in HPC machines is bound to increase, and applications will spend increasing amounts of time running on those GPU devices. While GPU usage has already led to substantial speedup for HPC codes, their failure rates due to overheating are at least 10 times higher than those seen for the CPUs now commonly used on HPC machines. This makes it increasingly important for GPUs to have robust checkpoint/restart mechanisms. This paper introduces a unified CPU-GPU checkpoint mechanism, which can efficiently checkpoint the combined GPU-CPU memory state resident on machine nodes. Efficiency is gained in part by addressing the end-to-end data movements required for checkpointing, from GPU to storage, by introducing novel pre-copy and checksum methods. These methods reduce the checkpoint data movement cost seen by HPC applications, with initial measurements using different benchmark applications showing up to 60% reduced checkpoint overhead. Additional exploration of the use of next-generation storage, like NVM, shows further promise of reduced checkpointing overheads.
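The pre-copy idea can be illustrated with a toy sketch: most memory is copied out while the application keeps running, and only chunks dirtied afterwards are re-copied when the checkpoint is finalized, so the stop-the-world phase moves far less data. The chunk granularity, the dirty-tracking callback, and the SHA-256 checksum below are illustrative assumptions, not the paper's mechanism.

```python
# Toy pre-copy checkpoint: snapshot everything while the app runs, then
# briefly stop and re-copy only the chunks the app dirtied in the meantime.
import hashlib

def precopy_then_finalize(memory, mutate):
    """memory: chunk-id -> data. mutate: runs the app, returns dirty chunk ids."""
    snapshot = dict(memory)          # pre-copy phase, overlapped with execution
    dirty = mutate(memory)           # app keeps writing during pre-copy
    for chunk in dirty:              # brief stop: re-copy only dirty chunks
        snapshot[chunk] = memory[chunk]
    # Checksum lets a restart detect a corrupted checkpoint image.
    digest = hashlib.sha256(repr(sorted(snapshot.items())).encode()).hexdigest()
    return snapshot, len(dirty), digest

mem = {i: i * 10 for i in range(8)}

def app_writes(m):
    m[3] = 999                       # the app dirties one chunk mid-pre-copy
    return [3]

snap, recopied, checksum = precopy_then_finalize(mem, app_writes)
# snap matches the final memory state; only 1 of 8 chunks moved while stopped.
```

On a real system the pre-copy stream would flow GPU -> host -> storage, and the checksum would additionally let redundant (unchanged) chunks be deduplicated across successive checkpoints.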


acm sigplan symposium on principles and practice of parallel programming | 2016

Affinity-aware work-stealing for integrated CPU-GPU processors

Naila Farooqui; Rajkishore Barik; Brian T. Lewis; Tatiana Shpeisman; Karsten Schwan

Recent integrated CPU-GPU processors like Intel's Broadwell and AMD's Kaveri support hardware CPU-GPU shared virtual memory, atomic operations, and memory coherency. This enables fine-grained CPU-GPU work-stealing, but architectural differences between the CPU and GPU hurt the performance of traditionally-implemented work-stealing on such processors. These architectural differences include different clock frequencies, atomic operation costs, and cache and shared memory latencies. This paper describes a preliminary implementation of our work-stealing scheduler, Libra, which includes techniques to deal with these architectural differences in integrated CPU-GPU processors. Libra's affinity-aware techniques achieve significant performance gains over classically-implemented work-stealing. We show preliminary results using a diverse set of nine regular and irregular workloads running on an Intel Broadwell Core-M processor. Libra currently achieves up to a 2× performance improvement over classical work-stealing, with a 20% average improvement.
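One way an affinity-aware policy can differ from classical work-stealing is in steal granularity: a thief stealing across the CPU/GPU boundary takes a larger chunk to amortize the higher cost of cross-device atomics, while same-device steals take a single task. The chunk size below is an assumption for illustration, not Libra's tuned policy.

```python
# Minimal sketch of affinity-aware stealing: chunked cross-device steals.
from collections import deque

def steal(victim_queue, thief_device, victim_device, cross_chunk=4):
    """Take 1 task for same-device steals, cross_chunk for cross-device steals."""
    n = cross_chunk if thief_device != victim_device else 1
    stolen = []
    while victim_queue and len(stolen) < n:
        stolen.append(victim_queue.popleft())
    return stolen

gpu_queue = deque(range(10))
local = steal(gpu_queue, "gpu", "gpu")      # same device: takes task 0
remote = steal(gpu_queue, "cpu", "gpu")     # cross device: takes tasks 1-4
```

A production scheduler would use a lock-free deque and also account for clock-frequency and cache-latency asymmetries; this sketch only shows the granularity decision.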


computing frontiers | 2016

Accelerating graph applications on integrated GPU platforms via instrumentation-driven optimizations

Naila Farooqui; Indrajit Roy; Yuan Chen; Vanish Talwar; Karsten Schwan

Integrated GPU platforms are a cost-effective and energy-efficient option for accelerating data-intensive applications. While these platforms have reduced overhead of offloading computation to the GPU and potential for fine-grained resource scheduling, there remain several open challenges. First, substantial application knowledge is required to leverage GPU acceleration capabilities. Second, static application profiling is inadequate for extracting performance from graph applications that exhibit input-dependent, irregular runtime behaviors. Third, naive scheduling of applications on both CPU and GPU devices may degrade performance due to memory contention. We describe Luminar, a runtime, profile-guided approach to accelerating applications on integrated GPU platforms. By using efficient dynamic instrumentation, Luminar informs resource scheduling about current workload properties. Luminar delivers up to 40% performance improvements for irregular, graph-based applications, along with 21-80% improvements in throughput and 3-60% improvements in energy efficiency when scheduling a mix of applications.
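Why static profiling fails for irregular workloads, and how a dynamic profile can drive placement, can be sketched as follows. In this hypothetical example (the threshold, fields, and decision rule are assumptions, not Luminar's), a lightweight instrumented run measures per-item work, and the scheduler only offloads to the GPU when the measured total parallel work is high enough to pay for the offload.

```python
# Illustrative profile-guided placement: offload only when the dynamically
# measured work justifies it. For input-dependent graph workloads, this
# quantity cannot be known from a static profile.

def choose_device(profile, min_parallel_work=1000):
    """profile holds instrumentation results from a short pilot run."""
    avg_work = sum(profile["work_per_item"]) / len(profile["work_per_item"])
    total_work = avg_work * profile["num_items"]
    return "gpu" if total_work >= min_parallel_work else "cpu"

# Two inputs to the same graph kernel with very different measured behavior.
dense_graph = {"work_per_item": [40, 44, 38, 42], "num_items": 50}
sparse_graph = {"work_per_item": [1, 2, 1, 1], "num_items": 20}

# choose_device(dense_graph) -> 'gpu'; choose_device(sparse_graph) -> 'cpu'
```

The same profile data could also flag memory-intensive kernels so the scheduler avoids co-running them on CPU and GPU, which share bandwidth on an integrated platform.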


acm sigplan symposium on principles and practice of parallel programming | 2016

A systems perspective on GPU computing: a tribute to Karsten Schwan

Naila Farooqui

Over a distinguished career, Regents Professor Karsten Schwan has made significant contributions across a diverse array of topics in computer systems, including operating systems for multi-core platforms, virtualization technologies, enterprise middleware, and high-performance computing. In this paper, we summarize his legacy of key research contributions in general-purpose GPU computing. His vision encompassed the conceptualization, implementation, and demonstration of systems abstractions and runtime methods to elevate GPUs into first-class citizens in today's and future heterogeneous computing environments. To this end, his contributions include novel scheduling and resource management abstractions, runtime specialization, and novel data management techniques to support scalable, distributed GPU frameworks.


international symposium on performance analysis of systems and software | 2012

Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures

Naila Farooqui; Andrew Kerr; Greg Eisenhauer; Karsten Schwan; Sudhakar Yalamanchili


Archive | 2013

Software Reliability Enhancements for GPU Applications

Si Li; Naila Farooqui; Sudhakar Yalamanchili


Archive | 2003

SoftCache: A Technique for Power and Area Reduction in Embedded Systems

Joshua Bruce Fryman; Hsien-Hsin Sean Lee; Chad Marcus Huneycutt; Naila Farooqui; Kenneth M. Mackenzie; David E. Schimmel

Collaboration


Dive into Naila Farooqui's collaboration.

Top Co-Authors

Karsten Schwan (Georgia Institute of Technology)
Sudhakar Yalamanchili (Georgia Institute of Technology)
Ada Gavrilovska (Georgia Institute of Technology)
Andrew Kerr (Georgia Institute of Technology)
Greg Eisenhauer (Georgia Institute of Technology)