
Publications


Featured research published by Perhaad Mistry.


International Conference on Parallel Architectures and Compilation Techniques | 2012

Multi2Sim: a simulation framework for CPU-GPU computing

Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli

Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.


IEEE Transactions on Parallel and Distributed Systems | 2011

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang; Dana Schaa; Perhaad Mistry; David R. Kaeli

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.


General Purpose Processing on Graphics Processing Units | 2011

Analyzing program flow within a many-kernel OpenCL application

Perhaad Mistry; Chris Gregg; Norman Rubin; David R. Kaeli; Kim M. Hazelwood

Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities to a range of applications. Typical applications possess multiple components that can be parallelized; developers need to be equipped with proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) Computer Vision algorithm written in OpenCL. Our profiling framework is developed using built-in OpenCL API function calls, without the need for an external profiler. We show we can begin to identify performance bottlenecks and performance issues present in individual components on different hardware platforms. We demonstrate that run-time profiling based on the OpenCL specification can provide an application developer with a fine-grained look at performance, and that this information can be used to tailor performance improvements for specific platforms.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Byunghyun Jang; Perhaad Mistry; Dana Schaa; Rodrigo Dominguez; David R. Kaeli

Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4×) by applying vectorization using our data transformation approach.


International Conference on Performance Engineering | 2015

NUPAR: A Benchmark Suite for Modern GPU Architectures

Yash Ukidave; Fanny Nina Paravecino; Leiming Yu; Charu Kalra; Amir Momeni; Zhongliang Chen; Nick Materise; Brett Daley; Perhaad Mistry; David R. Kaeli

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not only limited to effectively exploiting data-level parallelism, but includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics that consider memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features that include: nested parallelism, concurrent kernel execution, shared host-device memory and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL, and on high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently released GPU architectures.


Heterogeneous Computing with OpenCL | 2013

Introduction to OpenCL

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter introduces OpenCL, the programming framework that lets developers weave together application components that execute concurrently. It provides an introduction to the basics of using the OpenCL standard when developing parallel programs, describes the four different abstraction models defined in the standard, and presents examples of OpenCL implementations to place some of the abstractions in context. Because OpenCL describes execution in terms of fine-grained work-items and can dispatch vast numbers of work-items on architectures with hardware support for fine-grained threading, it is natural to have concerns about scalability; the hierarchical concurrency model implemented by OpenCL ensures that scalable execution can be achieved even while supporting a large number of work-items. Work-items within a work-group have a special relationship with one another: they can perform barrier operations to synchronize, and they have access to a shared memory address space.


International Symposium on Performance Analysis of Systems and Software | 2013

Quantifying the energy efficiency of FFT on heterogeneous platforms

Yash Ukidave; Amir Kavyan Ziabari; Perhaad Mistry; Gunar Schirner; David R. Kaeli

Heterogeneous computing using Graphics Processing Units (GPUs) has become an attractive computing model given the available scale of data-parallel performance and programming standards such as OpenCL. However, given the energy issues present with GPUs, some devices can exhaust power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this paper we evaluate the power-performance trade-offs of different heterogeneous signal processing applications. More specifically, we compare the performance of 7 different implementations of the Fast Fourier Transform algorithms. Our study covers discrete GPUs and shared memory GPUs (APUs) from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Fermi) and Intel (Ivy Bridge). For this range of platforms, we characterize the different FFTs and identify the specific architectural features that most impact power consumption. Using the 7 FFT kernels, we obtain a 48% reduction in power consumption and up to a 58% improvement in performance across these different FFT implementations. These differences are also found to be target architecture dependent. The results of this study will help the signal processing community identify which class of FFTs are most appropriate for a given platform. More importantly, we have demonstrated that different algorithms implementing the same fundamental function (FFT) can perform vastly differently based on the target hardware and associated programming optimizations.


Architectural Support for Programming Languages and Operating Systems | 2013

Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Perhaad Mistry; Yash Ukidave; Dana Schaa; David R. Kaeli

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. We have seen a growing number of systems incorporating CPUs, Graphics Processors (GPUs) and Accelerated Processing Units (APUs, which combine a CPU and GPU on the same chip). This emerging class of platforms is now being targeted to accelerate applications where the host processor (typically a CPU) and compute device (typically a GPU) cooperate on a computation. In this scenario, the performance of the application is not only dependent on the processing power of the respective heterogeneous processors, but also on the efficient interaction and communication between them. To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that will benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on specific characteristics that can help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages. We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely-coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.


Design Automation Conference | 2014

Exploring the Heterogeneous Design Space for both Performance and Reliability

Rafael Ubal; Dana Schaa; Perhaad Mistry; Xiang Gong; Yash Ukidave; Zhongliang Chen; Gunar Schirner; David R. Kaeli

As we move into a new era of heterogeneous multi-core systems, our ability to tune the performance and understand the reliability of both hardware and software becomes more challenging. Given the multiplicity of different design trade-offs in hardware and software, and the rate of introduction of new architectures and hardware/software features, it becomes difficult to properly model emerging heterogeneous platforms. In this paper we present a new methodology to address these challenges in a flexible and extensible framework. We describe the design of a framework that supports a range of heterogeneous devices to be evaluated based on different performance/reliability criteria. We address heterogeneity both in hardware and software, providing a flexible framework that can be easily adapted and extended as new elements in the SoC stack continue to evolve. Our framework enables modeling at different levels of abstraction and interfaces to existing tools to compose hybrid modeling environments. We also consider the role of software, providing a flexible and modifiable compiler stack based on LLVM. We provide examples that highlight both the flexibility of this framework and demonstrate the utility of the tools.


GPU Computing Gems Emerald Edition | 2011

GPU Acceleration of Iterative Digital Breast Tomosynthesis

Dana Schaa; Benjamin Brown; Byunghyun Jang; Perhaad Mistry; Rodrigo Dominguez; David R. Kaeli; Richard Moore; Daniel B. Kopans

This chapter explores the acceleration of a DBT algorithm on multiple CUDA-enabled GPUs, reducing the execution time to under 20 seconds for eight iterations. Iterative digital breast tomosynthesis (DBT) is a technology that mitigates many of the shortcomings associated with traditional mammography. DBT takes a series of X-ray projections to acquire information about the breast structure. A reconstruction method is required to combine these projections into a three-dimensional image. However, the usability of DBT depends largely on making the time for computation acceptable within a clinical setting. GPU acceleration has been able to reduce both the resources and time required to perform high-quality DBT reconstructions. The performance gains obtained from using GPUs to accelerate iterative MLEM-based DBT reconstruction have been so impressive that time is no longer the dominant concern for clinical application. This work has demonstrated that DBT reconstruction can be less costly and more accurate, while also completing reconstruction in a reasonable amount of time. The algorithm studied in this chapter is representative of a general class of image reconstruction problems, as are the thread-mapping strategies, multi-GPU considerations, and optimizations employed in this work.

Collaboration


Dive into Perhaad Mistry's collaborations.

Top Co-Authors

Dana Schaa (Northeastern University)
Lee Howes (Advanced Micro Devices)
Yash Ukidave (Northeastern University)
Amir Momeni (Northeastern University)