
Publication


Featured research published by Yash Ukidave.


international conference on performance engineering | 2015

NUPAR: A Benchmark Suite for Modern GPU Architectures

Yash Ukidave; Fanny Nina Paravecino; Leiming Yu; Charu Kalra; Amir Momeni; Zhongliang Chen; Nick Materise; Brett Daley; Perhaad Mistry; David R. Kaeli

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not limited to effectively exploiting data-level parallelism, but also includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics that consider memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency, and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features that include: nested parallelism, concurrent kernel execution, shared host-device memory, and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL, targeting high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently released GPU architectures.
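
A minimal sketch of one feature NUPAR stresses, concurrent kernel execution, expressed with the stock OpenCL host API: two independent kernels are enqueued on separate in-order command queues bound to the same device, so a capable runtime can overlap them. The kernels and context are assumed to already exist, and error checking is omitted; this is an illustration, not code from the suite.

```cpp
#include <CL/cl.h>

// Enqueue two independent kernels on separate queues over the same device.
// Flushing both before waiting lets the runtime overlap their execution
// when the hardware supports concurrent kernels.
void run_concurrent(cl_context ctx, cl_device_id dev,
                    cl_kernel kA, cl_kernel kB, size_t global) {
  cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, NULL);
  cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, NULL);

  clEnqueueNDRangeKernel(q1, kA, 1, NULL, &global, NULL, 0, NULL, NULL);
  clEnqueueNDRangeKernel(q2, kB, 1, NULL, &global, NULL, 0, NULL, NULL);
  clFlush(q1);            // submit both before blocking on either
  clFlush(q2);
  clFinish(q1);
  clFinish(q2);

  clReleaseCommandQueue(q1);
  clReleaseCommandQueue(q2);
}
```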


international symposium on performance analysis of systems and software | 2013

Quantifying the energy efficiency of FFT on heterogeneous platforms

Yash Ukidave; Amir Kavyan Ziabari; Perhaad Mistry; Gunar Schirner; David R. Kaeli

Heterogeneous computing using Graphics Processing Units (GPUs) has become an attractive computing model given the available scale of data-parallel performance and programming standards such as OpenCL. However, given the energy issues present with GPUs, some devices can exhaust power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this paper we evaluate the power-performance trade-offs of different heterogeneous signal processing applications. More specifically, we compare the performance of seven different implementations of the Fast Fourier Transform (FFT) algorithm. Our study covers discrete GPUs and shared-memory GPUs (APUs) from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Fermi) and Intel (Ivy Bridge). For this range of platforms, we characterize the different FFTs and identify the specific architectural features that most impact power consumption. Using the seven FFT kernels, we obtain a 48% reduction in power consumption and up to a 58% improvement in performance across these implementations. These differences are also found to be target-architecture dependent. The results of this study will help the signal processing community identify which class of FFTs is most appropriate for a given platform. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform vastly differently depending on the target hardware and associated programming optimizations.
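
To give a flavor of the kind of kernel such a study compares, here is a minimal radix-2 butterfly stage of an iterative Cooley-Tukey FFT in OpenCL C, shown as a C++ raw string. It assumes the input is already in bit-reversed order and that the host launches one pass per stage with span = 1, 2, 4, ..., n/2; it is a generic textbook sketch, not one of the paper's seven implementations.

```cpp
// One radix-2 DIT butterfly stage; the host enqueues log2(n) passes over
// n/2 work-items, doubling 'span' each pass. Input must be bit-reversed.
static const char* fftStage = R"CLC(
kernel void fft_radix2_stage(global float2* data, int span) {
  int i   = get_global_id(0);         // one butterfly per work-item
  int k   = i % span;                 // index within the butterfly group
  int top = (i / span) * span * 2 + k;
  int bot = top + span;

  float angle = -M_PI_F * k / span;   // twiddle w = exp(-i*pi*k/span)
  float2 w = (float2)(cos(angle), sin(angle));
  float2 a = data[top], b = data[bot];
  float2 wb = (float2)(w.x*b.x - w.y*b.y, w.x*b.y + w.y*b.x);

  data[top] = a + wb;                 // butterfly: a' = a + w*b
  data[bot] = a - wb;                 //            b' = a - w*b
}
)CLC";
```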


architectural support for programming languages and operating systems | 2013

Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Perhaad Mistry; Yash Ukidave; Dana Schaa; David R. Kaeli

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. We have seen a growing number of systems incorporating CPUs, Graphics Processors (GPUs) and Accelerated Processing Units (APUs, which combine a CPU and GPU on the same chip). This emerging class of platforms is now being targeted to accelerate applications where the host processor (typically a CPU) and compute device (typically a GPU) cooperate on a computation. In this scenario, the performance of the application depends not only on the processing power of the respective heterogeneous processors, but also on the efficient interaction and communication between them. To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on the specific properties that help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages. We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely-coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.
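
One concrete way to quantify the host-device interaction that Valar targets is OpenCL's built-in event profiling, which splits a command's lifetime into queued, submitted, running, and complete timestamps. A minimal sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and with error checks omitted:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Separate host-side dispatch latency (queued -> start) from device
// execution time (start -> end) for one kernel launch.
void time_kernel(cl_command_queue q, cl_kernel k, size_t global) {
  cl_event ev;
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &ev);
  clWaitForEvents(1, &ev);

  cl_ulong queued, start, end;   // timestamps in nanoseconds
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                          sizeof(queued), &queued, NULL);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, NULL);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, NULL);

  printf("dispatch: %.3f us, execution: %.3f us\n",
         (start - queued) / 1e3, (end - start) / 1e3);
  clReleaseEvent(ev);
}
```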


international parallel and distributed processing symposium | 2016

Mystic: Predictive Scheduling for GPU Based Cloud Servers Using Machine Learning

Yash Ukidave; Xiangyu Li; David R. Kaeli

GPUs have become the primary choice of accelerators for high-end data centers and cloud servers, which can host thousands of disparate applications. With the growing demand for GPUs on clusters, there arises a need for efficient co-execution of applications on the same accelerator device. However, resource contention among co-executing applications causes interference, which degrades execution performance, impacts the QoS requirements of applications, and lowers overall system throughput. While previous work has proposed techniques for detecting interference, the existing solutions are either developed for CPU clusters, or use static profiling approaches which can be computationally intensive and do not scale well. We present Mystic, an interference-aware scheduler for efficient co-execution of applications on GPU-based clusters and cloud servers. The most important feature of Mystic is the use of learning-based analytical models for detecting interference between applications. We leverage a collaborative filtering framework to characterize an incoming application with respect to the interference it may cause when co-executing with other applications while sharing GPU resources. Mystic identifies the similarities between new applications and the executing applications, and guides the scheduler to minimize the interference and improve system throughput. We train the learning model with 42 CUDA applications, and consider a separate set of 55 diverse, real-world GPU applications for evaluation. Mystic is evaluated on a live GPU cluster with 32 NVIDIA GPUs. Our framework achieves performance guarantees for 90.3% of the evaluated applications. When compared with state-of-the-art interference-oblivious schedulers, Mystic improves system throughput by 27.5% on average, and achieves a 16.3% average improvement in GPU utilization.
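
Mystic's actual model is a collaborative filtering framework; the sketch below only illustrates the underlying placement intuition with a much simpler stand-in: score an incoming application's resource profile against each GPU's resident application by cosine similarity, and place it where the profiles overlap least (similar profiles contend for the same resources). All names and the feature encoding are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two resource-usage profiles (e.g., memory
// bandwidth, occupancy, cache hit rate), each normalized per feature.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
  double dot = 0, na = 0, nb = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
}

// Hypothetical placement rule: pick the GPU whose resident application's
// profile is least similar to the incoming one, as a proxy for the least
// expected interference.
int pickGpu(const std::vector<double>& incoming,
            const std::vector<std::vector<double>>& residentProfiles) {
  int best = 0;
  double lowest = 2.0;   // cosine similarity is at most 1
  for (size_t g = 0; g < residentProfiles.size(); ++g) {
    double s = cosine(incoming, residentProfiles[g]);
    if (s < lowest) { lowest = s; best = static_cast<int>(g); }
  }
  return best;
}
```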


design automation conference | 2014

Exploring the Heterogeneous Design Space for both Performance and Reliability

Rafael Ubal; Dana Schaa; Perhaad Mistry; Xiang Gong; Yash Ukidave; Zhongliang Chen; Gunar Schirner; David R. Kaeli

As we move into a new era of heterogeneous multi-core systems, our ability to tune the performance and understand the reliability of both hardware and software becomes more challenging. Given the multiplicity of design trade-offs in hardware and software, and the rate at which new architectures and hardware/software features are introduced, it becomes difficult to properly model emerging heterogeneous platforms. In this paper we present a new methodology to address these challenges in a flexible and extensible framework. We describe the design of a framework that allows a range of heterogeneous devices to be evaluated against different performance/reliability criteria. We address heterogeneity both in hardware and software, providing a flexible framework that can be easily adapted and extended as new elements in the SoC stack continue to evolve. Our framework enables modeling at different levels of abstraction and interfaces to existing tools to compose hybrid modeling environments. We also consider the role of software, providing a flexible and modifiable compiler stack based on LLVM. We provide examples that highlight both the flexibility of this framework and demonstrate the utility of the tools.
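
To give a flavor of what a modifiable LLVM-based compiler stack permits, here is a minimal, hypothetical analysis pass (legacy pass-manager style) that reports per-function instruction counts, the kind of simple statistic a performance or reliability model might consume. It is not the paper's actual tooling.

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/Pass.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Hypothetical analysis pass: counts IR instructions per function.
struct KernelStats : public FunctionPass {
  static char ID;
  KernelStats() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    unsigned count = 0;
    for (BasicBlock &BB : F)
      count += BB.size();          // instructions in this basic block
    errs() << F.getName() << ": " << count << " IR instructions\n";
    return false;                  // analysis only; the IR is not modified
  }
};
} // namespace

char KernelStats::ID = 0;
static RegisterPass<KernelStats> X("kernel-stats",
                                   "Report per-function instruction counts");
```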


symposium on computer architecture and high performance computing | 2014

Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs

Yash Ukidave; Charu Kalra; David R. Kaeli; Perhaad Mistry; Dana Schaa

GPUs have gained tremendous popularity in a broad range of application domains. These applications possess varying grains of parallelism and place high demands on compute resources, many times imposing real-time constraints, requiring flexible work schedules, and relying on concurrent execution of multiple kernels on the device. These requirements present a number of challenges when targeting current GPUs. To support this class of applications, and to take full advantage of the large number of compute cores present on the GPU, we need a new mechanism to support concurrent execution and provide flexible mapping of compute kernels to the GPU. In this paper, we describe a new scheduling mechanism for dynamic spatial partitioning of the GPU, which adapts to the current execution state of compute workloads on the device. To enable this functionality, we extend the OpenCL runtime environment to map multiple command queues to a single device, effectively partitioning the device. The result is that kernels that can benefit from concurrent execution on a partitioned device can effectively utilize the full compute resources of the GPU. To accelerate next-generation workloads, we also support an inter-kernel communication mechanism that enables concurrent kernels to interact in a producer-consumer relationship. The proposed partitioning mechanism is evaluated using real-world applications taken from the signal and image processing, linear algebra, and data mining domains. For these performance-hungry applications we achieve a 3.1X speedup using a combination of the proposed scheduling scheme and inter-kernel communication, versus relying on the conventional GPU runtime.
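
The producer-consumer pattern this paper's inter-kernel communication targets can be approximated in stock OpenCL with cross-queue event dependencies, as sketched below: the consumer on one queue waits on an event signaled by the producer on another. The paper's runtime goes further, spatially partitioning compute units between the concurrent kernels, for which standard OpenCL has no portable API; error checks are omitted here.

```cpp
#include <CL/cl.h>

// Producer on qA, consumer on qB; both queues belong to the same context
// and device. The event dependency orders the two kernels across queues.
void pipeline(cl_command_queue qA, cl_command_queue qB,
              cl_kernel producer, cl_kernel consumer, size_t global) {
  cl_event produced;
  clEnqueueNDRangeKernel(qA, producer, 1, NULL, &global, NULL,
                         0, NULL, &produced);
  clEnqueueNDRangeKernel(qB, consumer, 1, NULL, &global, NULL,
                         1, &produced, NULL);   // runs only after 'produced'
  clFlush(qA);          // ensure the producer is submitted
  clFinish(qB);
  clReleaseEvent(produced);
}
```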


ieee international conference on high performance computing data and analytics | 2014

Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms

Yash Ukidave; Amir Kavyan Ziabari; Perhaad Mistry; Gunar Schirner; David R. Kaeli

Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source-code optimizations such as loop unrolling and tiling have been reported to yield large performance gains when applied to heterogeneous applications. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used in heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform implementations. Our study covers discrete GPUs, shared-memory GPUs (APUs) and low-power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge) and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors which most impact power consumption. We explore a range of application optimizations which increase power consumption by 27%, but deliver a more than 1.8X speedup. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations, and an 11% variation in energy consumption among the different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application while also affecting its power efficiency. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform vastly differently depending on the target hardware and associated application design.
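
A minimal sketch of the kind of optimization toggle such a study measures: a hypothetical OpenCL C kernel (as a C++ raw string) whose loop-unrolling hint is enabled by passing "-DUNROLL" in the build options to clBuildProgram, so the unrolled and baseline variants come from one source. Note that #pragma unroll is a widely supported vendor hint in OpenCL C compilers rather than a core guarantee.

```cpp
// Per-work-item partial dot product over a chunk of the input. Building
// with "-DUNROLL" asks the compiler to unroll the inner loop by 4; building
// without it keeps the rolled baseline for the power/performance comparison.
static const char* dotSrc = R"CLC(
kernel void dot_chunks(global const float* x, global const float* y,
                       global float* partial, int chunk) {
  int base = get_global_id(0) * chunk;
  float acc = 0.0f;
#ifdef UNROLL
  #pragma unroll 4
#endif
  for (int i = 0; i < chunk; ++i)
    acc += x[base + i] * y[base + i];
  partial[get_global_id(0)] = acc;
}
)CLC";
```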


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms

Yash Ukidave; David R. Kaeli

Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source-code optimizations such as loop unrolling and tiling have been reported to yield large performance gains when applied to heterogeneous applications. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used in heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different Fast Fourier Transform implementations. Our study covers discrete GPUs and shared-memory GPUs (APUs), and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler) and Intel (Ivy Bridge) as test platforms. The study identifies the architectural and algorithmic factors which most impact power consumption. We explore a range of application optimizations which increase power consumption by 27%, but result in a more than 1.8X speedup. We observe an 11% variation in energy consumption among the different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application while also affecting its power efficiency.


international workshop on opencl | 2015

Exploring the features of OpenCL 2.0

Saoni Mukherjee; Xiang Gong; Leiming Yu; Carter McCardwell; Yash Ukidave; Tuan Dao; Fanny Nina Paravecino; David R. Kaeli

The growth in demand for heterogeneous accelerators has stimulated the development of cutting-edge features in newer accelerators. Heterogeneous programming frameworks such as OpenCL have matured over the years and have introduced new software features for developers. We explore one of these programming frameworks, OpenCL 2.0. To drive our study, we consider a number of new features in OpenCL 2.0 using four popular applications from a range of computing domains, including signal processing, cybersecurity and machine learning. These applications are: 1) the AES-128 encryption standard, 2) Finite Impulse Response filtering, 3) Infinite Impulse Response filtering, and 4) a Hidden Markov model. In this work, we introduce the latest runtime features enabled in OpenCL 2.0, and discuss how well our sample applications can benefit from some of these features.
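
One of the headline OpenCL 2.0 runtime features is Shared Virtual Memory (SVM), where host and device address the same allocation and explicit buffer copies disappear. A minimal sketch of coarse-grained SVM, assuming the context, queue, and kernel were created against an OpenCL 2.0 device and omitting error checks:

```cpp
#include <CL/cl.h>

// Coarse-grained SVM: allocate once, map for host writes, pass the raw
// pointer to the kernel -- no clEnqueueWriteBuffer copies are needed.
void svm_demo(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
              size_t n) {
  float* data = (float*)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                   n * sizeof(float), 0);

  // Coarse-grained SVM requires explicit map/unmap around host access.
  clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                  n * sizeof(float), 0, NULL, NULL);
  for (size_t i = 0; i < n; ++i) data[i] = (float)i;
  clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

  clSetKernelArgSVMPointer(kernel, 0, data);   // pass the pointer directly
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
  clFinish(queue);

  clSVMFree(ctx, data);
}
```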


architectural support for programming languages and operating systems | 2014

Performance Evaluation and Optimization Mechanisms for Inter-operable Graphics and Computation on GPUs

Yash Ukidave; Xiang Gong; David R. Kaeli

Graphics Processing Units (GPUs) have gained recognition as the primary form of accelerator for graphics rendering in the gaming domain. They have also been widely accepted as the computing platform of choice in many scientific and high performance computing domains. The parallelism offered by GPUs is used for simultaneous processing of compute and graphics by applications belonging to a range of domains. The availability of programming standards such as OpenCL and OpenGL has been leveraged to achieve compute-graphics interoperability within the same application. However, given the increasing demands in both compute and graphics for emerging scientific visualization and immersive gaming applications, efficiency can degrade due to the continual switching between compute and graphics and the swapping in and out of their associated runtime environments. We need to better understand how to tune this interoperable environment so that compute and graphics run both efficiently and simultaneously; presently, each of these domains is evaluated in isolation. In this paper, we evaluate the performance and efficiency of the OpenCL-OpenGL (CL-GL) interoperability (interop) mode. We explore different methods to improve the execution performance of CL-GL interop-based applications, and propose a slot-based rendering mechanism for CL-GL interop to increase application efficiency. To evaluate CL-GL and our slot-based scheme, we study five scientific applications using OpenCL and OpenGL for compute and graphics rendering. Our study covers two AMD Radeon discrete GPUs and one shared-memory AMD APU as test platforms. We demonstrate that leveraging the CL-GL interop interface results in a 2.2X performance increase, and that our slot-based rendering provides a 60% increase in performance by delivering a 24% improvement in L2 cache hit rate on GPUs and APUs.
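
The core CL-GL interop handshake the paper builds on looks roughly like the sketch below: an OpenGL vertex buffer is wrapped as an OpenCL buffer, acquired before compute, and released back before rendering. This shows only the standard acquire/release protocol, not the paper's slot-based rendering scheme; it assumes the cl_context was created with GL-sharing properties and that vbo is a valid GL buffer, with error checks omitted.

```cpp
#include <CL/cl.h>
#include <CL/cl_gl.h>

// One compute step over a GL vertex buffer. The caller should ensure GL
// commands touching the VBO have completed (e.g., glFinish) before entry.
void step(cl_context ctx, cl_command_queue q, cl_kernel k,
          unsigned vbo, size_t nVerts) {
  cl_int err;
  cl_mem buf = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, vbo, &err);

  clEnqueueAcquireGLObjects(q, 1, &buf, 0, NULL, NULL);  // CL takes ownership
  clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
  clEnqueueNDRangeKernel(q, k, 1, NULL, &nVerts, NULL, 0, NULL, NULL);
  clEnqueueReleaseGLObjects(q, 1, &buf, 0, NULL, NULL);  // hand back to GL
  clFinish(q);            // updated buffer is now safe for GL rendering

  clReleaseMemObject(buf);
}
```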

Collaboration


Dive into Yash Ukidave's collaborations.

Top Co-Authors

Dana Schaa (Northeastern University)
Xiang Gong (Northeastern University)
Amir Momeni (Northeastern University)
Charu Kalra (Northeastern University)
Leiming Yu (Northeastern University)