Vignesh T. Ravi
Ohio State University
Publications
Featured research published by Vignesh T. Ravi.
international conference on supercomputing | 2010
Vignesh T. Ravi; Wenjing Ma; David Chiu; Gagan Agrawal
A trend that has materialized, and attracted much attention, is the rise of increasingly heterogeneous computing platforms. Presently, it is very common for a desktop or notebook computer to come equipped with both a multi-core CPU and a GPU. Harnessing the full computational power of such architectures (i.e., by simultaneously exploiting both the multi-core CPU and the GPU) starting from a high-level API is a critical challenge. We believe it is highly desirable to offer programmers a simple way to realize the full potential of today's heterogeneous machines. This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting with simple C functions with added annotations, we automatically generate the middleware API code for the multi-core CPU, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that, by effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance than using only the GPU or only the multi-core CPU. In k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version, and a performance gain of about 60% over the faster of the CPU-only and GPU-only executions. In PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version, and a gain of about 63.8% over the faster of the CPU-only and GPU-only versions.
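A minimal sketch of the kind of dynamic work partitioning the paper describes: CPU worker threads and a GPU proxy thread pull differently sized chunks from a shared work queue. The chunk sizes and the single atomic counter are illustrative assumptions, not the authors' implementation, and the GPU is stood in for by one host thread that claims larger chunks.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<size_t> next_item{0};     // shared work pointer
constexpr size_t kTotal = 1'000'000;         // total reduction elements
constexpr size_t kCpuChunk = 4'096;          // small chunks: fine-grained CPU balance
constexpr size_t kGpuChunk = 65'536;         // large chunks: amortize GPU transfer cost

// Each worker claims [begin, end) chunks until the input is exhausted.
void worker(size_t chunk, std::atomic<size_t>* items_done) {
  for (;;) {
    size_t begin = next_item.fetch_add(chunk);
    if (begin >= kTotal) break;
    size_t end = std::min(begin + chunk, kTotal);
    // ... run the generalized reduction over items [begin, end) here ...
    items_done->fetch_add(end - begin);
  }
}

int main() {
  std::atomic<size_t> cpu_done{0}, gpu_done{0};
  std::vector<std::thread> pool;
  for (int i = 0; i < 8; ++i)                // 8 CPU cores, as in the evaluation
    pool.emplace_back(worker, kCpuChunk, &cpu_done);
  pool.emplace_back(worker, kGpuChunk, &gpu_done);  // proxy thread feeding the GPU
  for (auto& t : pool) t.join();
  std::printf("cpu: %zu items, gpu: %zu items\n", cpu_done.load(), gpu_done.load());
}
```

Because every worker draws from the same counter, a faster device automatically claims more of the input, which is the essence of dynamic (rather than static) partitioning.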
high performance distributed computing | 2011
Vignesh T. Ravi; Michela Becchi; Gagan Agrawal; Srimat T. Chakradhar
Driven by the emergence of GPUs as a major player in high performance computing and the rapidly growing popularity of cloud environments, GPU instances are now being offered by cloud providers. The use of GPUs in a cloud environment, however, is still at an initial stage, and the challenge of making the GPU a true shared resource in the cloud has not yet been addressed. This paper presents a framework that enables applications executing within virtual machines to transparently share one or more GPUs. Our contributions are twofold: we extend an open source GPU virtualization software to include efficient GPU sharing, and we propose solutions to the conceptual problem of GPU kernel consolidation. In particular, we introduce a method for computing the affinity score between two or more kernels, which provides an indication of the potential performance improvement from kernel consolidation. In addition, we explore molding as a means to achieve efficient GPU sharing even for kernels with high or conflicting resource requirements. We use these concepts to develop an algorithm that efficiently maps a set of kernels onto a pair of GPUs. We extensively evaluate our framework using eight popular GPU kernels and two Fermi GPUs. We find that even when contention is high our consolidation algorithm is effective in improving throughput, and that the runtime overhead of our framework is low.
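The affinity-score idea can be illustrated with a toy model. The scoring formula below (rewarding pairs whose combined compute and memory demands land near, but not far above, GPU capacity) and the greedy pairing are assumptions for exposition; the paper's actual metric and mapping algorithm differ.

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

struct Kernel {
  std::string name;
  double compute_util;   // fraction of SM capacity the kernel keeps busy
  double memory_util;    // fraction of memory bandwidth it consumes
};

// Hypothetical affinity: consolidation is attractive when the pair's combined
// demand on each resource lands near capacity, neither idle nor thrashing.
double affinity(const Kernel& a, const Kernel& b) {
  double c = a.compute_util + b.compute_util;
  double m = a.memory_util + b.memory_util;
  return 2.0 - std::fabs(1.0 - c) - std::fabs(1.0 - m);
}

int main() {
  std::vector<Kernel> ks = {{"bfs", 0.3, 0.9}, {"mm", 0.9, 0.4},
                            {"scan", 0.4, 0.7}, {"nn", 0.8, 0.2}};
  // Greedy mapping onto two GPUs: take the best-affinity pair first, then
  // consolidate whatever remains on the second GPU.
  size_t bi = 0, bj = 1;
  for (size_t i = 0; i < ks.size(); ++i)
    for (size_t j = i + 1; j < ks.size(); ++j)
      if (affinity(ks[i], ks[j]) > affinity(ks[bi], ks[bj])) { bi = i; bj = j; }
  std::printf("GPU0: %s + %s (affinity %.2f)\n",
              ks[bi].name.c_str(), ks[bj].name.c_str(), affinity(ks[bi], ks[bj]));
  for (size_t i = 0; i < ks.size(); ++i)
    if (i != bi && i != bj) std::printf("GPU1: %s\n", ks[i].name.c_str());
}
```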
grid computing | 2010
Wei Jiang; Vignesh T. Ravi; Gagan Agrawal
The map-reduce framework has received significant attention and is being used for programming both large-scale clusters and multi-core systems. While the high productivity of map-reduce has been well accepted, it is not clear whether the API results in efficient implementations for different subclasses of data-intensive applications. In this paper, we present a system, MATE (Map-reduce with an Alternate API), that provides a high-level but distinct API. In particular, our API includes a programmer-managed reduction object, which results in lower memory requirements at runtime for many data-intensive applications. MATE implements this API on top of the Phoenix system, a multi-core map-reduce implementation from Stanford. We evaluate our system using three data mining applications and compare its performance to that of both Phoenix and Hadoop. Our results show that for all three applications, MATE outperforms Phoenix and Hadoop. While achieving good scalability, MATE also maintains the easy-to-use API of map-reduce. Overall, we argue that our approach, based on the generalized reduction structure, provides an alternate high-level API that leads to more efficient and scalable implementations.
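The reduction-object idea can be sketched as follows: each thread folds input elements directly into a private copy of the reduction object, and the copies are merged at the end, so no intermediate (key, value) pairs are ever materialized. The Histogram type and its methods are hypothetical stand-ins, not MATE's real API.

```cpp
#include <array>
#include <cstdio>
#include <thread>
#include <vector>

// The reduction object persists across all input elements; its size is fixed
// regardless of input size, which is where the memory savings come from.
struct Histogram {
  std::array<long, 16> bins{};
  void accumulate(int element) { ++bins[element % 16]; }   // per-element update
  void merge(const Histogram& other) {                     // combine thread copies
    for (size_t i = 0; i < bins.size(); ++i) bins[i] += other.bins[i];
  }
};

int main() {
  std::vector<int> input(1 << 20);
  for (size_t i = 0; i < input.size(); ++i) input[i] = static_cast<int>(i * 7);

  constexpr int kThreads = 4;
  std::vector<Histogram> local(kThreads);      // one reduction object per thread
  std::vector<std::thread> pool;
  size_t chunk = input.size() / kThreads;
  for (int t = 0; t < kThreads; ++t)
    pool.emplace_back([&, t] {
      for (size_t i = t * chunk; i < (t + 1) * chunk; ++i)
        local[t].accumulate(input[i]);         // reduce in place, no pairs emitted
    });
  for (auto& th : pool) th.join();

  Histogram global;
  for (const Histogram& h : local) global.merge(h);   // final combination step
  std::printf("bin[0] = %ld\n", global.bins[0]);
}
```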
high performance distributed computing | 2012
Michela Becchi; Kittisak Sajjapongse; Ian Graves; Adam M. Procter; Vignesh T. Ravi; Srimat T. Chakradhar
Graphics Processing Units (GPUs) are increasingly becoming part of HPC clusters. Nevertheless, cloud computing services and resource management frameworks targeting heterogeneous clusters that include GPUs are still in their infancy. Further, GPU software stacks (e.g., the CUDA driver and runtime) currently provide very limited support for concurrency. In this paper, we propose a runtime system that provides abstraction and sharing of GPUs while allowing isolation of concurrent applications. A central component of our runtime is a memory manager that provides a virtual memory abstraction to the applications. Our runtime is flexible in terms of scheduling policies and allows dynamic (as opposed to programmer-defined) binding of applications to GPUs. In addition, our framework supports dynamic load balancing, dynamic upgrade and downgrade of GPUs, and is resilient to their failures. Our runtime can be deployed in combination with VM-based cloud computing services to allow virtualization of heterogeneous clusters, or in combination with HPC cluster resource managers to form an integrated resource management infrastructure for heterogeneous clusters. Experiments conducted on a three-node cluster show that our GPU sharing scheme allows up to a 28% and a 50% performance improvement over serialized execution on short- and long-running jobs, respectively. Further, dynamic inter-node load balancing leads to an additional 18-20% performance benefit.
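A host-side mock can illustrate the virtual memory abstraction: applications hold opaque handles, and the runtime binds a handle to actual device memory only at kernel dispatch, spilling it back under pressure so the application can later be re-bound to a different GPU. The class and its methods are assumptions for illustration (malloc stands in for cudaMalloc); this is not the paper's code.

```cpp
#include <cstdio>
#include <cstdlib>
#include <unordered_map>
#include <vector>

enum class Residency { Host, Device };
struct Buffer {
  std::vector<char> host;     // staging copy owned by the runtime
  void* device = nullptr;     // would be a cudaMalloc'd pointer in reality
  Residency where = Residency::Host;
};

class MemoryManager {
  std::unordered_map<int, Buffer> table_;
  int next_id_ = 0;
 public:
  int alloc(size_t bytes) {
    table_[next_id_].host.resize(bytes);
    return next_id_++;        // the application only ever sees this handle
  }
  // Called at kernel dispatch: bind the buffer to whichever GPU the scheduler
  // picked. Real code would select the device and copy host -> device.
  void* bind(int id) {
    Buffer& b = table_.at(id);
    if (b.where == Residency::Host) {
      b.device = std::malloc(b.host.size());
      b.where = Residency::Device;
    }
    return b.device;
  }
  // Under memory pressure, GPU downgrade, or failure: spill back to staging
  // (a real runtime would copy device -> host first) so re-binding is possible.
  void evict(int id) {
    Buffer& b = table_.at(id);
    if (b.where == Residency::Device) {
      std::free(b.device);
      b.device = nullptr;
      b.where = Residency::Host;
    }
  }
};

int main() {
  MemoryManager mm;
  int h = mm.alloc(1 << 20);
  void* d = mm.bind(h);       // lazy binding at first dispatch
  std::printf("handle %d bound at %p\n", h, d);
  mm.evict(h);                // e.g., before migrating the job to another GPU
}
```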
acm sigplan symposium on principles and practice of parallel programming | 2011
Mai Zheng; Vignesh T. Ravi; Feng Qin; Gagan Agrawal
In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs; 2) lacking scalability due to the state explosion problem; 3) reporting many false positives because of simplified modeling; and/or 4) incurring prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of the thread scheduling and execution model of the underlying GPU, GRace can accurately detect data races with no false positives. Based on these ideas, we have built a prototype of GRace with two schemes, GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis. We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have compared them with an existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace are effective in detecting all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. GRace-addr incurs low runtime overhead (22-116%) and low space overhead (9-18MB) for the evaluated kernels, while GRace-stmt offers more help in diagnosing data races at larger overhead.
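The dynamic half of such a checker can be sketched as a post-barrier scan of a per-interval access log, flagging same-address accesses from different threads where at least one is a write. The log layout and the host-side check below are simplifications for exposition; GRace's real logs live in the GPU memory hierarchy and its analysis is far more efficient.

```cpp
#include <cstdio>
#include <vector>

// One logged shared-memory access: which thread, which address, read or
// write, and the source statement that issued it (for diagnosis).
struct Access { int tid; unsigned addr; bool write; int stmt; };

// Pairwise check over the accesses logged between two barriers: a race needs
// different threads, the same address, and at least one write.
void check_races(const std::vector<Access>& log) {
  for (size_t i = 0; i < log.size(); ++i)
    for (size_t j = i + 1; j < log.size(); ++j) {
      const Access &a = log[i], &b = log[j];
      if (a.tid != b.tid && a.addr == b.addr && (a.write || b.write))
        std::printf("race: threads %d/%d at addr %#x (stmts %d, %d)\n",
                    a.tid, b.tid, a.addr, a.stmt, b.stmt);
    }
}

int main() {
  // Two threads touch shared word 0x40 without synchronization; one writes.
  std::vector<Access> log = {{0, 0x40, true, 12},
                             {1, 0x40, false, 17},
                             {1, 0x44, true, 18}};
  check_races(log);
}
```

The static-analysis step the paper describes would shrink this log by instrumenting only statements that can actually conflict.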
cluster computing and the grid | 2012
Vignesh T. Ravi; Michela Becchi; Wei Jiang; Gagan Agrawal; Srimat T. Chakradhar
Heterogeneous architectures comprising a multi-core CPU and many-core GPU(s) are increasingly being used within cluster and cloud environments. In this paper, we study the problem of optimizing the overall throughput of a set of applications deployed on a cluster of such heterogeneous nodes. We consider two different scheduling formulations. In the first formulation, we consider jobs that can be executed on either the GPU or the CPU of a single node. In the second formulation, we consider jobs that can be executed on the CPU, the GPU, or both, of any number of nodes in the system. We have developed scheduling schemes addressing both problems. In our evaluation, we first show that the schemes proposed for the first formulation outperform a blind round-robin scheduler and approximate the performance of an ideal scheduler that involves an impractical exhaustive exploration of all possible schedules. Next, we show that the scheme proposed for the second formulation outperforms the best of the existing schemes for heterogeneous clusters, TORQUE and MCT, by up to 42%.
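For intuition, the first (single-node) formulation can be contrasted with a simple minimum-completion-time baseline: each job is placed on whichever resource would finish it soonest given the current backlog. The job mix and timing numbers below are invented, and this MCT-style heuristic is a baseline of the kind the paper compares against, not the proposed scheme.

```cpp
#include <cstdio>
#include <vector>

// A job with its (estimated) execution time on each resource.
struct Job { const char* name; double cpu_time, gpu_time; };

int main() {
  std::vector<Job> jobs = {{"kmeans", 40, 5}, {"pca", 30, 8},
                           {"sort", 10, 9}, {"scan", 6, 12}};
  double cpu_free = 0, gpu_free = 0;   // when each resource next becomes idle
  for (const Job& j : jobs) {
    double on_cpu = cpu_free + j.cpu_time;   // completion time if run on CPU
    double on_gpu = gpu_free + j.gpu_time;   // completion time if run on GPU
    if (on_cpu <= on_gpu) {
      cpu_free = on_cpu;
      std::printf("%-6s -> CPU (done at %.0f)\n", j.name, on_cpu);
    } else {
      gpu_free = on_gpu;
      std::printf("%-6s -> GPU (done at %.0f)\n", j.name, on_gpu);
    }
  }
  std::printf("makespan: %.0f\n", cpu_free > gpu_free ? cpu_free : gpu_free);
}
```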
brazilian conference on intelligent systems | 2014
Peter E. Bailey; David K. Lowenthal; Vignesh T. Ravi; Barry Rountree; Martin Schulz; Bronis R. de Supinski
As power becomes an increasingly important design factor in high-end supercomputers, future systems will likely operate with power limitations significantly below their peak power specifications. These limitations will be enforced through a combination of software and hardware power policies, which will filter down from the system level to individual nodes. Hardware is already moving in this direction by providing power-capping interfaces to the user. The power/performance trade-off at the node level is critical to maximizing the performance of power-constrained cluster systems, but it is also complex because of the many interacting architectural features and accelerators that comprise the hardware configuration of a node. The key to solving this challenge is an accurate power/performance model that aids in selecting the right configuration from a large set of available configurations. In this paper, we present a novel approach that generates such a model offline using kernel clustering and multivariate linear regression. Our model requires only two iterations to select a configuration, which provides a significant advantage over exhaustive search-based strategies. We apply our model to predict power and performance for different applications using arbitrary configurations, and show that our model, when used with hardware frequency-limiting, selects configurations with significantly higher performance at a given power limit than those chosen by frequency-limiting alone. When applied to a set of 36 computational kernels from a range of applications, our model accurately predicts power and performance: it maintains 91% of optimal performance while meeting power constraints 88% of the time. When the model violates a power constraint, it exceeds the constraint by only 6% in the average case, while simultaneously achieving 54% more performance than an oracle.
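The offline modeling step can be illustrated with an ordinary least-squares fit: gather (configuration, runtime) samples, solve the normal equations, and predict an unseen configuration. The feature set (intercept, core count, frequency) and the synthetic data are assumptions for illustration; the paper's model additionally incorporates kernel clustering.

```cpp
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Solve A x = b with Gaussian elimination and partial pivoting.
Vec solve(Mat A, Vec b) {
  int n = static_cast<int>(A.size());
  for (int k = 0; k < n; ++k) {
    int p = k;
    for (int i = k + 1; i < n; ++i)
      if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
    std::swap(A[k], A[p]); std::swap(b[k], b[p]);
    for (int i = k + 1; i < n; ++i) {
      double f = A[i][k] / A[k][k];
      for (int j = k; j < n; ++j) A[i][j] -= f * A[k][j];
      b[i] -= f * b[k];
    }
  }
  Vec x(n);
  for (int i = n - 1; i >= 0; --i) {
    x[i] = b[i];
    for (int j = i + 1; j < n; ++j) x[i] -= A[i][j] * x[j];
    x[i] /= A[i][i];
  }
  return x;
}

int main() {
  // Training samples: features are (1, cores, GHz); target is runtime in s.
  Mat X = {{1, 2, 1.2}, {1, 4, 1.2}, {1, 8, 2.0}, {1, 16, 2.0}, {1, 8, 1.2}};
  Vec y = {40.0, 24.0, 10.0, 7.0, 14.0};

  // Normal equations: (X^T X) w = X^T y.
  int d = static_cast<int>(X[0].size());
  Mat XtX(d, Vec(d, 0.0));
  Vec Xty(d, 0.0);
  for (size_t r = 0; r < X.size(); ++r)
    for (int i = 0; i < d; ++i) {
      Xty[i] += X[r][i] * y[r];
      for (int j = 0; j < d; ++j) XtX[i][j] += X[r][i] * X[r][j];
    }
  Vec w = solve(XtX, Xty);

  // Predict an unseen configuration: 12 cores at 1.6 GHz.
  double pred = w[0] + w[1] * 12 + w[2] * 1.6;
  std::printf("w = (%.2f, %.2f, %.2f), predicted runtime: %.1fs\n",
              w[0], w[1], w[2], pred);
}
```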
ieee international conference on high performance computing, data, and analytics | 2011
Vignesh T. Ravi; Gagan Agrawal
A trend that has materialized, and attracted much attention, is the rise of increasingly heterogeneous computing platforms. Recently, it has become very common for a desktop or notebook computer to be equipped with both a multi-core CPU and a GPU. Application development that exploits the aggregate computing power of such an environment is a major challenge today. In particular, we need dynamic work distribution schemes that adapt to different computation and communication patterns in applications, and to various heterogeneous configurations. This paper describes a general dynamic scheduling framework for mapping applications with different communication patterns to heterogeneous architectures. We first make key observations about the architectural trade-offs among heterogeneous resources and the communication pattern of an application, and then infer constraints for the dynamic scheduler. We then present a novel cost model for choosing the optimal chunk size in a heterogeneous configuration. Finally, based on the general framework and cost model, we provide optimized work distribution schemes to further improve performance.
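The chunk-size trade-off behind such a cost model can be illustrated with a toy formula: larger chunks amortize per-chunk dispatch and transfer overhead, but increase end-of-run load imbalance. The formula and constants below are illustrative assumptions, not the paper's cost model.

```cpp
#include <cstdio>

int main() {
  // Illustrative model parameters (not from the paper):
  const double N = 1e6;          // total work items
  const double t_item = 1e-6;    // seconds per item on the slower device
  const double overhead = 2e-4;  // per-chunk dispatch + transfer cost

  double best_c = 0, best_t = 1e30;
  for (double c = 256; c <= 262144; c *= 2) {
    // modeled time = fixed compute + per-chunk overhead + tail imbalance:
    // the final chunk can leave one device idle for up to c * t_item seconds.
    double t = N * t_item + (N / c) * overhead + c * t_item;
    std::printf("chunk %8.0f -> modeled time %.4fs\n", c, t);
    if (t < best_t) { best_t = t; best_c = c; }
  }
  std::printf("model picks chunk size %.0f\n", best_c);
}
```

Minimizing the last two terms analytically gives the familiar square-root form c* = sqrt(N * overhead / t_item), about 14k items here, which the power-of-two scan approximates.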
ieee international conference on high performance computing, data, and analytics | 2011
Xin Huo; Vignesh T. Ravi; Gagan Agrawal
Heterogeneous architectures are playing a significant role in High Performance Computing (HPC) today, with the popularity of accelerators like GPUs and the new trend towards the integration of CPUs and GPUs. Developing applications that can effectively use these architectures is a major challenge. In this paper, we focus on one of the dwarfs in the Berkeley view on parallel computing: irregular applications arising from unstructured grids. We consider the problem of executing the irregular reductions in these applications on heterogeneous architectures comprising a multi-core CPU and a GPU. We have developed a Multi-level Partitioning Framework with the following features: 1) it supports GPU execution of irregular reductions even when the dataset size exceeds the size of the device memory, 2) it enables pipelining of the partitioning performed on the CPU with the computations on the GPU, and 3) it supports dynamic distribution of work between the multi-core CPU and the GPU. Our extensive evaluation using two different irregular applications demonstrates the effectiveness of our approach.
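The first level of such partitioning can be sketched as bucketing the edges of an unstructured grid by blocks of the reduction space (the nodes), so each partition only updates a node block small enough to fit in device memory; edges spanning two blocks are replicated into both. The block size and mesh below are invented for illustration.

```cpp
#include <cstdio>
#include <vector>

struct Edge { int u, v; };   // an edge updates the reduction values of u and v

int main() {
  constexpr int kNodes = 1000, kBlock = 250;   // 4 partitions of the node space
  std::vector<Edge> edges;
  for (int i = 0; i < kNodes - 1; ++i) edges.push_back({i, i + 1});

  // Bucket each edge by the node block(s) it updates; edges spanning two
  // blocks are replicated so each partition is self-contained on the GPU.
  std::vector<std::vector<Edge>> part(kNodes / kBlock);
  for (const Edge& e : edges) {
    part[e.u / kBlock].push_back(e);
    if (e.v / kBlock != e.u / kBlock) part[e.v / kBlock].push_back(e);
  }

  // In the pipelined scheme, the CPU would prepare partition p+1 here while
  // the GPU runs the irregular reduction over partition p.
  for (size_t p = 0; p < part.size(); ++p)
    std::printf("partition %zu: %zu edges\n", p, part[p].size());
}
```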
international conference on cluster computing | 2009
Wei Jiang; Vignesh T. Ravi; Gagan Agrawal
Map-reduce has been a topic of much interest in the last 2–3 years. While it is well accepted that the map-reduce APIs enable significantly easier programming, the performance aspects of using map-reduce are less well understood. This paper compares the map-reduce paradigm with a system developed earlier at Ohio State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). The API and the functionality offered by FREERIDE have many similarities with the map-reduce API, though there are some differences. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with FREERIDE. For our study, we have taken three data mining algorithms: k-means clustering, apriori association mining, and k-nearest neighbor search. We have also included a simple data scanning application, word-count. The main observations from our results are as follows. For the three data mining applications we have considered, FREERIDE outperforms Hadoop by a factor of 5 or more. For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop improves. Overall, Hadoop has significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear well suited for data mining computations on modest-sized datasets.
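The API difference at the heart of the comparison can be made concrete with word-count written both ways: the map-reduce version materializes and groups intermediate (word, 1) pairs, while a generalized-reduction version in the style of FREERIDE updates the reduction object directly. Both functions below are illustrative sketches, not code from either system.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Words = std::vector<std::string>;

// Map-reduce style: map emits (word, 1) pairs that the runtime must buffer,
// sort, and group before reduce runs; this is the overhead the paper measures.
std::map<std::string, int> mapreduce_count(const Words& in) {
  std::vector<std::pair<std::string, int>> pairs;
  for (const auto& w : in) pairs.emplace_back(w, 1);      // map phase
  std::map<std::string, int> out;
  for (const auto& p : pairs) out[p.first] += p.second;   // shuffle + reduce
  return out;
}

// Generalized-reduction style: each word updates the reduction object in
// place, so no intermediate pairs exist and memory stays bounded by the
// number of distinct words.
std::map<std::string, int> reduction_count(const Words& in) {
  std::map<std::string, int> obj;
  for (const auto& w : in) ++obj[w];
  return obj;
}

int main() {
  Words in = {"gpu", "cpu", "gpu", "cluster", "gpu"};
  auto a = mapreduce_count(in), b = reduction_count(in);
  std::printf("gpu: %d == %d\n", a["gpu"], b["gpu"]);
}
```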