Publication


Featured research published by Vinod Grover.


workshop on declarative aspects of multicore programming | 2011

Accelerating Haskell array codes with multicore GPUs

Manuel M. T. Chakravarty; Gabriele Keller; Sean Lee; Trevor L. McDonell; Vinod Grover

Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations. We embed this purely functional array language in Haskell with an online code generator for NVIDIA's CUDA GPGPU programming environment. We regard the embedded language's collective array operations as algorithmic skeletons; our code generator instantiates CUDA implementations of those skeletons to execute embedded array programs. This paper outlines our embedding in Haskell, details the design and implementation of the dynamic code generator, and reports on initial benchmark results. These results suggest that we can compete with moderately optimised native CUDA code, while enabling much simpler source programs.
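The code generator instantiates CUDA implementations of algorithmic skeletons. As a rough illustration (not the paper's actual generated code), a "map" skeleton of the kind such a generator fills in might look like this, with the placeholder element function f standing in for code derived from the embedded Haskell program:

    // Hypothetical CUDA "map" skeleton; f is a placeholder for the element
    // function the code generator would derive from the embedded array program.
    __device__ float f(float x) { return x * x; }

    __global__ void mapSkeleton(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid may overshoot the array
            out[i] = f(in[i]);
    }

The skeleton fixes the parallel decomposition once; only the element function changes from one collective operation to the next.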


symposium on code generation and optimization | 2010

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

John A. Stratton; Vinod Grover; Jaydeep Marathe; Bastiaan Aarts; Michael Murphy; Ziang Hu; Wen-mei W. Hwu

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. Applications tested with our tool often showed performance parity with the compiled C version of the application for single-thread performance. With modest coarse-grained multithreading typical of today's CPU architectures, an average of 3.4x speedup on 4 processors was observed across the test applications.
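A hedged sketch of the central idea, assuming a kernel with no barrier synchronization: the fine-grained SPMD kernel (top) can be lowered to a serial loop over its logical threads (bottom), which is roughly the baseline such a compiler then optimizes:

    // Fine-grained SPMD form: one logical thread per element.
    __global__ void scaleKernel(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = s * a[i];
    }

    // CPU lowering: one loop iteration per logical thread. A real compiler
    // must also split these loops at __syncthreads() barriers; this sketch
    // assumes a synchronization-free kernel.
    void scaleOnCpu(float *a, float s, int n, int gridDim, int blockDim) {
        for (int blk = 0; blk < gridDim; ++blk)
            for (int t = 0; t < blockDim; ++t) {
                int i = blk * blockDim + t;
                if (i < n) a[i] = s * a[i];
            }
    }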


symposium on code generation and optimization | 2013

Convergence and scalarization for data-parallel architectures

Krste Asanovic; Stephen W. Keckler; Yunsup Lee; Ronny Meir Krashinsky; Vinod Grover

Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
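For illustration (an invented kernel, not one from the paper), the kind of distinction the convergence and variance analyses draw:

    // "scale" is thread-invariant: every thread computes the same value, so a
    // scalarizing compiler can hoist it into a single scalar operation.
    // "i" is thread-variant: it depends on threadIdx and must stay per-thread.
    __global__ void saxpy(const float *x, float *y, float a, int n) {
        float scale = a * 2.0f;                         // uniform across threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // varies per thread
        if (i < n)
            y[i] = scale * x[i] + y[i];
    }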


european conference on computer systems | 2008

Samurai: protecting critical data in unsafe languages

Karthik Pattabiraman; Vinod Grover; Benjamin G. Zorn

Programs written in type-unsafe languages such as C and C++ incur costly memory errors that result in corrupted data structures, program crashes, and incorrect results. We present a data-centric solution to memory corruption called critical memory, a memory model that allows programmers to identify and protect data that is critical for correct program execution. Critical memory defines operations to consistently read and update critical data, and ensures that other non-critical updates in the program will not corrupt it. We also present Samurai, a runtime system that implements critical memory in software. Samurai uses replication and forward error correction to provide probabilistic guarantees of critical memory semantics. Because Samurai does not modify memory operations on non-critical data, the majority of memory operations in programs run at full speed, and Samurai is compatible with third party libraries. Using both applications, including a Web server, and libraries (an STL list class and a memory allocator), we evaluate the performance overhead and fault tolerance that Samurai provides. We find that Samurai is a useful and practical approach for the majority of the applications and libraries considered.
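A minimal sketch of the replication idea, with invented names (this is not Samurai's actual interface): critical data is stored in multiple copies, and reads vote among the replicas, so a stray write from non-critical code that corrupts one copy can be masked:

    // Hypothetical critical-memory cell: triplicated storage with
    // majority-vote reads, giving probabilistic protection against a
    // single corrupted copy.
    struct CriticalInt {
        int replica[3];
        void store(int v) { replica[0] = replica[1] = replica[2] = v; }
        int load() const {
            // If at most one replica was corrupted, two still agree.
            if (replica[0] == replica[1] || replica[0] == replica[2])
                return replica[0];
            return replica[1];   // replica[0] is the odd one out
        }
    };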


acm sigplan international workshop on libraries, languages, and compilers for array programming | 2014

NOVA: A Functional Language for Data Parallelism

Alexander Collins; Dominik Grewe; Vinod Grover; Sean Lee; Adriana Maria Susnea

Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit parallelism available in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional imperative languages. This makes them uniquely suited for generation of efficient target code for parallel systems, such as multiple Central Processing Units (CPUs) or highly data-parallel Graphics Processing Units (GPUs). Such systems are becoming the mainstream for scientific and commodity desktop computing. Writing performance portable code for such systems using low-level languages requires significant effort from a human expert. This paper presents NOVA, a functional language and compiler for multi-core CPUs and GPUs. The NOVA language is a polymorphic, statically-typed functional language with a suite of higher-order functions which are used to express parallelism. These include map, reduce and scan. The NOVA compiler is a lightweight, yet powerful, optimizing compiler. It generates code for a variety of target platforms that achieve performance comparable to competing languages and tools, including hand-optimized code. The NOVA compiler is stand-alone and can be easily used as a target for higher-level or domain-specific languages or embedded in other applications. We evaluate NOVA against two competing approaches: the Thrust library and hand-written CUDA C. NOVA achieves comparable performance to these approaches across a range of benchmarks. NOVA-generated code also scales linearly with the number of processor cores across all compute-bound benchmarks.
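For context, the Thrust library that NOVA is evaluated against expresses the same map-plus-reduce pattern like this (real Thrust API; the example itself is ours, not from the paper):

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/transform_reduce.h>

    // map (x -> x*x) fused with reduce (+) in a single pass over device memory
    float sumOfSquares(const thrust::device_vector<float> &v) {
        return thrust::transform_reduce(v.begin(), v.end(),
                                        thrust::square<float>(),  // map
                                        0.0f,                     // identity
                                        thrust::plus<float>());   // reduce
    }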


architectural support for programming languages and operating systems | 2012

JaBEE: framework for object-oriented Java bytecode compilation and execution on graphics processor units

Wojciech Zaremba; Yuan Lin; Vinod Grover

There is an increasing interest from software developers in executing Java and .NET bytecode programs on General Purpose Graphics Processor Units (GPGPUs). Existing solutions have limited support for operations on objects and often require explicit handling of memory transfers between CPU and GPU. In this paper, we describe a Java Bytecode Execution Environment (JaBEE) which supports common object-oriented constructs such as dynamic dispatch, encapsulation and object creation on GPUs. This experimental environment facilitates GPU code compilation, execution and transparent memory management. We compare the performance of our approach with CPU-based and CUDA-C-based code executions of the same programs. We discuss challenges, limitations and opportunities of bytecode execution on GPGPUs.
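CUDA C++ itself supports virtual dispatch only on objects created in device code (a host-constructed object carries a host-side vtable), which hints at why translating Java's pervasive dynamic dispatch is nontrivial. A minimal sketch of the construct in question (ours, not JaBEE output):

    struct Shape { __device__ virtual float area() const = 0; };
    struct Square : Shape {
        float s;
        __device__ Square(float side) : s(side) {}
        __device__ float area() const override { return s * s; }
    };

    __global__ void areas(float *out, float side) {
        Square sq(side);               // object created on the device, so its
        Shape *p = &sq;                // vtable pointer is valid in this kernel
        out[threadIdx.x] = p->area();  // dynamic dispatch on the GPU
    }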


international symposium on microarchitecture | 2014

Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures

Yunsup Lee; Vinod Grover; Ronny Meir Krashinsky; Mark Stephenson; Stephen W. Keckler; Krste Asanovic

Data-parallel architectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide data paths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches where divergence is explicitly managed by the compiler, to hardware solutions where divergence is managed implicitly by the microarchitecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallel architectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance to one with hardware support for tracking divergence.
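An illustrative sketch (an invented kernel; the predicated form in the comments is pseudo-assembly, not the paper's actual ISA) of flattening a divergent branch into predicated straight-line code:

    __global__ void clampNegatives(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && a[i] < 0.0f)    // divergent branch in the source...
            a[i] = 0.0f;
        // ...which a predicating compiler can flatten, roughly:
        //     p = (i < n) && (a[i] < 0.0f)   ; per-thread predicate
        //  @p st.f32 [a + i], 0.0f           ; store fires only where p is set
        // The paper's analyses additionally detect when p is common across
        // threads and exploit that commonality.
    }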


acm sigplan symposium on principles and practice of parallel programming | 2015

Forma: a DSL for image processing applications to target GPUs and multi-core CPUs

Mahesh Ravishankar; Justin Holewinski; Vinod Grover

As architectures evolve, optimization techniques to obtain good performance evolve as well. Using low-level programming languages like C/C++ typically results in architecture-specific optimization techniques getting entangled with the application specification. In such situations, moving from one target architecture to another usually requires a reimplementation of the entire application. Further, several compiler transformations are rendered ineffective due to implementation choices. Domain-Specific Languages (DSL) tackle both these issues by allowing developers to specify the computation at a high level, allowing the compiler to handle many tedious and error-prone tasks, while generating efficient code for multiple target architectures at the same time. Here we present Forma, a DSL for image processing applications that targets both CPUs and GPUs. The language provides syntax to express several operations like stencils, sampling, etc. which are commonly used in this domain. These can be chained together to specify complex pipelines in a concise manner. The Forma compiler is in charge of tedious tasks like memory management, data transfers from host to device, handling boundary conditions, etc. The high-level description allows the compiler to generate efficient code through use of compile-time analysis and by taking advantage of hardware resources, like texture memory on GPUs. The ease with which complex pipelines can be specified in Forma is demonstrated through several examples. The efficiency of the generated code is evaluated through comparison with a state-of-the-art DSL that targets the same domain, Halide. Our experimental results show that using Forma allows developers to obtain comparable performance on both CPU and GPU with less programmer effort. We also show how Forma could be easily integrated with widely used productivity tools like Python and OpenCV. Such an integration would allow users of such tools to develop efficient implementations easily.
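Forma's concrete syntax is not given here, so as a neutral illustration, this is the kind of 1D 3-point stencil, with boundary handling folded in, that a compiler in this domain might emit as CUDA:

    // Clamped-boundary 3-point blur; in a DSL like Forma the boundary logic
    // is the compiler's job, not the programmer's.
    __global__ void blur3(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int l = (i == 0)     ? 0     : i - 1;
        int r = (i == n - 1) ? n - 1 : i + 1;
        out[i] = (in[l] + in[i] + in[r]) / 3.0f;
    }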


international conference on supercomputing | 2013

Towards shared memory consistency models for GPUs

Tyler Sorensen; Ganesh Gopalakrishnan; Vinod Grover

With the widespread use of graphics processing units (GPUs), it is important to ensure that programmers have a clear understanding of their shared memory consistency model, i.e., what values a read may return when it is issued concurrently with writes. Compared to CPUs, GPUs present different shared memory behavior, and we know of no published formal consistency model for them. To fill this void, we establish a formal state transition model of GPU loads, stores, and fences in the language Murphi, and check properties, captured in litmus tests pertaining to ordering and visibility, over executions using the Murphi model checker.
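A classic "message passing" litmus test of the kind such models are checked against, sketched in CUDA (our example, not from the paper; it assumes both blocks are co-resident on the GPU):

    __device__ volatile int data = 0;
    __device__ volatile int flag = 0;

    __global__ void mp() {                 // launch as mp<<<2, 1>>>()
        if (blockIdx.x == 0) {             // producer
            data = 42;
            __threadfence();               // order data write before flag write
            flag = 1;
        } else {                           // consumer
            while (flag == 0) { }          // spin until the flag is visible
            __threadfence();               // order flag read before data read
            int r = data;                  // without the fences, r == 0 is a
            (void)r;                       // possible (relaxed) outcome
        }
    }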


functional high performance computing | 2014

LambdaJIT: a dynamic compiler for heterogeneous optimizations of STL algorithms

Thibaut Lutz; Vinod Grover

C++11 introduced a set of new features to extend the core language and the standard library. Amongst the new features are building blocks for concurrency management, such as threads and atomic operations, and a new syntax to declare single-purpose, one-off functions, called lambda functions, which integrate nicely with the Standard Template Library (STL). The STL provides a set of high-level algorithms operating on data ranges, often applying a user-defined function, which can now be expressed as a lambda function. Together, an STL algorithm and a lambda function provide a concise and efficient way to express a data traversal pattern and localized computation. This paper presents LambdaJIT, a C++11 compiler and a runtime system which enable lambda functions used alongside STL algorithms to be optimized or even re-targeted at runtime. We use compiler integration of the new C++ features to analyze and automatically parallelize the code whenever possible. The compiler also injects part of a program's internal representation into the compiled program binary, which can be used by the runtime to re-compile and optimize the code. We take advantage of the features of lambda functions to create runtime optimizations exceeding those of traditional offline or online compilers. Finally, the runtime can use the embedded intermediate representation with a different backend target to safely offload computation to an accelerator such as a GPU, matching and even outperforming CUDA by up to 10%.
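The pattern LambdaJIT operates on is plain C++11: an STL range algorithm applying a lambda. For instance (our example, not from the paper):

    #include <algorithm>
    #include <vector>

    void scale(std::vector<float> &v, float s) {
        // A one-off lambda passed to an STL algorithm: the traversal pattern
        // plus localized computation that a runtime like LambdaJIT can
        // re-target, e.g. to a GPU backend.
        std::transform(v.begin(), v.end(), v.begin(),
                       [s](float x) { return s * x; });
    }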
