Venkatraman Govindaraju

Publication

Featured research published by Venkatraman Govindaraju.

high-performance computer architecture | 2011

Dynamically Specialized Datapaths for energy efficient computing

Venkatraman Govindaraju; Chen-Han Ho; Karthikeyan Sankaralingam

Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and code-mapping to DySER and evaluate the PARSEC, SPEC and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.

IEEE Micro | 2012

DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing

Venkatraman Govindaraju; Chen-Han Ho; Tony Nowatzki; Jatin Chhugani; Nadathur Satish; Karthikeyan Sankaralingam; Changkyu Kim

The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization. By dynamically specializing frequently executing regions and applying parallelism mechanisms, DySER provides efficient functionality and parallelism specialization. It outperforms an out-of-order CPU, Streaming SIMD Extensions (SSE) acceleration, and GPU acceleration while consuming less energy. The full-system field-programmable gate array (FPGA) prototype of DySER integrated into OpenSPARC demonstrates a practical implementation.

international symposium on microarchitecture | 2008

Toward a multicore architecture for real-time ray-tracing

Venkatraman Govindaraju; Peter Djeu; Karthikeyan Sankaralingam; Mary K. Vernon; William R. Mark

Significant improvement to visual quality for real-time 3D graphics requires modeling of complex illumination effects like soft-shadows, reflections, and diffuse lighting interactions. The conventional Z-buffer algorithm driven GPU model does not provide sufficient support for this improvement. This paper targets the entire graphics system stack and demonstrates algorithms, a software architecture, and a hardware architecture for real-time rendering with a paradigm shift to ray-tracing. The three unique features of our system called Copernicus are support for dynamic scenes, high image quality, and execution on programmable multicore architectures. The focus of this paper is the synergy and interaction between applications, architecture, and evaluation. First, we describe the ray-tracing algorithms which are designed to use redundancy and partitioning to achieve locality. Second, we describe the architecture which uses ISA specialization, multi-threading to hide memory delays and supports only local coherence. Finally, we develop an analytical performance model for our 128-core system, using measurements from simulation and a scaled-down prototype system. More generally, this paper addresses an important issue of mechanisms and evaluation for challenging workloads for future processors. Our results show that a single 8-core tile (each core 4-way multithreaded) can be almost 100% utilized and sustain 10 million rays/second. Sixteen such tiles, which can fit on a 240 mm² chip in 22 nm technology, make up the system and with our anticipated improvements in algorithms, can sustain real-time rendering. The mechanisms and the architecture can potentially support other domains like irregular scientific computations and physics computations.
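
The tile figures quoted in the abstract above can be combined into a rough back-of-the-envelope estimate. The sketch below is illustrative only and is not the paper's analytical performance model: it assumes ideal linear scaling across tiles, and the 30 frames/second target is a hypothetical value chosen for the example. The function names are likewise invented here.

```python
# Figures quoted in the abstract.
CORES_PER_TILE = 8
THREADS_PER_CORE = 4
TILES = 16
RAYS_PER_SEC_PER_TILE = 10_000_000

# Assumption, not from the abstract: a hypothetical real-time frame rate.
TARGET_FPS = 30


def system_totals(tiles=TILES):
    """Return (cores, hardware threads, rays/second) for a given tile count.

    Assumes throughput scales linearly with the number of tiles, which is
    an idealization rather than a result reported in the abstract.
    """
    cores = tiles * CORES_PER_TILE
    threads = cores * THREADS_PER_CORE
    rays_per_sec = tiles * RAYS_PER_SEC_PER_TILE
    return cores, threads, rays_per_sec


def rays_per_frame(rays_per_sec, fps=TARGET_FPS):
    """Ray budget available per frame at the assumed frame rate."""
    return rays_per_sec / fps


if __name__ == "__main__":
    cores, threads, rays = system_totals()
    print(f"{cores} cores, {threads} threads, {rays:,} rays/s")
    print(f"about {rays_per_frame(rays):,.0f} rays per frame at {TARGET_FPS} fps")
```

The 128-core total matches the system size named in the abstract. The per-frame ray budget depends entirely on the assumed frame rate and on the linear-scaling idealization.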

high performance computer architecture | 2012

Design, integration and implementation of the DySER hardware accelerator into OpenSPARC

Jesse Benson; Ryan Cofell; Chris Frericks; Chen-Han Ho; Venkatraman Govindaraju; Tony Nowatzki; Karthikeyan Sankaralingam

Accelerators and specialization in various forms are emerging as a way to increase processor performance. Examples include Navigo, Conservation-Cores, BERET, and DySER. While each of these employs different primitives and principles to achieve specialization, they share some common concerns with regard to implementation. Two of these concerns are: how to integrate them with a commercial processor and how to develop their compiler toolchain. This paper undertakes an implementation study of one design point: integration of DySER into OpenSPARC, a design we call OpenSPlySER. We report on our implementation exercise and quantitative results, and conclude with a set of our lessons learned. We demonstrate that DySER delivers on its goal of providing a non-intrusive accelerator design. OpenSPlySER runs on a Virtex-5 FPGA, boots unmodified Linux, and runs most of the SPECINT benchmarks with our compiler. Due to physical design constraints, speedups on full benchmarks are modest for the FPGA prototype. On targeted microbenchmarks, OpenSPlySER delivers up to a 31-fold speedup over the baseline OpenSPARC. We conclude with some lessons learned from this somewhat unique exercise of significantly modifying a commercial processor. To the best of our knowledge, this work is one of the most ambitious extensions of OpenSPARC.

international conference on parallel architectures and compilation techniques | 2013

Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

Venkatraman Govindaraju; Tony Nowatzki; Karthikeyan Sankaralingam

Modern microprocessors exploit data level parallelism through in-core data-parallel accelerators in the form of short vector ISA extensions such as SSE/AVX and NEON. Although these ISA extensions have existed for decades, compilers do not generate good quality, high-performance vectorized code without significant programmer intervention and manual optimization. The fundamental problem is that the architecture is too rigid, which overly complicates the compiler's role and simultaneously restricts the types of codes that the compiler can profitably map to these data-parallel accelerators. We take a fundamentally new approach that first makes the architecture more flexible and exposes this flexibility to the compiler. Counter-intuitively, increasing the complexity of the accelerator's interface to the compiler enables a more robust and efficient system that supports many types of codes. This system also enables the performance of auto-acceleration to be comparable to that of manually-optimized implementations. To address the challenges of compiling for flexible accelerators, we propose a variant of Program Dependence Graph called the Access Execute Program Dependence Graph to capture spatio-temporal aspects of memory accesses and computations. We implement a compiler that uses this representation and evaluate it by considering both a suite of kernels developed and tuned for SSE, and “challenge” data-parallel applications, the Parboil benchmarks. We show that our compiler, which targets the DySER accelerator, provides high-quality code for the kernels and full applications, commonly reaching within 30% of manually-optimized code and outperforming compiler-produced SSE code by 1.8×.

IEEE Computer Architecture Letters | 2015

A Graph-Based Program Representation for Analyzing Hardware Specialization Approaches

Tony Nowatzki; Venkatraman Govindaraju; Karthikeyan Sankaralingam

Hardware specialization has emerged as a promising paradigm for future microprocessors. Unfortunately, it is natural to develop and evaluate such architectures within end-to-end vertical silos spanning application, language/compiler, hardware design and evaluation tools, leaving little opportunity for cross-architecture analysis and innovation. This paper develops a novel program representation suitable for modeling heterogeneous architectures with specialized hardware, called the transformable dependence graph (TDG), which combines semantic information about program properties and low-level hardware events in a single representation. We demonstrate, using four example architectures from the literature, that the TDG is a feasible, simple, and accurate modeling technique for transparent specialization architectures, enabling cross-domain comparison and design-space exploration.

ieee hot chips symposium | 2012

Prototyping the DySER specialization architecture with OpenSPARC

Jesse Benson; Ryan Cofell; Chris Frericks; Venkatraman Govindaraju; Chen-Han Ho; Zachary Marzec; Tony Nowatzki; Karu Sankaralingam

This paper describes the prototype implementation of the DySER specialization architecture integrated into the OpenSPARC processor. The paper's description covers the hardware, compiler, and application tuning. The prototype system provides speedups up to 14× over OpenSPARC (geometric mean 5×). The architecture is more flexible than SIMD and GPU-based acceleration while supporting a more diverse set of workloads.

international symposium on microarchitecture | 2017

A many-core architecture for in-memory data processing

Sandeep R. Agrawal; Sam Idicula; Arun Raghavan; Evangelos Vlachos; Venkatraman Govindaraju; Venkatanathan Varadarajan; Cagri Balkesen; Georgios Giannikis; Charlie Roth; Nipun Agarwal; Eric Sedlar

For many years, the highest energy cost in processing has been data movement rather than computation, and energy is the limiting factor in processor design [21]. As the data needed for a single application grows to exabytes [56], there is clearly an opportunity to design a bandwidth-optimized architecture for big data computation by specializing hardware for data movement. We present the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and partitioning operations at the memory controller that is sufficient to keep up with DDR bandwidth. The DPU also provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine. Comparison of a DPU chip fabricated in 40nm with a Xeon processor on a variety of data processing applications shows a 3× to 15× performance per watt advantage.

international congress on big data | 2017

Big Data Processing: Scalability with Extreme Single-Node Performance

Venkatraman Govindaraju; Sam Idicula; Sandeep R. Agrawal; Venkatanathan Varadarajan; Arun Raghavan; Jarod Wen; Cagri Balkesen; Georgios Giannikis; Nipun Agarwal; Eric Sedlar


international symposium on computer architecture | 2011