Gautham N. Chinya | Researchain

Publication

Featured research published by Gautham N. Chinya.

programming language design and implementation | 2007

EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system

Perry H. Wang; Jamison D. Collins; Gautham N. Chinya; Hong Jiang; Xinmin Tian; Milind Girkar; Nick Y. Yang; Guei-Yuan Lueh; Hong Wang

Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power. We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.

international symposium on computer architecture | 2006

Multiple Instruction Stream Processor

Richard A. Hankins; Gautham N. Chinya; Jamison D. Collins; Perry H. Wang; Ryan N. Rakvic; Hong Wang; John Paul Shen

Microprocessor design is undergoing a major paradigm shift towards multi-core designs, in anticipation that future performance gains will come from exploiting thread-level parallelism in the software. To support this trend, we present a novel processor architecture called the Multiple Instruction Stream Processing (MISP) architecture. MISP introduces the sequencer as a new category of architectural resource, and defines a canonical set of instructions to support user-level inter-sequencer signaling and asynchronous control transfer. MISP allows an application program to directly manage user-level threads without OS intervention. By supporting the classic cache-coherent shared-memory programming model, MISP does not require a radical shift in the multithreaded programming paradigm. This paper describes the design and evaluation of the MISP architecture for the IA-32 family of microprocessors. Using a research prototype MISP processor built on an IA-32-based multiprocessor system equipped with special firmware, we demonstrate the feasibility of implementing the MISP architecture. We then examine the utility of MISP by (1) assessing the key architectural tradeoffs of the MISP architecture design and (2) showing how legacy multithreaded applications can be migrated to MISP with relative ease.

field programmable gate arrays | 2009

Intel Nehalem processor core made FPGA synthesizable

Graham Schelle; Jamison D. Collins; Ethan Schuchman; Perry H. Wang; Xiang Zou; Gautham N. Chinya; Ralf Plate; Thorsten Mattner; Franz Olbrich; Per Hammarlund; Ronak Singhal; Jim Brayton; Sebastian Steibl; Hong Wang

We present an FPGA-synthesizable version of the Intel Atom processor core, synthesized to a Virtex-5 based FPGA emulation system. To make the production Atom design in SystemVerilog synthesizable through industry standard EDA tool flow, we transformed and mapped latches in the design, converted clock gating, and replaced nonsynthesizable constructs with FPGA-synthesizable counterparts. Additionally, as the target FPGA emulator is hosted on a PC platform with the Pentium-based CPU socket that supports a significantly different front side bus (FSB) protocol from that of the Atom processor, we replaced the existing bus control logic in the Atom core with an alternate FSB protocol to communicate with the rest of the PC platform. With these efforts, we succeeded in synthesizing the entire Atom processor core to fit within a single Virtex-5 LX330 FPGA. The synthesizable Atom core runs at 50 MHz on the Pentium PC motherboard with fully functional I/O peripherals. It is capable of booting off-the-shelf MS-DOS, Windows XP and Linux operating systems, and executing standard x86 workloads.

international conference on parallel architectures and compilation techniques | 2008

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Henry Wong; Anne Bracy; Ethan Schuchman; Tor M. Aamodt; Jamison D. Collins; Perry H. Wang; Gautham N. Chinya; Ankur Khandelwal Groen; Hong Jiang; Hong Wang

Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.

IEEE Micro | 2018

Loihi: A Neuromorphic Manycore Processor with On-Chip Learning

Mike Davies; Narayan Srinivasa; Tsung-Han Lin; Gautham N. Chinya; Yongqiang Cao; Sri Harsha Choday; Georgios D. Dimou; Prasad Joshi; Nabil Imam; Shweta Jain; Yuyun Liao; Chit-Kwan Lin; Andrew Lines; Ruokun Liu; Deepak A. Mathaikutty; Steven McCoy; Arnab Paul; Jonathan Tse; Guruguhanathan Venkataramanan; Yi-Hsin Weng; Andreas Wild; Yoonseok Yang; Hong Wang

Loihi is a 60-mm² chip fabricated in Intel's 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.
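To make the LASSO connection concrete, here is a minimal sketch of the conventional, non-spiking and non-convolutional Locally Competitive Algorithm in plain Python. It is not Loihi code and not the authors' implementation. The function names, the Euler step size, and the iteration count are illustrative assumptions. Each unit integrates its input drive, is inhibited by correlated active units, and outputs a soft-thresholded value. The fixed point of these dynamics minimizes 0.5*||x - Phi a||^2 + lam*||a||_1.

```python
def soft_threshold(u, lam):
    """Shrink u toward zero by lam; values within [-lam, lam] become 0."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def lca_lasso(phi, x, lam, step=0.1, iters=500):
    """Approximate the LASSO solution for dictionary phi (rows = dimensions,
    columns = atoms) and signal x using Euler-discretized LCA dynamics.

    step and iters are illustrative assumptions, not tuned values.
    """
    n_dims = len(phi)
    n_atoms = len(phi[0])
    # Feed-forward drive b = phi^T x.
    b = [sum(phi[i][j] * x[i] for i in range(n_dims)) for j in range(n_atoms)]
    # Gram matrix G = phi^T phi gives the lateral inhibition weights.
    gram = [[sum(phi[i][j] * phi[i][k] for i in range(n_dims))
             for k in range(n_atoms)] for j in range(n_atoms)]
    u = [0.0] * n_atoms  # internal (membrane-like) state of each unit
    for _ in range(iters):
        a = [soft_threshold(v, lam) for v in u]
        new_u = []
        for j in range(n_atoms):
            # Inhibition term ((G - I) a)_j from the other active units.
            inhibition = sum((gram[j][k] - (1.0 if k == j else 0.0)) * a[k]
                             for k in range(n_atoms))
            new_u.append(u[j] + step * (b[j] - u[j] - inhibition))
        u = new_u
    return [soft_threshold(v, lam) for v in u]


if __name__ == "__main__":
    # Two correlated unit-norm atoms; the signal equals the first atom.
    dictionary = [[1.0, 0.6],
                  [0.0, 0.8]]
    print(lca_lasso(dictionary, [1.0, 0.0], lam=0.1))
```

With an orthonormal dictionary the inhibition term vanishes and the result reduces to soft-thresholding the input, which is a convenient sanity check for the sketch.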

computing frontiers | 2011

AstroLIT: enabling simulation-based microarchitecture comparison between Intel® and Transmeta designs

Guilherme Ottoni; Gautham N. Chinya; Gerolf F. Hoflehner; Jamison D. Collins; Amit Kumar; Ethan Schuchman; David R. Ditzel; Ronak Singhal; Hong Wang

While the out-of-order engine has been the mainstream micro-architecture-design paradigm to achieve high performance, Transmeta took a different approach using dynamic binary translation (BT). To enable detailed comparison of these two radically different processor-design approaches, it is natural to leverage well-established simulation-based methodologies. However, BT-based processor designs pose new challenges to standard sampling-based simulation methodologies. This paper describes these challenges, and it also introduces the AstroLIT methodology to address them.

international conference on supercomputing | 2007

Sequencer virtualization

Perry H. Wang; Jamison D. Collins; Gautham N. Chinya; Bernard Lint; Asit Mallick; Koichi Yamada; Hong Wang

The Multiple Instruction Stream Processor (MISP) architecture introduces the sequencer as a new class of architectural resource, and provides a minimalist user-level MIMD instruction set extension for application programs to directly control execution of concurrent instruction streams on these sequencers. As with classic architectural resources, namely, registers and memory, the sequencer architectural resource can be subject to virtualization. This paper details the idea of Sequencer Virtualization (SV), a foundational architectural support to decouple architectural virtual sequencers from physical sequencers. SV enables more efficient utilization of sequencer resources at the microarchitectural level while maintaining a consistent programming interface at the architectural level. To evaluate the key tradeoffs for SV, we conduct extensive experiments by implementing a prototype SV system using a custom firmware on a large-scale multiprocessor system. Using the prototype SV system, we demonstrate that SV improves efficiency in sequencer utilization while incurring little performance overhead. In particular, for a set of real multithreaded workloads, SV can significantly improve sequencer utilization, achieving an average of 32% better wall-clock performance than MISP without SV support in a multi-programming environment.

Operating Systems Review | 2011

Bothnia: a dual-personality extension to the Intel integrated graphics driver

Gautham N. Chinya; Jamison D. Collins; Perry H. Wang; Hong Jiang; Guei-Yuan Lueh; Thomas A. Piazza; Hong Wang

In this paper, we introduce Bothnia, an extension to the Intel production graphics driver to support a shared virtual memory heterogeneous multithreading programming model. With Bothnia, the Intel graphics device driver can support both the traditional 3D graphics rendering software stack and a new class of heterogeneous multithreaded applications, which can use both IA (Intel Architecture) CPU cores and Intel integrated Graphics and Media Accelerator (GMA) cores in the same virtual address space. We describe the necessary architectural supports in both IA CPU and the GMA cores and present a reference Bothnia implementation. For a set of GPU accelerated media applications on a PC platform with Intel Core 2 Duo CPU and the Intel integrated GMA X3000 running under the Windows XP operating system, Bothnia achieves an average speedup of 3.6x compared to using the GPU as a device, primarily due to Bothnia's support for creation of shared virtual address space between heterogeneous threads of the same application spread on both IA CPU and GMA cores.

IEEE Computer | 2018

Programming Spiking Neural Networks on Intel’s Loihi

Chit-Kwan Lin; Andreas Wild; Gautham N. Chinya; Yongqiang Cao; Mike Davies; Daniel M. Lavery; Hong Wang

Loihi is Intel’s novel, manycore neuromorphic processor and is the first of its kind to feature a microcode-programmable learning engine that enables on-chip training of spiking neural networks (SNNs). The authors present the Loihi toolchain, which consists of an intuitive Python-based API for specifying SNNs, a compiler and runtime for building and executing SNNs on Loihi, and several target platforms (Loihi silicon, FPGA, and functional simulator). To showcase the toolchain, the authors describe how to build, train, and use an SNN to classify handwritten digits from the MNIST database.
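For readers unfamiliar with spiking neurons, the sketch below simulates a single generic leaky integrate-and-fire neuron in plain Python. It does not use the Loihi toolchain or its API, and it is not Loihi's exact neuron model. The function name and the threshold, decay, and reset parameters are hypothetical, chosen only to show how a membrane potential integrates input and emits a spike when it crosses a threshold.

```python
def simulate_lif(input_current, threshold=1.0, decay=0.9, reset=0.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    input_current: iterable of per-step input values.
    Returns a list of 0/1 spike indicators, one per step.
    All parameter names and defaults are illustrative assumptions.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        # Leak a fraction of the stored potential, then integrate the input.
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = reset  # fire and reset
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # A constant drive makes the neuron fire periodically.
    print(simulate_lif([0.5] * 6, decay=1.0))
```

A stronger constant input shortens the interval between spikes, which is the basic rate-coding behavior that SNN classifiers build on.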

Archive | 2005