Tushar Krishna | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Tushar Krishna is active.

Explore More

Publication

Featured researches published by Tushar Krishna.

ACM Sigarch Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and exible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

international symposium on performance analysis of systems and software | 2009

GARNET: A detailed on-chip network model inside a full-system simulator

Niket Agarwal; Tushar Krishna; Li-Shiuan Peh; Niraj K. Jha

Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire delay, and interconnect power comparable to transistor power. CMP design proposals can no longer ignore the interaction between the memory hierarchy and the interconnection network that connects various elements. This necessitates a detailed and accurate interconnection network model within a full-system evaluation framework. Ignoring the interconnect details might lead to inaccurate results when simulating a CMP architecture. It also becomes important to analyze the impact of interconnection network optimization techniques on full system behavior. In this light, we developed a detailed cycle-accurate interconnection network model (GARNET), inside the GEMS full-system simulation framework. GARNET models a classic five-stage pipelined router with virtual channel (VC) flow control. Microarchitectural details, such as flit-level input buffers, routing logic, allocators and the crossbar switch, are modeled. GARNET, along with GEMS, provides a detailed and accurate memory system timing model. To demonstrate the importance and potential impact of GARNET, we evaluate a shared and private L2 CMP with a realistic state-of-the-art interconnection network against the original GEMS simple network. The objective of the evaluation was to figure out which configuration is better for a particular workload. We show that not modeling the interconnect in detail might lead to an incorrect outcome. We also evaluate Express Virtual Channels (EVCs), an on-chip network flow control proposal, in a full-system fashion. We show that in improving on-chip network latency-throughput, EVCs do lead to better overall system runtime, however, the impact varies widely across applications.

international solid-state circuits conference | 2016

14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks

Yu-Hsin Chen; Tushar Krishna; Joel S. Emer; Vivienne Sze

Deep learning using convolutional neural networks (CNN) gives state-of-the-art accuracy on many computer vision tasks (e.g. object detection, recognition, segmentation). Convolutions account for over 90% of the processing in CNNs for both inference/testing and training, and fully convolutional networks are increasingly being used. To achieve state-of-the-art accuracy requires CNNs with not only a larger number of layers, but also millions of filters weights, and varying shapes (i.e. filter sizes, number of filters, number of channels) as shown in Fig. 14.5.1. For instance, AlexNet [1] uses 2.3 million weights (4.6MB of storage) and requires 666 million MACs per 227×227 image (13kMACs/pixel). VGG16 [2] uses 14.7 million weights (29.4MB of storage) and requires 15.3 billion MACs per 224×224 image (306kMACs/pixel). The large number of filter weights and channels results in substantial data movement, which consumes significant energy.

IEEE Journal of Solid-state Circuits | 2017

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

Yu-Hsin Chen; Tushar Krishna; Joel S. Emer; Vivienne Sze

international symposium on microarchitecture | 2011

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Tushar Krishna; Li-Shiuan Peh; Bradford M. Beckmann; Steven K. Reinhardt

The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many of the proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data coherence and consistency. The on-chip network is the communication backbone that needs to handle all these flows efficiently to allow these protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing the forking of multicast packets within the network at routers, but these techniques incur high packet delays and power penalties. There has been little research in addressing M-to-1 traffic. We propose two in-network techniques, Flow Across Network Over Uncongested Trees (FANOUT) and Flow AggregatioN In-Network (FANIN), which perform efficient 1-to-M forking and M-to-1 aggregation, respectively, such that packets incur only single-cycle delays at most routers along their path, thus approaching an ideal network (one that incurs only wire delay/energy). Full-system simulations on a 64-core CMP with SPLASH-2 and PARSEC benchmarks show that FANOUT and FANIN together reduce runtime by 14.9% and network energy by 40.2%, on average, compared to state-of-the-art networks, operating at just 1% and 9.6% above the runtime and energy of an ideal network.

high performance interconnects | 2008

NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication

Tushar Krishna; Amit Kumar; Patrick Chiang; Mattan Erez; Li-Shiuan Peh

As processor core counts increase, networks-on-chip (NoCs) are becoming an increasingly popular interconnection fabric due to their ability to supply high bandwidth. However, NoCs need to deliver this high bandwidth at low latencies, while keeping within a tight power envelope. In this paper, we present a novel NoC with hybrid interconnect that leverages multiple types of interconnects - specifically, conventional full-swing short-range wires for the data path, in conjunction with low-swing, multi-drop wires with long-range, ultra-low-latency communication for the flow control signals. We show how this proposed system can be used to overcome key limitations of express virtual channels (EVC), a recently proposed flow control technique that allows packets to bypass intermediate routers to simultaneously improve energy-delay-throughput. Our preliminary results show up to a 8.2% reduction in power and up to a 44% improvement in latency under heavy load compared to the original EVC design that only uses the conventional full-swing interconnects.

high-performance computer architecture | 2013

Breaking the on-chip latency barrier using SMART

Tushar Krishna; Chia-Hsin Owen Chen; Woo Cheol Kwon; Li-Shiuan Peh

As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load network latency between a source and destination is equal to the number of routers + links (i.e. hops×2) between them. OS/compiler and cache coherence protocols designers often try to limit communication to within a few hops, since on-chip latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination. We do not add any additional fast physical express links in the data-path; instead we drive the shared crossbars and links asynchronously up to multiple-hops within a single cycle. We design a router + link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and setup multi-hop paths within a cycle. A place-and-routed design at 45nm achieves 11 hops within a 1GHz cycle for paths without turns (9 for paths with turns). We observe 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PAR-SEC benchmarks demonstrate 27/52% and 20/59% reduction in runtime and EDP for Private/Shared L2 designs.

international symposium on computer architecture | 2014

SCORPIO: a 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering

Bhavya K. Daya; Chia-Hsin Chen; Suvinay Subramanian; Woo Cheol Kwon; Sunghyun Park; Tushar Krishna; Jim Holt; Anantha P. Chandrakasan; Li-Shiuan Peh

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupledfrom the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36 and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by13mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power ArchitectureTMcores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19 % of tile power.

design, automation, and test in europe | 2013

SMART: a single-cycle reconfigurable NoC for SoC applications

Chia-Hsin Owen Chen; Sunghyun Park; Tushar Krishna; Suvinay Subramanian; Anantha P. Chandrakasan; Li-Shiuan Peh

As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency, high bandwidth, at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded within the router crossbars, that allows packets to potentially bypass all the way from source to destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2GHz, we can traverse 8mm within a single cycle, i.e. 8 hops with 1mm cores. We implement the SMART NoC to layout and show that SMART NoC gives 60% latency savings, and 2.2X power savings compared to a baseline mesh NoC.

international conference on computer design | 2010

SWIFT: A SWing-reduced interconnect for a Token-based Network-on-Chip in 90nm CMOS

Tushar Krishna; Jacob Postman; Christopher Edmonds; Li-Shiuan Peh; Patrick Chiang

With the advent of chip multi-processors (CMPs), on-chip networks are critical for providing low-power communications that scale to high core counts. With this motivation, we present a 64-bit, 8×8 mesh Network-on-Chip in 90nm CMOS that: a) bypasses flit buffering in routers using Token Flow Control, thereby reducing buffer power along the control path, and b) uses low-voltage-swing crossbars and links to reduce interconnect energy in the data path. These approaches enable 38% power savings and 39% latency reduction, when compared with an equivalent baseline network. An experimental 2×2 core prototype, operating at 400 MHz, validates our design.

Explore More