Yatin Hoskote | Researchain

Publication

Featured research published by Yatin Hoskote.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2009

Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives

Radu Marculescu; Umit Y. Ogras; Li-Shiuan Peh; Natalie D. Enright Jerger; Yatin Hoskote

To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.

IEEE Journal of Solid-state Circuits | 2008

An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS

Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Arvind Singh; Tiju Jacob; Shailendra Jain; Vasantha Erraguntla; Clark Roberts; Yatin Hoskote; Nitin Borkar; Shekhar Borkar

This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8×10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
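As a rough cross-check of the figures in this abstract, peak throughput can be estimated as tiles × FPMACs per tile × operations per multiply-accumulate × clock frequency. The sketch below assumes that each multiply-accumulate counts as two floating-point operations and that every FPMAC retires one per cycle; neither convention is stated in the abstract, and the function name is illustrative.

```python
def peak_flops(tiles, fpmacs_per_tile, freq_hz, flops_per_mac=2):
    """Peak FLOP/s, assuming one multiply-accumulate per FPMAC per cycle.

    flops_per_mac=2 (one multiply plus one add) is an assumed convention.
    """
    return tiles * fpmacs_per_tile * flops_per_mac * freq_hz


design_point = peak_flops(80, 2, 4.0e9)     # 4 GHz design target
measured_point = peak_flops(80, 2, 4.27e9)  # 4.27 GHz measured operating point

print(f"{design_point / 1e12:.2f} TFLOPS peak at 4 GHz")
print(f"{measured_point / 1e12:.2f} TFLOPS peak at 4.27 GHz")
```

Under these assumptions the 4 GHz design point works out to 1.28 TFLOPS of peak throughput, so the reported "over 1.0 TFLOPS" on benchmarks sits below the theoretical peak, as expected for sustained performance.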

International Solid-State Circuits Conference | 2010

A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS

Jason Howard; Saurabh Dighe; Yatin Hoskote; Sriram R. Vangal; David Finan; Gregory Ruhl; David Jenkins; Howard Wilson; Nitin Borkar; Gerhard Schrom; Fabric Pailet; Shailendra Jain; Tiju Jacob; Satish Yada; Sraven Marella; Praveen Salihundam; Vasantha Erraguntla; Michael Konow; Michael Riepen; Guido Droege; Joerg Lindemann; Matthias Gries; Thomas Apel; Kersten Henriss; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart; Timothy G. Mattson

Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™ class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing-programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB on-die shared memory, for increased performance. Power is kept at a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567 mm² and is implemented in 45nm high-κ metal-gate CMOS [2].
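The on-die memory totals quoted in this abstract follow directly from the core and tile counts. The short sketch below reproduces that arithmetic; it assumes binary units (1 KB = 1024 bytes), and the variable names are illustrative rather than taken from the paper.

```python
KB = 1024          # assumed binary kilobyte
MB = 1024 * KB

cores = 48
mesh_cols, mesh_rows = 6, 4
tiles = mesh_cols * mesh_rows        # 6 x 4 mesh of tiles
cores_per_tile = cores // tiles      # cores sharing each tile

l2_total = cores * 256 * KB          # one private 256KB L2 per core
mpb_total = tiles * 16 * KB          # one 16KB message-passing buffer per tile

print(f"{tiles} tiles, {cores_per_tile} cores per tile")
print(f"L2 total:  {l2_total // MB} MB")
print(f"MPB total: {mpb_total // KB} KB")
```

The results match the abstract: 24 tiles with two cores each, 12MB of L2, and 384KB of message-passing buffer.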

International Solid-State Circuits Conference | 2007

An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS

Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Priya Iyer; Arvind Singh; Tiju Jacob; Shailendra Jain; Sriram Venkataraman; Yatin Hoskote; Nitin Borkar

A 275 mm² network-on-chip architecture contains 80 tiles arranged as a 10 × 8 2D array of floating-point cores and packet-switched routers, operating at 4GHz. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. The 65nm 100M transistor die is designed to achieve a peak performance of 1.0TFLOPS at 1V while dissipating 98W.

International Symposium on Microarchitecture | 2007

A 5-GHz Mesh Interconnect for a Teraflops Processor

Yatin Hoskote; Sriram R. Vangal; Arvind Singh; Nitin Borkar; Shekhar Borkar

A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a Teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric to connect the cores. The network delivers a bisection bandwidth of 2.56 Terabits per second and a per-hop fall-through latency of 1 nanosecond.
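The network figures in this abstract can be restated in clock cycles. The sketch below is only a unit conversion of the quoted numbers, assuming decimal prefixes (1 Tb/s = 10^12 bits/s); it does not model the router, and the constant names are illustrative.

```python
FREQ_HZ = 5_000_000_000             # 5 GHz mesh clock
BISECTION_BPS = 2_560_000_000_000   # 2.56 Tb/s bisection bandwidth
HOP_LATENCY_PS = 1_000              # 1 ns per-hop fall-through latency

# Integer arithmetic in picoseconds avoids floating-point rounding.
clock_period_ps = 1_000_000_000_000 // FREQ_HZ
cycles_per_hop = HOP_LATENCY_PS // clock_period_ps
bits_per_cycle = BISECTION_BPS // FREQ_HZ

print(f"clock period: {clock_period_ps} ps")
print(f"fall-through latency: {cycles_per_hop} cycles per hop")
print(f"bisection traffic: {bits_per_cycle} bits per cycle")
```

In other words, the quoted latency corresponds to five clock cycles per hop, and the bisection bandwidth to 512 bits crossing the cut each cycle.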

International Solid-State Circuits Conference | 2007

Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging

J. Tschanz; Nam Sung Kim; Saurabh Dighe; Jason Howard; Gregory Ruhl; S. Vangal; S. Narendra; Yatin Hoskote; Howard Wilson; C. Lam; M. Shuman; Dinesh Somasekhar; Stephen H. Tang; David Finan; Tanay Karnik; Nitin Borkar; Nasser A. Kurd; Vivek De

Temperature, voltage, and current sensors monitor the operation of a TCP/IP offload accelerator engine fabricated in 90nm CMOS, and a control unit dynamically changes frequency, voltage, and body bias for optimum performance and energy efficiency. Fast response to droops and temperature changes is enabled by a multi-PLL clocking unit and on-chip body bias. Adaptive techniques are also used to compensate for performance degradation due to device aging, reducing the aging guardband.

International Solid-State Circuits Conference | 2004

Ultra-low voltage circuits and processor in 180nm to 90nm technologies with a swapped-body biasing technique

Siva G. Narendra; James W. Tschanz; Joseph Hofsheier; Bradley Bloechel; Sriram R. Vangal; Yatin Hoskote; Stephen H. Tang; Dinesh Somasekhar; Ali Keshavarzi; Vasantha Erraguntla; Greg Dermer; Nitin Borkar; Shekhar Borkar; Vivek De

A low-voltage swapped-body biasing technique where PMOS bodies are connected to ground and NMOS bodies to Vcc is evaluated. Available measurements show more than 2.6x frequency improvement at 0.5V Vcc and the ability to reduce Vcc by 0.2V for the same frequency compared to no body bias in 180 to 90nm CMOS technologies.

IEEE Journal of Solid-state Circuits | 2006

A 6.2-GFlops Floating-Point Multiply-Accumulator With Conditional Normalization

Sriram R. Vangal; Yatin Hoskote; Nitin Borkar; Atila Alvandpour

A pipelined single-precision floating-point multiply-accumulator (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic with delayed addition is described. A combination of algorithmic, logic, and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz with single-cycle throughput. The optimizations allow removal of the costly normalization step from the critical accumulate loop. This logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading-zero anticipator (LZA) and overflow prediction logic applicable to carry-save format is presented. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFlops of performance while dissipating 1.2 W at 3.1 GHz, 1.3-V supply.
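The general idea behind carry-save accumulation with delayed addition can be illustrated on plain non-negative integers. The sketch below is a textbook 3:2 compressor loop, not the paper's base-32 floating-point datapath; the function names are illustrative. The running total is kept as a separate sum word and carry word, so no carry ripples across bit positions inside the loop, and a single carry-propagate addition happens only at the end.

```python
def carry_save_add(s, c, x):
    """3:2 compressor: reduce three operands to a (sum, carry) pair.

    Each output bit depends only on the three input bits in the same
    position, so no carry propagates across the word.
    """
    new_sum = s ^ c ^ x
    new_carry = ((s & c) | (s & x) | (c & x)) << 1
    return new_sum, new_carry


def accumulate(values):
    """Accumulate non-negative integers in carry-save form."""
    s, c = 0, 0
    for x in values:
        s, c = carry_save_add(s, c, x)  # carry-free step inside the loop
    return s + c                        # delayed carry-propagate addition


total = accumulate([3, 5, 7, 11])
print(total)
```

The loop invariant is that `s + c` always equals the sum of the values consumed so far, which is why the final addition recovers the ordinary total.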

International Conference on Computer Design | 1995

Automatic extraction of the control flow machine and application to evaluating coverage of verification vectors

Yatin Hoskote; Dinos Moundanos; Jacob A. Abraham

Simulation is still the primary, although inadequate, resource for verifying the conformity of a design to its functional specification. Fortunately, most errors in the early stages of design involve only the control flow in the circuit. We define the functional coverage of a given sequence of verification vectors as the amount of control behavior exercised by them. We present a novel technique for automatically extracting the control flow of a design on the basis of the underlying mathematical model. Significantly, this extraction is independent of the circuit description style. The Extracted Control Flow Machine (ECFM) is then used for estimation of functional coverage and to provide information that will help the designer improve the quality of his or her tests.
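The abstract defines functional coverage as the amount of control behavior exercised by a vector sequence but gives no formula. One simple way to make that concrete, assumed here purely for illustration, is the fraction of state-machine transitions visited. The sketch below uses a hypothetical three-state controller and does not reproduce the ECFM extraction itself.

```python
def transition_coverage(transitions, start, inputs):
    """Fraction of transitions in `transitions` exercised by `inputs`.

    `transitions` maps (state, input symbol) to the next state. Simulation
    stops at the first input with no defined transition.
    """
    state = start
    visited = set()
    for symbol in inputs:
        key = (state, symbol)
        if key not in transitions:
            break
        visited.add(key)
        state = transitions[key]
    return len(visited) / len(transitions)


# Hypothetical controller, invented for this example.
fsm = {
    ("IDLE", "req"): "BUSY",
    ("BUSY", "done"): "IDLE",
    ("BUSY", "err"): "FAULT",
    ("FAULT", "reset"): "IDLE",
}

coverage = transition_coverage(fsm, "IDLE", ["req", "done", "req", "done"])
print(f"transition coverage: {coverage:.0%}")
```

Here the vectors never drive the error path, so only two of the four transitions are exercised; a report like this is the kind of feedback that points a designer toward missing tests.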

IEEE Journal of Solid-state Circuits | 2003

A TCP offload accelerator for 10 Gb/s Ethernet in 90-nm CMOS

Yatin Hoskote; Bradley Bloechel; Greg Dermer; Vasantha Erraguntla; David Finan; Jason Howard; D. Klowden; Siva G. Narendra; Gregory Ruhl; J. Tschanz; Sriram R. Vangal; V. Veeramachaneni; Howard Wilson; Jianping Xu; Nitin Borkar

This programmable engine is designed to offload TCP inbound processing at wire speed for 10-Gb/s Ethernet, supporting 64-byte minimum packet size. This prototype chip employs a high-speed core and a specialized instruction set. It includes hardware support for dynamically reordering out-of-order packets. In a 90-nm CMOS process, the 8-mm² experimental chip has 460 K transistors. First silicon has been validated to be fully functional and achieves 9.64-Gb/s packet processing performance at 1.72 V and consumes 6.39 W.
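To give a sense of what wire speed means at the 64-byte minimum packet size, the sketch below estimates the packet rate and the per-packet time budget on a 10 Gb/s link. The abstract does not say how it accounts for framing, so two cases are shown: frame bytes only, and frame bytes plus the standard 20 bytes of Ethernet preamble, start-of-frame delimiter, and inter-frame gap. The function and constant names are illustrative.

```python
LINE_RATE_BPS = 10_000_000_000  # 10 Gb/s Ethernet
MIN_FRAME_BYTES = 64            # minimum packet size from the abstract
WIRE_OVERHEAD_BYTES = 20        # 8-byte preamble/SFD + 12-byte inter-frame gap


def packets_per_second(frame_bytes, overhead_bytes=0):
    """Packets per second the link can carry at the given frame size."""
    return LINE_RATE_BPS / ((frame_bytes + overhead_bytes) * 8)


raw_rate = packets_per_second(MIN_FRAME_BYTES)
wire_rate = packets_per_second(MIN_FRAME_BYTES, WIRE_OVERHEAD_BYTES)
budget_ns = 1e9 / wire_rate     # time available to process each packet

print(f"frame bytes only:   {raw_rate / 1e6:.2f} Mpps")
print(f"with wire overhead: {wire_rate / 1e6:.2f} Mpps")
print(f"per-packet budget:  {budget_ns:.1f} ns")
```

Either way, the engine has only tens of nanoseconds per packet, which helps explain the choice of a high-speed core with a specialized instruction set.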
