David Finan
Intel
Publication
Featured research published by David Finan.
IEEE Journal of Solid-state Circuits | 2008
Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Arvind Singh; Tiju Jacob; Shailendra Jain; Vasantha Erraguntla; Clark Roberts; Yatin Hoskote; Nitin Borkar; Shekhar Borkar
This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8×10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
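The throughput figures in this abstract can be sanity-checked with a short calculation. The sketch below is only an upper-bound estimate: it assumes each FPMAC completes one multiply-accumulate per cycle and that a multiply-accumulate counts as two floating-point operations, neither of which is stated explicitly in the abstract. The function and variable names are illustrative.

```python
def peak_flops(tiles: int, fpmacs_per_tile: int, freq_hz: float,
               flops_per_mac: int = 2) -> float:
    """Upper bound on FLOP/s, assuming every FPMAC retires one
    multiply-accumulate (counted as 2 FLOPs) per clock cycle."""
    return tiles * fpmacs_per_tile * flops_per_mac * freq_hz


TILES = 8 * 10          # 8x10 tile array from the abstract
FPMACS_PER_TILE = 2     # two FPMAC units per tile

# Peak at the 4 GHz design target and at the reported 4.27 GHz operating point.
peak_at_design = peak_flops(TILES, FPMACS_PER_TILE, 4.0e9)
peak_at_measured = peak_flops(TILES, FPMACS_PER_TILE, 4.27e9)

# Lower bound on energy efficiency from "over 1.0 TFLOPS" at 97 W.
gflops_per_watt = 1.0e12 / 97 / 1e9

print(f"peak @ 4.00 GHz: {peak_at_design / 1e12:.2f} TFLOPS")
print(f"peak @ 4.27 GHz: {peak_at_measured / 1e12:.2f} TFLOPS")
print(f"efficiency >= {gflops_per_watt:.1f} GFLOPS/W")
```

Under these assumptions the peak comes to about 1.28 TFLOPS at 4 GHz, so the reported figure of over 1.0 TFLOPS sits within the bound.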
international solid-state circuits conference | 2010
Jason Howard; Saurabh Dighe; Yatin Hoskote; Sriram R. Vangal; David Finan; Gregory Ruhl; David Jenkins; Howard Wilson; Nitin Borkar; Gerhard Schrom; Fabric Pailet; Shailendra Jain; Tiju Jacob; Satish Yada; Sraven Marella; Praveen Salihundam; Vasantha Erraguntla; Michael Konow; Michael Riepen; Guido Droege; Joerg Lindemann; Matthias Gries; Thomas Apel; Kersten Henriss; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart; Timothy G. Mattson
Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™ class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing-programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB on-die shared memory, for increased performance. Power is kept at a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567mm² and is implemented in 45nm high-κ metal-gate CMOS [2].
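The capacity totals quoted in this abstract follow from the per-core and per-tile figures. The sketch below reproduces that arithmetic; it assumes the 48 cores are spread evenly over the 6×4 mesh (two per tile), which the abstract implies but does not state, and that KB and MB are binary units. All names are illustrative.

```python
MESH_COLS, MESH_ROWS = 6, 4   # 6x4 2D mesh of tiles
CORES = 48
L2_PER_CORE_KB = 256          # private L2 per core
MPB_PER_TILE_KB = 16          # message-passing buffer per tile
DDR3_CONTROLLERS = 4
AGGREGATE_MEM_BW_GBS = 21.0   # aggregate peak memory bandwidth

tiles = MESH_COLS * MESH_ROWS
# Assumption: cores are distributed evenly, so this division is exact.
cores_per_tile = CORES // tiles

total_l2_mb = CORES * L2_PER_CORE_KB / 1024
total_mpb_kb = tiles * MPB_PER_TILE_KB
bw_per_controller_gbs = AGGREGATE_MEM_BW_GBS / DDR3_CONTROLLERS

print(f"{tiles} tiles, {cores_per_tile} cores per tile")
print(f"total L2: {total_l2_mb:.0f} MB, total MPB: {total_mpb_kb} KB")
print(f"peak bandwidth per DDR3 controller: {bw_per_controller_gbs:.2f} GB/s")
```

The computed totals, 12 MB of L2 and 384 KB of message-passing buffer, agree with the values given in the abstract.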
international solid-state circuits conference | 2007
Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Priya Iyer; Arvind Singh; Tiju Jacob; Shailendra Jain; Sriram Venkataraman; Yatin Hoskote; Nitin Borkar
A 275mm² network-on-chip architecture contains 80 tiles arranged as a 10 × 8 2D array of floating-point cores and packet-switched routers, operating at 4GHz. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. The 65nm 100M transistor die is designed to achieve a peak performance of 1.0TFLOPS at 1V while dissipating 98W.
IEEE Journal of Solid-state Circuits | 2005
Peter Hazucha; Tanay Karnik; Bradley Bloechel; Colleen Parsons; David Finan; Shekhar Borkar
We demonstrate a fully integrated linear regulator for multisupply voltage microprocessors implemented in a 90 nm CMOS technology. Ultra-fast single-stage load regulation achieves a 0.54-ns response time at 94% current efficiency. For a 1.2-V input voltage and 0.9-V output voltage the regulator enables a 90 mV peak-to-peak output droop for a 100-mA load step with only a small on-chip decoupling capacitor of 0.6 nF. By using a PMOS pull-up transistor in the output stage we achieved a small regulator area of 0.008 mm² and a minimum dropout voltage of 0.2 V for 100 mA of output current. The area for the 0.6-nF MOS capacitor is 0.090 mm².
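The droop, load step, capacitance, and response time in this abstract are related by a simple charge-balance estimate. The sketch below uses a first-order model that is an assumption, not the authors' analysis: during the response time the decoupling capacitor alone supplies the full load step, so the droop is ΔV = I·Δt/C. Names are illustrative.

```python
def droop_volts(load_step_a: float, response_time_s: float,
                decap_f: float) -> float:
    """First-order droop estimate: the capacitor supplies the whole load
    step for the regulator's response time, so dV = I * dt / C."""
    return load_step_a * response_time_s / decap_f


LOAD_STEP_A = 100e-3       # 100 mA load step
RESPONSE_TIME_S = 0.54e-9  # 0.54 ns response time
DECAP_F = 0.6e-9           # 0.6 nF on-chip decoupling capacitor
V_OUT = 0.9                # 0.9 V output voltage

droop = droop_volts(LOAD_STEP_A, RESPONSE_TIME_S, DECAP_F)
droop_fraction = droop / V_OUT

print(f"estimated droop: {droop * 1e3:.0f} mV "
      f"({droop_fraction * 100:.0f}% of the output voltage)")
```

This simple model gives about 90 mV, or roughly 10% of the 0.9 V output, which matches the droop reported in the abstract.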
international solid-state circuits conference | 2007
J. Tschanz; Nam Sung Kim; Saurabh Dighe; Jason Howard; Gregory Ruhl; S. Vangal; S. Narendra; Yatin Hoskote; Howard Wilson; C. Lam; M. Shuman; Dinesh Somasekhar; Stephen H. Tang; David Finan; Tanay Karnik; Nitin Borkar; Nasser A. Kurd; Vivek De
Temperature, voltage, and current sensors monitor the operation of a TCP/IP offload accelerator engine fabricated in 90nm CMOS, and a control unit dynamically changes frequency, voltage, and body bias for optimum performance and energy efficiency. Fast response to droops and temperature changes is enabled by a multi-PLL clocking unit and on-chip body bias. Adaptive techniques are also used to compensate for performance degradation due to device aging, reducing the aging guardband.
IEEE Journal of Solid-state Circuits | 2003
Yatin Hoskote; Bradley Bloechel; Greg Dermer; Vasantha Erraguntla; David Finan; Jason Howard; D. Klowden; Siva G. Narendra; Gregory Ruhl; J. Tschanz; Sriram R. Vangal; V. Veeramachaneni; Howard Wilson; Jianping Xu; Nitin Borkar
This programmable engine is designed to offload TCP inbound processing at wire speed for 10-Gb/s Ethernet, supporting 64-byte minimum packet size. This prototype chip employs a high-speed core and a specialized instruction set. It includes hardware support for dynamically reordering out-of-order packets. In a 90-nm CMOS process, the 8-mm² experimental chip has 460 K transistors. First silicon has been validated to be fully functional and achieves 9.64-Gb/s packet processing performance at 1.72 V and consumes 6.39 W.
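To give a sense of what wire speed at the 64-byte minimum packet size demands, the sketch below estimates the packet rate and the time budget per packet on a 10-Gb/s link. It assumes standard Ethernet framing overhead of 20 bytes per frame on the wire (7-byte preamble, 1-byte start-of-frame delimiter, and 12-byte inter-frame gap); the abstract does not say how the authors account for this overhead, so treat the numbers as an illustration.

```python
LINE_RATE_BPS = 10e9        # 10-Gb/s Ethernet
MIN_FRAME_BYTES = 64        # minimum packet size from the abstract
# Assumption: standard per-frame wire overhead (preamble + SFD + inter-frame gap).
WIRE_OVERHEAD_BYTES = 7 + 1 + 12

bits_on_wire = (MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
packets_per_second = LINE_RATE_BPS / bits_on_wire
time_per_packet_ns = 1e9 / packets_per_second

# Reported power and throughput give a rough energy cost per processed bit.
energy_per_bit_nj = 6.39 / 9.64e9 * 1e9

print(f"{packets_per_second / 1e6:.2f} Mpackets/s at line rate")
print(f"{time_per_packet_ns:.1f} ns available per minimum-size packet")
print(f"about {energy_per_bit_nj:.2f} nJ per bit at 9.64 Gb/s and 6.39 W")
```

Under these framing assumptions the engine has roughly 67 ns to handle each minimum-size packet, which is why the abstract emphasizes a high-speed core and a specialized instruction set.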
international solid-state circuits conference | 2003
Sriram R. Vangal; Yatin Hoskote; Dinesh Somasekhar; Vasantha Erraguntla; Jason Howard; Greg Ruhl; V. Veeramachaneni; David Finan; Sanu K. Mathew; Nitin Borkar
A 32 b single-cycle floating point accumulator that uses base 32 and carry-save format with delayed addition is described. Combined algorithmic, logic and circuit techniques enable multiply-accumulate operation at 5 GHz. In a 90 nm 7M dual-V_T CMOS process, the 2 mm² prototype contains 230K transistors and dissipates 1.2 W at 5 GHz, 1.2 V and 25°C.
symposium on vlsi circuits | 2004
Peter Hazucha; Tanay Karnik; Bradley Bloechel; Colleen Parsons; David Finan; Shekhar Borkar
We demonstrate a fully-integrated linear regulator for multi-supply-voltage microprocessors implemented in a 90 nm CMOS technology. Ultra-fast, single-stage load regulation achieves 0.54 ns response time at 94% current efficiency. This enables 10% peak-to-peak output noise for a 100 mA load step with only a small on-chip decoupling capacitor of 0.6 nF. A PMOS pull-up transistor in the output stage results in a small regulator area of 0.008 mm²; the 0.6 nF MOS capacitor occupies 0.090 mm².
international solid-state circuits conference | 1994
Edmund A. Reese; Howard Wilson; D. Nedwek; J. Jex; M. Khaira; T. Burton; P. Nag; H. Kumar; Charles E. Dike; David Finan; M. Haycock
Recent parallel processor supercomputer designs use an active backplane of routers to form the interconnections between processing elements. Today, high-bandwidth interconnect systems capable of scaling to configurations with more than 500 processing nodes tend to use self-timed designs. This avoids clock distribution problems seen in large phase-sensitive synchronous systems. The BiCMOS routing component described in this paper employs 200 MHz clocked communication for large scalable parallel-processor supercomputer systems. This scheme eliminates the need for clock edges to be phase-aligned across the clock distribution network. Additionally, router inputs accept data at any phase relationship to the receiving router's internal clock.
international solid-state circuits conference | 2003
Yatin Hoskote; Vasantha Erraguntla; David Finan; Jason Howard; Dan Klowden; Siva G. Narendra; Greg Ruhl; James W. Tschanz; Sriram R. Vangal; V. Veeramachaneni; Howard Wilson; Jianping Xu; Nitin Borkar
This prototype offloads TCP input processing on minimum packet sizes at wire speed for 10Gb/s Ethernet. The design employs a 10GHz core with a specialized instruction set and includes hardware support for dynamically reordering packets. In a 90nm dual-V_T CMOS process, the 8mm² chip has 260K transistors. Simulation predicts a power dissipation of 1.9W at 1.2V and 10GHz.