Jason Howard | Researchain

Publication

Featured research published by Jason Howard.

IEEE Journal of Solid-state Circuits | 2008

An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS

Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Arvind Singh; Tiju Jacob; Shailendra Jain; Vasantha Erraguntla; Clark Roberts; Yatin Hoskote; Nitin Borkar; Shekhar Borkar

This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8×10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMACs) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
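As a rough check on the figures in this abstract, peak throughput can be estimated as tiles × FPMACs per tile × floating-point operations per multiply-accumulate × clock frequency. The sketch below assumes that each FPMAC retires one multiply-accumulate per cycle and that a multiply-accumulate counts as two floating-point operations; the abstract states neither convention, and the function name is illustrative.

```python
def peak_flops(tiles, fpmacs_per_tile, freq_hz, flops_per_mac=2):
    """Theoretical peak FLOPS, assuming one multiply-accumulate per FPMAC per cycle.

    flops_per_mac=2 (one multiply plus one add) is an assumed counting
    convention, not a figure taken from the abstract.
    """
    return tiles * fpmacs_per_tile * flops_per_mac * freq_hz


# Figures quoted in the abstract: 80 tiles, 2 FPMACs per tile.
peak_at_design_clock = peak_flops(80, 2, 4.00e9)    # 4 GHz design target
peak_at_measured_clock = peak_flops(80, 2, 4.27e9)  # 4.27 GHz measured clock

# Lower bound on energy efficiency from "over 1.0 TFLOPS ... 97 W".
min_gflops_per_watt = 1.0e12 / 97 / 1e9

print(f"Peak at 4.00 GHz: {peak_at_design_clock / 1e12:.2f} TFLOPS")
print(f"Peak at 4.27 GHz: {peak_at_measured_clock / 1e12:.2f} TFLOPS")
print(f"Efficiency lower bound: {min_gflops_per_watt:.1f} GFLOPS/W")
```

Under these assumptions the 4 GHz estimate comes out at 1.28 TFLOPS, which is a theoretical ceiling rather than the measured benchmark result reported above.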

international solid-state circuits conference | 2010

A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS

Jason Howard; Saurabh Dighe; Yatin Hoskote; Sriram R. Vangal; David Finan; Gregory Ruhl; David Jenkins; Howard Wilson; Nitin Borkar; Gerhard Schrom; Fabric Pailet; Shailendra Jain; Tiju Jacob; Satish Yada; Sraven Marella; Praveen Salihundam; Vasantha Erraguntla; Michael Konow; Michael Riepen; Guido Droege; Joerg Lindemann; Matthias Gries; Thomas Apel; Kersten Henriss; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart; Timothy G. Mattson

Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™ class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing-programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB on-die shared memory, for increased performance. Power is kept at a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567mm² and is implemented in 45nm high-κ metal-gate CMOS [2].
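The on-die totals quoted in this abstract follow from the per-core and per-tile figures, as the short sketch below shows. It assumes 1 MB = 1024 KB and, for the last line only, that the aggregate memory bandwidth divides evenly across the four DDR3 controllers; the abstract states neither, and the variable names are illustrative.

```python
KB_PER_MB = 1024  # assumed binary convention for KB -> MB

# Figures quoted in the abstract.
cores = 48
mesh_cols, mesh_rows = 6, 4
l2_kb_per_core = 256
mpb_kb_per_tile = 16
ddr3_controllers = 4
peak_mem_bw_gb_s = 21.0

tiles = mesh_cols * mesh_rows                        # one tile per mesh node
cores_per_tile = cores // tiles                      # cores sharing each tile
total_l2_mb = cores * l2_kb_per_core / KB_PER_MB     # total private L2 on die
total_mpb_kb = tiles * mpb_kb_per_tile               # total message-passing buffer
# Even split across controllers is an assumption, not a reported figure.
bw_per_controller_gb_s = peak_mem_bw_gb_s / ddr3_controllers

print(f"{tiles} tiles, {cores_per_tile} cores per tile")
print(f"L2 total: {total_l2_mb:.0f} MB, MPB total: {total_mpb_kb} KB")
print(f"Per-controller bandwidth (even split): {bw_per_controller_gb_s} GB/s")
```

The results reproduce the 12MB and 384KB totals given above and imply two cores per tile.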

international solid-state circuits conference | 2007

An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS

Sriram R. Vangal; Jason Howard; Gregory Ruhl; Saurabh Dighe; Howard Wilson; James W. Tschanz; David Finan; Priya Iyer; Arvind Singh; Tiju Jacob; Shailendra Jain; Sriram Venkataraman; Yatin Hoskote; Nitin Borkar

A 275mm² network-on-chip architecture contains 80 tiles arranged as a 10×8 2D array of floating-point cores and packet-switched routers, operating at 4GHz. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. The 65nm 100M transistor die is designed to achieve a peak performance of 1.0TFLOPS at 1V while dissipating 98W.

IEEE Journal of Solid-state Circuits | 2011

A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling

Jason Howard; Saurabh Dighe; Sriram R. Vangal; Gregory Ruhl; Nitin Borkar; Shailendra Jain; Vasantha Erraguntla; Michael Konow; Michael Riepen; Matthias Gries; Guido Droege; Tor Lund-Larsen; Sebastian Steibl; S. Borkar; Vivek De; R Van Der Wijngaart

This paper describes a multi-core processor that integrates 48 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet-switched router shared between two IA-32 cores. Core-to-core communication uses message passing while exploiting 384 KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1 V supply, the cores operate at 1 GHz while the 2D-mesh operates at 2 GHz. As performance and voltage scale, the processor dissipates between 25 W and 125 W. The processor is implemented in 45 nm Hi-K CMOS and has 1.3 billion transistors.

ieee international conference on high performance computing data and analytics | 2010

The 48-core SCC Processor: the Programmer's View

Timothy G. Mattson; Michael Riepen; Thomas Lehnig; Paul Brett; Patrick Kennedy; Jason Howard; Sriram R. Vangal; Nitin Borkar; Gregory Ruhl; Saurabh Dighe

The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular, we describe RCCE: the native message passing model created for the SCC processor.

international solid-state circuits conference | 2007

Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging

J. Tschanz; Nam Sung Kim; Saurabh Dighe; Jason Howard; Gregory Ruhl; S. Vanga; S. Narendra; Yatin Hoskote; Howard Wilson; C. Lam; M. Shuman; Dinesh Somasekhar; Stephen H. Tang; David Finan; Tanay Karnik; Nitin Borkar; Nasser A. Kurd; Vivek De

Temperature, voltage, and current sensors monitor the operation of a TCP/IP offload accelerator engine fabricated in 90nm CMOS, and a control unit dynamically changes frequency, voltage, and body bias for optimum performance and energy efficiency. Fast response to droops and temperature changes is enabled by a multi-PLL clocking unit and on-chip body bias. Adaptive techniques are also used to compensate for performance degradation due to device aging, reducing the aging guardband.

international solid-state circuits conference | 2012

A 280mV-to-1.2V wide-operating-range IA-32 processor in 32nm CMOS

Shailendra Jain; Surhud Khare; Satish Yada; V Ambili; Praveen Salihundam; Shiva Ramani; Sriram Muthukumar; Manali R Srinivasan; Arun Kumar; Shasi Kumar Gb; Rajaraman Ramanarayanan; Vasantha Erraguntla; Jason Howard; Sriram R. Vangal; Saurabh Dighe; Greg Ruhl; Paolo A. Aseron; Howard Wilson; Nitin Borkar; Vivek De; Shekhar Borkar

Near-threshold computing brings the promise of an order of magnitude improvement in energy efficiency over the current generation of microprocessors [1]. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all single-threaded or performance-constrained applications. Enabling the processor to operate over a wide voltage range helps to achieve the best possible energy efficiency while satisfying the varying performance demands of the applications. This paper describes an IA-32 processor fabricated in 32nm CMOS technology [2], demonstrating reliable ultra-low-voltage operation and energy-efficient performance across the wide voltage range from 280mV to 1.2V.

IEEE Journal of Solid-state Circuits | 2011

Within-Die Variation-Aware Dynamic-Voltage-Frequency-Scaling With Optimal Core Allocation and Thread Hopping for the 80-Core TeraFLOPS Processor

Saurabh Dighe; Sriram R. Vangal; Paolo A. Aseron; Shasi Kumar; Tiju Jacob; Keith A. Bowman; Jason Howard; James W. Tschanz; Vasantha Erraguntla; Nitin Borkar; Vivek De; Shekhar Borkar

In this paper, we present measured within-die core-to-core Fmax and leakage variation data for an 80-core processor in 65 nm CMOS and 1) populate a parameterized energy/performance model to determine the most energy-efficient operating point for a workload; 2) examine impacts of per-core clock and power gating on optimal dynamic voltage-frequency-core scaling (DVFCS) operating points; and 3) compare improvements in energy efficiency achievable by variation-aware DVFCS and core mapping on Single-Voltage/Multiple-Frequency (SVMF), Multiple-Voltage/Single-Frequency (MVSF) and Multiple-Voltage/Multiple-Frequency (MVMF) designs. Variation-aware DVFS with optimal core mapping is shown to improve energy efficiency 6%-35% across a range of compute/communication activity workloads. A new dynamic thread hopping scheme boosts performance by 5%-10% or energy efficiency by 20%-60%.

international solid-state circuits conference | 2007

A 256-Kb Dual-VCC

Muhammad M. Khellah; Dinesh Somasekhar; Yibin Ye; Nam Sung Kim; Jason Howard; Greg Ruhl; Murad Sunna; James W. Tschanz; Nitin Borkar; Fatih Hamzaoglu; Gunjan Pandya; Ali Farhang; Kevin Zhang; Vivek De

This paper addresses the stability problem of SRAM cells used in dense last level caches (LLCs). In order for the LLC not to limit the minimum voltage at which a processor core can run, a dual-VCC 256-Kb SRAM building block is proposed. A fixed high-voltage supply powers the cache, which allows the use of the smallest SRAM cell for maximum density, while a separate variable supply is used by the core for ultra-low-voltage operation using dynamic voltage and frequency (DVF). Implemented in a 65-nm bulk CMOS process, the block features low-overhead embedded level shifters and an actively clamped sleep transistor for maximum cache leakage power reduction during standby. Measured results show that the proposed block runs at 4.2 GHz while consuming 30 mW at 85°C and 1.2 V supply. Furthermore, measurements across a wide range of process, voltage, temperature, and aging conditions indicate virtual ground clamping accuracy within a few millivolts of required cache standby VMIN. Extrapolating the 256-Kb block measurement results to a large 64-Mb LLC used in a dual-VCC processor gives a 35% reduction in total processor power as compared with a single-VCC processor design running at a high supply voltage.

international solid-state circuits conference | 2010

Saurabh Dighe; Sriram R. Vangal; Paolo A. Aseron; Shasi Kumar; Tiju Jacob; Keith A. Bowman; Jason Howard; James W. Tschanz; Vasantha Erraguntla; Nitin Borkar; Vivek De; Shekhar Borkar

Many-core processors with on-die network-on-chip (NoC) interconnects have emerged as viable architectures for Single-Instruction/Multiple-Data (SIMD) vector applications and parallel workloads, and have been implemented in 65nm CMOS with Dynamic Voltage-Frequency Scaling (DVFS). Chips with Single-Voltage/Single-Frequency (SVSF) for all cores running homogeneous threads as well as Multiple-Voltage/Multiple-Frequency (MVMF), running heterogeneous applications and using independent V/F control for each core, have been reported. Combination of DVFS with dynamic core-count scaling (or DVFCS) has been proposed to further improve performance & energy efficiency across varying workloads. With technology scaling, both leakage power and core-to-core variations in frequency (Fmax) & leakage due to within-die device parameter variations have become significant, thus creating the need for per-core power gating and variation-aware DVFCS. Recently, variation-aware core mapping has been investigated using high level architectural simulations and statistical variation models.
