Is this you? Create Your Porfile

Henry Cook

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Henry Cook is active.

Explore More

Publication

Featured researches published by Henry Cook.

Nature | 2015

Single-chip microprocessor that communicates directly using light

Chen Sun; Mark T. Wade; Yunsup Lee; Jason S. Orcutt; Luca Alloatti; Michael Georgas; Andrew Waterman; Jeffrey M. Shainline; Rimas Avizienis; Sen Lin; Benjamin R. Moss; Rajesh Kumar; Fabio Pavanello; Amir H. Atabaki; Henry Cook; Albert J. Ou; Jonathan Leu; Yu-Hsin Chen; Krste Asanovic; Rajeev J. Ram; Miloš A. Popović; Vladimir Stojanovic

Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems—from mobile phones to large-scale data centres. These limitations can be overcome by using optical communications based on chip-scale electronic–photonic systems enabled by silicon-based nanophotonic devices8. However, combining electronics and photonics on the same chip has proved challenging, owing to microchip manufacturing conflicts between electronics and photonics. Consequently, current electronic–photonic chips are limited to niche manufacturing processes and include only a few optical devices alongside simple circuits. Here we report an electronic–photonic system on a single chip integrating over 70 million transistors and 850 photonic components that work together to provide logic, memory, and interconnect functions. This system is a realization of a microprocessor that uses on-chip photonic devices to directly communicate with other chips using light. To integrate electronics and photonics at the scale of a microprocessor chip, we adopt a ‘zero-change’ approach to the integration of photonics. Instead of developing a custom process to enable the fabrication of photonics, which would complicate or eliminate the possibility of integration with state-of-the-art transistors at large scale and at high yield, we design optical devices using a standard microelectronics foundry process that is used for modern microprocessors. This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.

ieee international conference on high performance computing data and analytics | 2011

CudaDMA: optimizing GPU memory bandwidth via warp specialization

Michael Bauer; Henry Cook; Brucek Khailany

As the computational power of GPUs continues to scale with Moores Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37× on representative synthetic micro-benchmarks, and 1.15×-3.2× on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.

design automation conference | 2010

RAMP gold: an FPGA-based architecture simulator for multiprocessors

Zhangxi Tan; Andrew Waterman; Rimas Avizienis; Yunsup Lee; Henry Cook; David A. Patterson; Krste Asanovica

We present RAMP Gold, an economical FPGA-based architecture simulator that allows rapid early design-space exploration of manycore systems. The RAMP Gold prototype is a high-throughput, cycle-accurate full-system simulator that runs on a single Xilinx Virtex-5 FPGA board, and which simulates a 64-core shared-memory target machine capable of booting real operating systems. To improve FPGA implementation efficiency, functionality and timing are modeled separately and host multithreading is used in both models. We evaluate the prototypes performance using a modern parallel benchmark suite running on our manycore research operating system, achieving two orders of magnitude speedup compared to a widely-used software-based architecture simulator.

international symposium on computer architecture | 2010

A case for FAME: FPGA architecture model execution

Zhangxi Tan; Andrew Waterman; Henry Cook; Sarah Bird; Krste Asanovic; David A. Patterson

Given the multicore microprocessor revolution, we argue that the architecture research community needs a dramatic increase in simulation capacity. We believe FPGA Architecture Model Execution (FAME) simulators can increase the number of useful architecture research experiments per day by two orders of magnitude over Software Architecture Model Execution (SAME) simulators. To clear up misconceptions about FPGA-based simulation methodologies, we propose a FAME taxonomy to distinguish the costperformance of variations on these ideas. We demonstrate our simulation speedup claim with a case study wherein we employ a prototype FAME simulator, RAMP Gold, to research the interaction between hardware partitioning mechanisms and operating system scheduling policy. The study demonstrates FAMEs capabilities: we run a modern parallel benchmark suite on a research operating system, simulate 64-core target architectures with multi-level memory hierarchy timing models, and add experimental hardware mechanisms to the target machine. The simulation speedup achieved by our adoption of FAME-250×-enables experiments with more realistic time scales and data set sizes thanare possible with SAME.

international symposium on computer architecture | 2013

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Henry Cook; Miquel Moreto; Sarah Bird; Khanh Dao; David A. Patterson; Krste Asanovic

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.

design automation conference | 2008

Predictive design space exploration using genetically programmed response surfaces

Henry Cook; Kevin Skadron

Exponential increases in architectural design complexity threaten to make traditional processor design optimization techniques intractable. Genetically programmed response surfaces (GPRS) address this challenge by transforming the optimization process from a lengthy series of detailed simulations into the tractable formulation and rapid evaluation of a predictive model. We validate GPRS methodology on realistic processor design spaces and compare it to recently proposed techniques for predictive microarchitectural design space exploration.

european solid state circuits conference | 2014

A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators

Yunsup Lee; Andrew Waterman; Rimas Avizienis; Henry Cook; Chen Sun; Vladimir Stojanovic; Krste Asanovic

A 64-bit dual-core RISC-V processor with vector accelerators has been fabricated in a 45nm SOI process. This is the first dual-core processor to implement the open-source RISC-V ISA designed at the University of California, Berkeley. In a standard 40nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARMs comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, we integrate a custom vector accelerator alongside each single-issue in-order scalar core. The vector accelerator is 1.8× more energy-efficient than the IBM Blue Gene/Q processor, and 2.6× more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves maximum clock frequency of 1.3GHz at 1.2V and peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65V with an area of 3mm2.

IEEE Micro | 2016

An Agile Approach to Building RISC-V Microprocessors

Yunsup Lee; Andrew Waterman; Henry Cook; Brian Zimmer; Ben Keller; Alberto Puggelli; Jaehwa Kwak; Ruzica Jevtic; Stevo Bailey; Milovan Blagojevic; Pi-Feng Chiu; Rimas Avizienis; Brian C. Richards; Jonathan Bachrach; David A. Patterson; Elad Alon; Bora Nikolic; Krste Asanovic

The final phase of CMOS technology scaling provides continued increases in already vast transistor counts, but only minimal improvements in energy efficiency, thus requiring innovation in circuits and architectures. However, even huge teams are struggling to complete large, complex designs on schedule using traditional rigid development flows. This article presents an agile hardware development methodology, which the authors adopted for 11 RISC-V microprocessor tape-outs on modern 28-nm and 45-nm CMOS processes in the past five years. The authors discuss how this approach enabled small teams to build energy-efficient, cost-effective, and industry-competitive high-performance microprocessors in a matter of months. Their agile methodology relies on rapid iterative improvement of fabricatable prototypes using hardware generators written in Chisel, a new hardware description language embedded in a modern programming language. The parameterized generators construct highly customized systems based on the free, open, and extensible RISC-V platform. The authors present a case study of one such prototype featuring a RISC-V vector microprocessor integrated with a switched-capacitor DC-DC converter alongside an adaptive clock generator in a 28-nm, fully depleted silicon-on-insulator process.

ieee automatic speech recognition and understanding workshop | 2011

Fast speaker diarization using a high-level scripting language

Ekaterina Gonina; Gerald Friedland; Henry Cook; Kurt Keutzer

Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50–250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.

ieee hot chips symposium | 2013