Daniel Alan Brokenshire

Publication

Featured research published by Daniel Alan Brokenshire.

IEEE Journal of Solid-State Circuits | 2006

The microarchitecture of the synergistic processor for a cell processor

Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; Hwa-Joon Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; Vandung To; E. Iwata

This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI-low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing for fine grain clock control to reduce power.

IBM Journal of Research and Development | 2007

Introduction to the cell broadband engine architecture

Charles Ray Johns; Daniel Alan Brokenshire

This paper provides an overview of the Cell Broadband Engine™ Architecture (CBEA). The CBEA defines a revolutionary extension to a more conventional processor organization and serves as the basis for the development of microprocessors targeted at the computer entertainment, multimedia, and real-time market segments. In this paper, the organization of the architecture is described, as well as the instruction set, commands, and facilities defined in the architecture. In many cases, the motivation for these facilities is explained and examples are provided to illustrate their intended use. In addition, this paper introduces the Software Development Kit and the software standards for a CBEA-compliant processor.

IBM Journal of Research and Development | 2009

Programming the Linpack benchmark for Roadrunner

Michael Kistler; John A. Gunnels; Daniel Alan Brokenshire; Brad Benton

We describe the challenges and opportunities we encountered when developing a hybrid version of the Linpack benchmark for the Los Alamos National Laboratory Roadrunner supercomputing system, which combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The challenges included determining the proper division of the host and accelerator roles in the computation, transfer of data between the host and accelerator memory domains, alignment of data for communication and computation, and data format differences between the two processors. We also describe our approach to modeling the performance of the hybrid system and compare our performance estimates to witnessed performance on the system at different scales and levels of memory consumption. Through careful attention to these issues, we have produced a hybrid version of the Linpack benchmark for the Roadrunner system that achieves 77.8% of peak performance on a single compute node and 74.6% of peak performance over the entire system, making this system the first to achieve a Linpack result exceeding one petaflops (10^15 floating-point operations per second).

IBM Journal of Research and Development | 2007

Microarchitecture and implementation of the synergistic processor in 65-nm and 90-nm SOI

Brian Flachs; S. Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Moonseuk Kim; T. N. Le; P. Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; H.-J. Oh; Stefan Mueller; Osamu Takahashi; K. Hirairi; A. Kawasumii; H. Murakami; H. Noro; S. Onishi; J. Pille; J. Silberman; S. Yong; A. Hatakeyama; Y. Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; V. To; Eiji Iwata

This paper describes the architecture and implementation of the original gaming-oriented synergistic processor element (SPE) in both 90-nm and 65-nm silicon-on-insulator (SOI) technology and introduces a new SPE implementation targeted for the high-performance computing community. The Cell Broadband Engine™ processor contains eight SPEs. The dual-issue, four-way single-instruction multiple-data processor is designed to achieve high performance per area and power and is optimized to process streaming data, simulate physical phenomena, and render objects digitally. Most aspects of data movement and instruction flow are controlled by software to improve the performance of the memory system and the core performance density. The SPE was designed as an 11-FO4 (fan-out-of-4-inverter-delay) processor using 20.9 million transistors within 14.8 mm^2 using the IBM 90-nm SOI low-k process. CMOS (complementary metal-oxide semiconductor) static gates implement the majority of the logic. Dynamic circuits are used in critical areas and occupy 19% of the non-static random access memory (SRAM) area. Instruction set architecture, microarchitecture, and physical implementation are tightly coupled to achieve a compact and power-efficient design. Correct operation has been observed at up to 5.6 GHz and 7.3 GHz, respectively, in 90-nm and 65-nm SOI technology.

IEEE International Conference on High Performance Computing Data and Analytics | 2009

Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Michael Kistler; John A. Gunnels; Daniel Alan Brokenshire; Brad Benton

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.

Asia and South Pacific Design Automation Conference | 2006

An SPU reference model for simulation, random test generation and verification

Yukio Watanabe; Balazs Sallay; Brad W. Michael; Daniel Alan Brokenshire; Gavin B. Meil; Hazim Shafi; Daisuke Hiraoka

An instruction set level reference model was developed for the development of the synergistic processing unit (SPU), which is one of the key components of the cell processor [Pham, 2005][Flachs, 2005]. This reference model was used for the simulators to define the instruction set architecture (ISA), for the random test case generator, for the reference in the verification environment and for the software development. Using the same reference model for multiple purposes made it easier to keep up with the architecture changes at the early stage of the microprocessor development. Also, including the reference model in the simulation environment increased the robustness of the random test executions and made it possible to find bugs that are usually difficult to catch.

International Symposium on Performance Analysis of Systems and Software | 2011

Detecting race conditions in asynchronous DMA operations with full system simulation

Michael Kistler; Daniel Alan Brokenshire

In this paper, we describe a technique for detecting race conditions between direct memory access (DMA) operations and load/store instructions using a full system simulator. Our approach uses event monitoring features of a full system simulator to monitor DMA operations and the memory areas they access and detect conflicting accesses that could represent race conditions. Our race condition checker tracks DMA operations from the time they are issued until they are architecturally guaranteed to be complete, rather than simply tracking when they actually complete, and thus detects race conditions in programs even when the actual data accesses do not occur out of order. This feature is valuable because the mechanisms for ensuring ordering of asynchronous DMA operations are complex and often poorly understood by application programmers. These DMA operations may conflict with each other or with loads and stores performed by the processor that initiated the operations, creating ample opportunity for race conditions to occur. We describe our race condition checker in detail and show how it can be used to easily detect race conditions in DMA operations initiated by special purpose cores.
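
The idea in this abstract, treating a DMA operation as "in flight" from issue until the program has synchronized on it, can be illustrated with a minimal sketch. This is not the authors' checker: the class names, the event methods (`dma_issued`, `tag_wait_completed`, `cpu_access`), and the tag-based completion model are all hypothetical and assumed only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DmaOp:
    """A pending DMA transfer touching [start, start + size) in one memory domain."""
    tag: int          # hypothetical completion tag the program later waits on
    start: int
    size: int
    is_write: bool    # True if the DMA writes into this memory range

    def overlaps(self, addr: int, size: int) -> bool:
        return self.start < addr + size and addr < self.start + self.size


class RaceChecker:
    """Tracks DMA ops from issue until the program synchronizes on their tag."""

    def __init__(self) -> None:
        self.pending: list[DmaOp] = []
        self.races: list[tuple] = []

    def dma_issued(self, op: DmaOp) -> None:
        # Two overlapping in-flight DMAs conflict if at least one of them writes.
        for other in self.pending:
            if other.overlaps(op.start, op.size) and (other.is_write or op.is_write):
                self.races.append(("dma-dma", other, op))
        self.pending.append(op)

    def tag_wait_completed(self, tag: int) -> None:
        # Only an explicit wait makes completion guaranteed; retire those ops.
        self.pending = [op for op in self.pending if op.tag != tag]

    def cpu_access(self, addr: int, size: int, is_store: bool) -> None:
        # A load/store conflicts with an in-flight DMA if either side writes.
        for op in self.pending:
            if op.overlaps(addr, size) and (op.is_write or is_store):
                self.races.append(("cpu-dma", op, (addr, size, is_store)))
```

In this sketch, a load from a buffer that an in-flight DMA is still writing is reported even if the data happened to arrive in time, because the operation is retired only when `tag_wait_completed` is called, not when the transfer actually finishes.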

Archive | 2002