Publication


Featured research published by Dirk Hoenicke.


IBM Journal of Research and Development | 2005

Overview of the Blue Gene/L system architecture

Alan Gara; Matthias A. Blumrich; Dong Chen; George Liang-Tai Chiu; Paul W. Coteus; Mark E. Giampapa; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Thomas A. Liebsch; Martin Ohmacht; Burkhard Steinmacher-Burow; Todd E. Takken; Pavlos M. Vranas

The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
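
As a quick back-of-the-envelope check (not taken from the paper itself), the quoted system peak follows from the 5.6-gigaflops per-node figure reported in the compute-chip papers below:

    # Sanity check of the Blue Gene/L system peak, assuming the 5.6 Gflops/node
    # figure from the compute-chip papers (i.e., 2 cores x 700 MHz x 4 flops/cycle).
    nodes = 65_536                    # dual-processor compute nodes
    peak_per_node_gflops = 5.6
    system_peak_tflops = nodes * peak_per_node_gflops / 1000.0
    print(f"{system_peak_tflops:.0f} Tflops")   # ~367 Tflops, consistent with the ~360-teraflop peak quoted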


IBM Journal of Research and Development | 2005

Blue Gene/L advanced diagnostics environment

Mark E. Giampapa; Ralph Bellofatto; Matthias A. Blumrich; Dong Chen; Marc Boris Dombrowa; Alan Gara; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Ben J. Nathanson; Burkhard Steinmacher-Burow; Martin Ohmacht; Valentina Salapura; Pavlos M. Vranas

This paper describes the Blue Gene®/L advanced diagnostics environment (ADE) used throughout all aspects of the Blue Gene/L project, including design, logic verification, bring-up, diagnostics, and manufacturing test. The Blue Gene/L ADE consists of a lightweight multithreaded coherence-managed kernel, runtime libraries, device drivers, system programming interfaces, compilers, and host-based development tools. It provides complete and flexible access to all features of the Blue Gene/L hardware. Prior to the existence of hardware, ADE was used on Very High Speed Integrated Circuit Hardware Description Language (VHDL) models, not only for logic verification, but also for performance measurements, code-path analysis, and evaluation of architectural tradeoffs. During early hardware bring-up, the ability to run in a cycle-reproducible manner on both hardware and VHDL proved invaluable in fault isolation and analysis. However, ADE is also capable of supporting high-performance applications and parallel test cases, thereby permitting us to stress the hardware to the limits of its capabilities. This paper also provides insights into system-level and device-level programming of Blue Gene/L to assist developers of high-performance applications to more fully exploit the performance of the machine.


International Journal of Parallel Programming | 2007

The blue gene/L supercomputer: a hardware and software story

José E. Moreira; Valentina Salapura; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Edward Bergner; Randy Bickford; Matthias A. Blumrich; José R. Brunheroto; Arthur A. Bright; Michael Brian Brutman; José G. Castaños; Dong Chen; Paul W. Coteus; Paul G. Crumley; Sam Ellis; Thomas Eugene Engelsiepen; Alan Gara; Mark E. Giampapa; Tom Gooding; Shawn A. Hall; Ruud A. Haring; Roger L. Haskin; Philip Heidelberger; Dirk Hoenicke; Todd A. Inglett; Gerard V. Kopcsay; Derek Lieber; David Roy Limpert; Patrick Joseph McCarthy

The Blue Gene/L system at the Department of Energy Lawrence Livermore National Laboratory in Livermore, California is the world’s most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks and real scientific applications. In the process, it has enabled new science that simply could not be done before. Blue Gene/L was developed by a relatively small team of dedicated scientists and engineers. This article is both a description of the Blue Gene/L supercomputer and an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.


Computing Frontiers | 2005

Power and performance optimization at the system level

Valentina Salapura; Randy Bickford; Matthias A. Blumrich; Arthur A. Bright; Dong Chen; Paul W. Coteus; Alan Gara; Mark E. Giampapa; Michael Karl Gschwind; Manish Gupta; Shawn A. Hall; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Martin Ohmacht; Rick A. Rand; Todd E. Takken; Pavlos M. Vranas

The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve this goal, emphasis was put on system solutions to engineer a power-efficient system. To exploit thread-level parallelism, the BlueGene/L system can scale to 64 racks with a total of 65,536 compute nodes, each consisting of a single compute ASIC that integrates all system functions with two industry-standard PowerPC microprocessor cores in a chip-multiprocessor configuration. Each PowerPC processor exploits data-level parallelism with a high-performance SIMD floating-point unit. To support good application scaling on such a massive system, special emphasis was put on efficient communication primitives by including five highly optimized communication networks. After an initial introduction of the BlueGene/L system architecture, we analyze power/performance efficiency for the BlueGene system using performance and power characteristics for the overall system performance (as exemplified by peak performance numbers). To understand application scaling behavior and its impact on performance and power/performance efficiency, we analyze the NAMD molecular dynamics package using the ApoA1 benchmark. We find that even for strong-scaling problems, BlueGene/L systems can deliver superior performance scaling and significant power/performance efficiency. Application benchmark power/performance scaling for the voltage-invariant energy-delay-squared (ED²) metric demonstrates that choosing a power-efficient 700 MHz embedded PowerPC processor core and relying on application parallelism was the right decision for building a powerful and power/performance-efficient system.
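
As an aside on the metric used above, here is a minimal numerical sketch of why energy-delay-squared (ED²) is considered voltage-invariant. It assumes the usual first-order CMOS scaling relations (energy proportional to V², frequency roughly proportional to V) rather than anything taken from the paper:

    # Under first-order CMOS scaling, dynamic energy per operation grows as V^2
    # while clock frequency grows roughly as V, so delay shrinks as 1/V.
    # The product E * D^2 is therefore unchanged by the chosen voltage/frequency point,
    # which lets designs at different operating points be compared fairly.
    for v in (0.8, 1.0, 1.2):              # relative supply voltages
        energy = v ** 2                    # E ~ V^2 (arbitrary units)
        delay = 1.0 / v                    # D ~ 1/f ~ 1/V
        print(v, round(energy * delay ** 2, 6))   # prints 1.0 for every voltage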


IBM Journal of Research and Development | 2005

Blue Gene/L compute chip: memory and Ethernet subsystem

Martin Ohmacht; Reinaldo A. Bergamaschi; Subhrajit Bhattacharya; Alan Gara; Mark E. Giampapa; Balaji Gopalsamy; Ruud A. Haring; Dirk Hoenicke; David John Krolak; James A. Marcella; Ben J. Nathanson; Valentina Salapura; Michael E. Wazlowski

The Blue Gene®/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 (L2 and L3) to reduce memory access time. A Gigabit Ethernet interface driven by direct memory access (DMA) is integrated in the cache hierarchy, requiring only an external physical link layer chip to connect to the media. The integrated L3 cache stores a total of 4 MB of data, using multibank embedded dynamic random access memory (DRAM). The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine. To reduce hardware overhead due to cache coherence intervention requests, memory coherence is maintained by software. This is particularly efficient for regular highly parallel applications with partitionable working sets. The system further integrates an on-chip double-data-rate (DDR) DRAM controller for direct attachment of main memory modules to optimize overall memory performance and cost. For booting the system and low-latency interprocessor communication and synchronization, a 16-KB static random access memory (SRAM) and hardware locks have been added to the design.
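
A small arithmetic note, derived only from the figures quoted above rather than from additional detail in the paper: the 1,024-bit port moves 128 bytes per access, so 22.4 GB/s corresponds to roughly 175 million eDRAM accesses per second.

    # Bandwidth arithmetic from the figures quoted in the abstract (illustrative only).
    port_bytes = 1024 // 8                          # 1,024-bit eDRAM port = 128 bytes per access
    bandwidth_bytes_per_s = 22.4e9
    print(f"{bandwidth_bytes_per_s / port_bytes / 1e6:.0f} M accesses/s")   # ~175 M accesses/s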


IBM Journal of Research and Development | 2005

Blue Gene/L compute chip: control, test, and bring-up infrastructure

Ruud A. Haring; Ralph Bellofatto; Arthur A. Bright; Paul G. Crumley; Marc Boris Dombrowa; Steve M. Douskey; Matthew R. Ellavsky; Balaji Gopalsamy; Dirk Hoenicke; Thomas A. Liebsch; James A. Marcella; Martin Ohmacht

The Blue Gene®/L compute (BLC) and Blue Gene/L link (BLL) chips have extensive facilities for control, bring-up, self-test, debug, and nonintrusive performance monitoring built on a serial interface compliant with IEEE Standard 1149.1. Both the BLL and the BLC chips contain a standard eServer™ chip JTAG controller called the access macro. For BLC, the capabilities of the access macro were extended 1) to accommodate the secondary JTAG controllers built into embedded PowerPC® cores; 2) to provide direct access to memory for initial boot code load and for messaging between the service node and the BLC chip; 3) to provide nonintrusive access to device control registers; and 4) to provide a suite of chip configuration and control registers. The BLC clock tree structure is described. It accommodates both functional requirements and requirements for enabling multiple built-in self-test domains, differentiated both by frequency and functionality. The chip features a debug port that allows observation of critical chip signals at full speed.


Symposium on Computer Architecture and High Performance Computing | 2005

Data cache prefetching design space exploration for BlueGene/L supercomputer

José R. Brunheroto; Valentina Salapura; Fernando F. Redigolo; Dirk Hoenicke; Alan Gara

Scientific applications exhibit good spatial and temporal data memory access locality. It is possible to hide the latency of the level 3 (L3) cache, and to reduce contention between multiple cores sharing a single L3 cache, by using a prefetch cache that identifies data streams which can be profitably prefetched and decouples the cache-line-size mismatch between the L3 cache and the level 1 (L1) data cache. In this work, a design space exploration is presented that helped shape the design of the BlueGene/L supercomputer memory subsystem. The prefetch cache consists of a small number of 128-byte line buffers that speculatively prefetch data from the L3 cache. Since applications present largely sequential access patterns, this prefetching scheme increases the likelihood that a request from the L1 data cache is already present in the prefetch cache. Since most compute-intensive applications contain only a few data streams, a small number of line buffers is sufficient for the prefetch cache to track and detect them. This paper focuses on the evaluation of stream-detection mechanisms and the influence of varying the replacement policies for stream prefetch caches.
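
A minimal sketch of the kind of sequential stream detection and small-buffer replacement the paper explores (illustrative only; the line size, number of buffers, FIFO replacement policy, and all names below are assumptions, not the parameters actually evaluated for BlueGene/L):

    # Toy model of a small stream-detecting prefetch cache sitting in front of a large-line L3.
    LINE_BYTES = 128                     # assumed L3 line size
    NUM_BUFFERS = 4                      # assumed number of prefetch line buffers

    class PrefetchCache:
        def __init__(self):
            self.buffers = {}            # line number -> prefetched line (payload omitted)
            self.order = []              # fill order, for simple FIFO replacement
            self.last_miss_line = None   # used to detect sequential access patterns

        def _fill(self, line):
            if line in self.buffers:
                return
            if len(self.buffers) >= NUM_BUFFERS:      # evict the oldest buffer
                self.buffers.pop(self.order.pop(0))
            self.buffers[line] = object()
            self.order.append(line)

        def access(self, addr):
            line = addr // LINE_BYTES
            if line in self.buffers:
                self._fill(line + 1)     # hit: keep the detected stream running one line ahead
                return True
            # Miss: two consecutive misses to adjacent lines are treated as a new stream,
            # so the next line is speculatively prefetched from L3.
            if self.last_miss_line is not None and line == self.last_miss_line + 1:
                self._fill(line + 1)
            self.last_miss_line = line
            return False

    cache = PrefetchCache()
    results = [cache.access(a) for a in range(0, 8 * LINE_BYTES, 32)]   # sequential 32-byte-stride sweep
    print(f"{sum(results)} hits out of {len(results)} accesses")        # stream detected after the first two lines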


IBM Journal of Research and Development | 2005

Verification strategy for the Blue Gene/L chip

Michael E. Wazlowski; Narasimha R. Adiga; Daniel K. Beece; Ralph Bellofatto; Matthias A. Blumrich; Dong Chen; Marc Boris Dombrowa; Alan Gara; Mark E. Giampapa; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Ben J. Nathanson; Martin Ohmacht; R. Sharrar; Sarabjeet Singh; Burkhard Steinmacher-Burow; Robert B. Tremaine; Mickey Tsao; A. R. Umamaheshwaran; Pavlos M. Vranas

The Blue Gene®/L compute chip contains two PowerPC® 440 processor cores, private L2 prefetch caches, a shared L3 cache and double-data-rate synchronous dynamic random access memory (DDR SDRAM) memory controller, a collective network interface, a torus network interface, a physical network interface, an interrupt controller, and a bridge interface to slower devices. System-on-a-chip verification problems require a multilevel verification strategy in which the strengths of each layer offset the weaknesses of another layer. The verification strategy we adopted relies on the combined strengths of random simulation, directed simulation, and code-driven simulation at the unit and system levels. The strengths and weaknesses of the various techniques and our reasons for choosing them are discussed. The verification platform is based on event simulation and cycle simulation running on a farm of Intel-processor-based machines, several PowerPC-processor-based machines, and the internally developed hardware accelerator Awan. The cost/performance tradeoffs of the different platforms are analyzed. The success of the first Blue Gene/L nodes, which worked within days of receiving them and had only a small number of undetected bugs (none fatal), reflects both careful design and a comprehensive verification strategy.
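
For readers unfamiliar with the terminology, here is a toy illustration of what unit-level random simulation against a reference model looks like (the adder below is a hypothetical stand-in, not Blue Gene/L logic, and this sketch of the methodology is mine rather than the paper's):

    # Minimal sketch of random simulation: drive random stimulus into the unit under
    # test and compare every result against an independent golden model.
    import random

    def dut_add(a, b):
        return (a + b) & 0xFFFFFFFF        # stand-in for the design under test
    def reference_add(a, b):
        return (a + b) % (1 << 32)         # independent reference model

    random.seed(1)
    for _ in range(10_000):                # random stimulus, checked on every operation
        a, b = random.getrandbits(32), random.getrandbits(32)
        assert dut_add(a, b) == reference_add(a, b), f"mismatch for {a:#x}, {b:#x}"
    print("random regression passed")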


IBM Journal of Research and Development | 2005

Blue Gene/L compute chip: synthesis, timing, and physical design

Arthur A. Bright; Ruud A. Haring; Marc Boris Dombrowa; Martin Ohmacht; Dirk Hoenicke; Sarabjeet Singh; James A. Marcella; Robert F. Lembach; Steve M. Douskey; Matthew R. Ellavsky; Christian G. Zoellin; Alan Gara

As one of the most highly integrated system-on-a-chip application-specific integrated circuits (ASICs) to date, the Blue Gene®/L compute chip presented unique challenges that required extensions of the standard ASIC synthesis, timing, and physical design methodologies. We describe the design flow from floorplanning through synthesis and timing closure to physical design, with emphasis on the novel features of this ASIC. Among these are a process to easily inject datapath placements for speed-critical circuits or to relieve wire congestion, and a timing closure methodology that resulted in timing closure for both nominal and worst-case timing specifications. The physical design methodology featured removal of the pre-physical-design buffering to improve routability and visualization of buses, and it featured strategic seeding of buffers to close wiring and timing and end up at 90% utilization of total chip area. Robustness was enhanced by using additional input/output (I/O) and internal decoupling capacitors and by increasing I/O-to-C4 wire widths.


Symposium on Computer Architecture and High Performance Computing | 2004

The eDRAM based L3-cache of the BlueGene/L supercomputer processor node

Martin Ohmacht; Dirk Hoenicke; Ruud A. Haring; Alan Gara

BlueGene/L is a supercomputer consisting of 64K dual-processor system-on-a-chip compute nodes, capable of delivering an arithmetic peak performance of 5.6 Gflops per node. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy for each node. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 to reduce memory access time. The integrated L3-cache stores a total of 4 MB of data, using multibank embedded DRAM. The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s of bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine.
