Ruud A. Haring | Researchain

Publication

Featured research published by Ruud A. Haring.

IBM Journal of Research and Development | 2005

Overview of the Blue Gene/L system architecture

Alan Gara; Matthias A. Blumrich; Dong Chen; George Liang-Tai Chiu; Paul W. Coteus; Mark E. Giampapa; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Thomas A. Liebsch; Martin Ohmacht; Burkhard Steinmacher-Burow; Todd E. Takken; Pavlos M. Vranas

The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.

International Symposium on Microarchitecture | 2012

The IBM Blue Gene/Q Compute Chip

Ruud A. Haring; Martin Ohmacht; Thomas W. Fox; Michael Karl Gschwind; David L. Satterfield; Krishnan Sugavanam; Paul W. Coteus; Philip Heidelberger; Matthias A. Blumrich; Robert W. Wisniewski; Alan Gara; George Liang-Tai Chiu; Peter A. Boyle; Norman H. Christ; Changhoan Kim

Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space-efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1998

JiffyTune: circuit optimization using time-domain sensitivities

Andrew R. Conn; Paula Kristine Coulman; Ruud A. Haring; Gregory L. Morrill; Chandramouli Visweswariah; Chai Wah Wu

Automating the transistor and wire-sizing process is an important step toward being able to rapidly design high-performance, custom circuits. This paper presents a circuit optimization tool that automates the tuning task by means of state-of-the-art nonlinear optimization. It makes use of a fast circuit simulator and a general-purpose nonlinear optimization package. It includes minimax and power optimization, simultaneous transistor and wire tuning, general choices of objective functions and constraints, and recovery from nonworking circuits. In addition, the tool makes use of designer-friendly interfaces that automate the specification of the optimization task, the running of the optimizer, and the back-annotation of the results of optimization onto the circuit schematic. Particularly for large circuits, gradient computation is usually the bottleneck in the optimization procedure. In addition to traditional adjoint and direct methods, we use a technique called the adjoint Lagrangian method, which computes all the gradients necessary for one iteration of optimization in a single adjoint analysis. This paper describes the algorithms and the environment in which they are used and presents extensive circuit optimization results. A circuit with 6900 transistors, 4128 tunable transistors, and 60 independent parameters was optimized in about 108 min of CPU time on an IBM RISC/System 6000, model 590.

International Conference on Computer Aided Design | 1998

Noise considerations in circuit optimization

Andrew R. Conn; Ruud A. Haring; Chandramouli Visweswariah

Noise can cause digital circuits to switch incorrectly and thus produce spurious results. Noise can also have adverse power, timing and reliability effects. Dynamic logic is particularly susceptible to charge-sharing and coupling noise. Thus, the design and optimization of a circuit should take noise considerations into account. Such considerations are typically stated as semi-infinite constraints. In addition, the number of signals to be checked and the number of sub-intervals of time during which the checking must be performed can potentially be very large. Thus, the practical incorporation of noise constraints during circuit optimization is a hitherto unsolved problem. This paper describes a novel method for incorporating noise considerations during automatic circuit optimization. Semi-infinite constraints representing noise considerations are first converted to ordinary equality constraints involving time integrals, which are readily computed in the context of circuit optimization based on time-domain simulation. Next, the gradients of these integrals are computed by the adjoint method. By using an augmented Lagrangian optimization merit function, the adjoint method is applied to compute all the necessary gradients required for optimization in a single adjoint analysis, no matter how many noise measurements are considered, and irrespective of the dimensionality of the problem. Numerical results are presented.
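The conversion of a semi-infinite noise constraint into a time integral can be illustrated with a small sketch. It assumes one common reformulation, which may differ in detail from the paper's: the requirement v(t) <= margin for every t in an interval holds exactly when the integral of max(0, v(t) - margin)^2 over that interval is zero. The function name, the sampled waveforms, and the trapezoidal rule are all hypothetical choices made for illustration.

```python
def noise_violation_integral(times, volts, margin):
    """Trapezoidal estimate of the integral of max(0, v(t) - margin)^2 dt.

    A value of zero means the sampled waveform never exceeds the noise
    margin, so the single equality "integral == 0" stands in for one
    inequality per time point.
    """
    excess_sq = [max(0.0, v - margin) ** 2 for v in volts]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (excess_sq[k] + excess_sq[k - 1]) * dt
    return total


# Hypothetical sampled waveforms on a quiet net (volts vs. time units).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
small_bump = [0.0, 0.1, 0.2, 0.1, 0.0]   # stays below the margin
large_bump = [0.0, 0.1, 0.5, 0.1, 0.0]   # exceeds the margin at t = 2

ok = noise_violation_integral(times, small_bump, margin=0.3)
bad = noise_violation_integral(times, large_bump, margin=0.3)
```

Because the integrand is smooth where it leaves zero, an integral of this kind is differentiable with respect to the tunable parameters, which is what lets a gradient-based optimizer use it.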

IBM Journal of Research and Development | 2005

Blue Gene/L advanced diagnostics environment

Mark E. Giampapa; Ralph Bellofatto; Matthias A. Blumrich; Dong Chen; Marc Boris Dombrowa; Alan Gara; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Ben J. Nathanson; Burkhard Steinmacher-Burow; Martin Ohmacht; Valentina Salapura; Pavlos M. Vranas

This paper describes the Blue Gene®/L advanced diagnostics environment (ADE) used throughout all aspects of the Blue Gene/L project, including design, logic verification, bring-up, diagnostics, and manufacturing test. The Blue Gene/L ADE consists of a lightweight multithreaded coherence-managed kernel, runtime libraries, device drivers, system programming interfaces, compilers, and host-based development tools. It provides complete and flexible access to all features of the Blue Gene/L hardware. Prior to the existence of hardware, ADE was used on Very high-speed integrated circuit Hardware Description Language (VHDL) models, not only for logic verification, but also for performance measurements, code-path analysis, and evaluation of architectural tradeoffs. During early hardware bring-up, the ability to run in a cycle-reproducible manner on both hardware and VHDL proved invaluable in fault isolation and analysis. However, ADE is also capable of supporting high-performance applications and parallel test cases, thereby permitting us to stress the hardware to the limits of its capabilities. This paper also provides insights into system-level and device-level programming of Blue Gene/L to assist developers of high-performance applications to more fully exploit the performance of the machine.

International Journal of Parallel Programming | 2007

The Blue Gene/L supercomputer: a hardware and software story

José E. Moreira; Valentina Salapura; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Edward Bergner; Randy Bickford; Matthias A. Blumrich; José R. Brunheroto; Arthur A. Bright; Michael Brian Brutman; José G. Castaños; Dong Chen; Paul W. Coteus; Paul G. Crumley; Sam Ellis; Thomas Eugene Engelsiepen; Alan Gara; Mark E. Giampapa; Tom Gooding; Shawn A. Hall; Ruud A. Haring; Roger L. Haskin; Philip Heidelberger; Dirk Hoenicke; Todd A. Inglett; Gerard V. Kopcsay; Derek Lieber; David Roy Limpert; Patrick Joseph McCarthy

The Blue Gene/L system at the Department of Energy Lawrence Livermore National Laboratory in Livermore, California is the world’s most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks and real scientific applications. In that process, it has enabled new science that simply could not be done before. Blue Gene/L was developed by a relatively small team of dedicated scientists and engineers. This article is both a description of the Blue Gene/L supercomputer and an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.

Computing Frontiers | 2005

Power and performance optimization at the system level

Valentina Salapura; Randy Bickford; Matthias A. Blumrich; Arthur A. Bright; Dong Chen; Paul W. Coteus; Alan Gara; Mark E. Giampapa; Michael Karl Gschwind; Manish Gupta; Shawn A. Hall; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Martin Ohmacht; Rick A. Rand; Todd E. Takken; Pavlos M. Vranas

The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve this goal, emphasis was put on system solutions to engineer a power-efficient system. To exploit thread-level parallelism, the BlueGene/L system can scale to 64 racks with a total of 65,536 compute nodes, each consisting of a single compute ASIC that integrates all system functions with two industry-standard PowerPC microprocessor cores in a chip multiprocessor configuration. Each PowerPC processor exploits data-level parallelism with a high-performance SIMD floating-point unit. To support good application scaling on such a massive system, special emphasis was put on efficient communication primitives by including five highly optimized communication networks. After an initial introduction of the BlueGene/L system architecture, we analyze power/performance efficiency for the BlueGene system using performance and power characteristics for the overall system performance (as exemplified by peak performance numbers). To understand application scaling behavior, and its impact on performance and power/performance efficiency, we analyze the NAMD molecular dynamics package using the ApoA1 benchmark. We find that even for strong scaling problems, BlueGene/L systems can deliver superior performance scaling and deliver significant power/performance efficiency. Application benchmark power/performance scaling for the voltage-invariant energy·delay² power/performance metric demonstrates that choosing a power-efficient 700 MHz embedded PowerPC processor core and relying on application parallelism was the right decision to build a powerful and power/performance-efficient system.
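Why an energy-delay-squared metric can be called voltage-invariant is easiest to see under a simplified first-order model, sketched here. The model is an assumption and is not taken from the abstract: switching energy scales with the square of the supply voltage, and clock frequency scales linearly with it. All constants and the function name are hypothetical.

```python
def energy_delay_squared(voltage, work_cycles=1e9, c_eff=1e-9, k_freq=1e9):
    """E * D^2 for a fixed amount of work under a first-order scaling model.

    Assumed model (illustrative only):
      frequency = k_freq * voltage            (Hz)
      delay     = work_cycles / frequency     (s)
      energy    = c_eff * voltage^2 * cycles  (J)
    The voltage terms cancel, leaving c_eff * work_cycles^3 / k_freq^2.
    """
    frequency = k_freq * voltage
    delay = work_cycles / frequency
    energy = c_eff * voltage ** 2 * work_cycles
    return energy * delay ** 2


# The metric comes out the same at different supply voltages, so under
# this model it compares designs without rewarding a choice of voltage.
ed2_low = energy_delay_squared(0.8)
ed2_high = energy_delay_squared(1.2)
```

Under the same model, plain energy or plain delay would each favor one end of the voltage range, which is the usual motivation for a combined metric of this form.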

IBM Journal of Research and Development | 2005

Blue Gene/L compute chip: memory and Ethernet subsystem

Martin Ohmacht; Reinaldo A. Bergamaschi; Subhrajit Bhattacharya; Alan Gara; Mark E. Giampapa; Balaji Gopalsamy; Ruud A. Haring; Dirk Hoenicke; David John Krolak; James A. Marcella; Ben J. Nathanson; Valentina Salapura; Michael E. Wazlowski

The Blue Gene®/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 (L2 and L3) to reduce memory access time. A Gigabit Ethernet interface driven by direct memory access (DMA) is integrated in the cache hierarchy, requiring only an external physical link layer chip to connect to the media. The integrated L3 cache stores a total of 4 MB of data, using multibank embedded dynamic random access memory (DRAM). The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine. To reduce hardware overhead due to cache coherence intervention requests, memory coherence is maintained by software. This is particularly efficient for regular highly parallel applications with partitionable working sets. The system further integrates an on-chip double-data-rate (DDR) DRAM controller for direct attachment of main memory modules to optimize overall memory performance and cost. For booting the system and low-latency interprocessor communication and synchronization, a 16-KB static random access memory (SRAM) and hardware locks have been added to the design.
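The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below assumes a 700 MHz core clock (the figure quoted in the "Power and performance optimization at the system level" abstract on this page) and four floating-point operations per core per cycle, that is, a two-way SIMD fused multiply-add; the second number is an assumption not stated in these abstracts. The implied transfer rate of the embedded DRAM port is only an arithmetic consequence of the quoted width and bandwidth, not a statement about how the port is actually clocked.

```python
CORE_CLOCK_HZ = 700e6     # assumption: 700 MHz embedded PowerPC core
FLOPS_PER_CYCLE = 4       # assumption: 2-way SIMD fused multiply-add
CORES_PER_CHIP = 2        # dual-processor chip, per the abstract
NODES = 65_536            # full-system node count quoted on this page


def chip_peak_gflops():
    """Peak arithmetic rate of one compute chip in gigaflops."""
    return CORES_PER_CHIP * CORE_CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


def system_peak_tflops():
    """Peak arithmetic rate of the full system in teraflops."""
    return chip_peak_gflops() * NODES / 1e3


def implied_port_rate_mhz(bandwidth_gb_s=22.4, port_bits=1024):
    """Transfers per second (in millions) implied by width and bandwidth."""
    bytes_per_transfer = port_bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_transfer / 1e6


chip = chip_peak_gflops()            # about 5.6 gigaflops
system = system_peak_tflops()        # about 367 teraflops
port_rate = implied_port_rate_mhz()  # about 175 million transfers/s
```

With these assumptions the per-chip figure matches the quoted 5.6 gigaflops, and the system total of roughly 367 teraflops is consistent with the 360 teraflops quoted for the full machine.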

International Solid-State Circuits Conference | 2005

Creating the BlueGene/L supercomputer from low-power SoC ASICs

Arthur A. Bright; Matthew R. Ellavsky; Alan Gara; Ruud A. Haring; Gerard V. Kopcsay; Robert F. Lembach; James A. Marcella; Martin Ohmacht; Valentina Salapura

An overview of the design aspects of the BlueGene/L chip, the heart of the BlueGene/L supercomputer, is presented. Following an SoC approach, processors, memory and communication subsystems are integrated into one low-power chip. The high-density system packaging of the BlueGene/L system provides better power and cost performance.

International Conference on Computer Aided Design | 1997

Circuit optimization via adjoint Lagrangians

Andrew R. Conn; Ruud A. Haring; Chandramouli Visweswariah; Chai Wah Wu

The circuit tuning problem is best approached by means of gradient-based nonlinear optimization algorithms. For large circuits, gradient computation can be the bottleneck in the optimization procedure. Traditionally, when the number of measurements is large relative to the number of tunable parameters, the direct method is used to repeatedly solve the associated sensitivity circuit to obtain all the necessary gradients. Likewise, when the parameters outnumber the measurements, the adjoint method is employed to solve the adjoint circuit repeatedly for each measurement to compute the sensitivities. In this paper, we propose the adjoint Lagrangian method, which computes all the gradients necessary for augmented-Lagrangian-based optimization in a single adjoint analysis. After the nominal simulation of the circuit has been carried out, the gradients of the merit function are expressed as the gradients of a weighted sum of circuit measurements. The weights are dependent on the nominal solution and on optimizer quantities such as Lagrange multipliers. By suitably choosing the excitations of the adjoint circuit, the gradients of the merit function are computed via a single adjoint analysis, irrespective of the number of measurements and the number of parameters of the optimization. This procedure requires close integration between the nonlinear optimization software and the circuit simulation program. The adjoint Lagrangian formulation has been implemented in the JiffyTune tool which optimizes delay, area, slew (transition time) and power measurements by adjusting transistor widths and wire sizes. Speedups of over 35x have been realized in the gradient computation procedure by using the adjoint Lagrangian formulation, leading to speedups of up to 2.5x in the overall optimization procedure. Perhaps more importantly, these speedups have rendered feasible the tuning of large circuits. A circuit with 6,900 transistors was optimized in under two hours of CPU time.
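The idea of obtaining the merit-function gradient from a single adjoint solve can be sketched on a toy problem. Everything below is a hypothetical illustration and not JiffyTune's formulation: the "circuit" is a 2x2 linear system G(x) u = b rather than a time-domain simulation, the measurements are linear in u, and the merit function is assumed to be the sum over measurements of lambda_i * m_i + m_i^2 / (2 * mu).

```python
def solve2(A, r):
    """Solve a 2x2 linear system A z = r by Cramer's rule."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def G(x):
    # Two tunable parameters on the diagonal, fixed coupling of -1.
    return [[x[0] + 1.0, -1.0], [-1.0, x[1] + 1.0]]


B = [1.0, 0.0]                                  # fixed excitation
MEAS = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]    # m_i = MEAS[i] . u
LAM = [0.5, -0.2, 0.1]                          # multiplier estimates
MU = 0.3                                        # penalty parameter


def measurements(x):
    u = solve2(G(x), B)
    return u, [g[0] * u[0] + g[1] * u[1] for g in MEAS]


def merit(x):
    _, m = measurements(x)
    return sum(l * mi + mi * mi / (2 * MU) for l, mi in zip(LAM, m))


def merit_gradient_adjoint(x):
    # 1. Nominal simulation.
    u, m = measurements(x)
    # 2. Weights from the nominal solution and optimizer quantities.
    w = [l + mi / MU for l, mi in zip(LAM, m)]
    # 3. One adjoint solve, excited by the weighted sum of measurements.
    rhs = [sum(wi * g[k] for wi, g in zip(w, MEAS)) for k in range(2)]
    Gx = G(x)
    Gt = [[Gx[0][0], Gx[1][0]], [Gx[0][1], Gx[1][1]]]
    v = solve2(Gt, rhs)
    # 4. dG/dx_j is 1 at entry (j, j) only, so dPhi/dx_j = -v_j * u_j.
    return [-v[0] * u[0], -v[1] * u[1]]


def merit_gradient_fd(x, h=1e-6):
    """Central finite differences, used only to check the adjoint result."""
    grad = []
    for j in range(2):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((merit(xp) - merit(xm)) / (2 * h))
    return grad


x0 = [2.0, 3.0]
g_adj = merit_gradient_adjoint(x0)
g_fd = merit_gradient_fd(x0)
```

The adjoint routine performs one forward solve and one transposed solve regardless of how many measurements are listed in MEAS, whereas differentiating each measurement separately would need one adjoint solve per measurement.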