Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ronald Minnich is active.

Publication


Featured research published by Ronald Minnich.


International Conference on Cluster Computing | 2002

Supermon: a high-speed cluster monitoring system

Matthew J. Sottile; Ronald Minnich

Supermon is a flexible set of tools for high-speed, scalable cluster monitoring. Node behavior can be monitored much faster than with other commonly used methods (e.g., rstatd). In addition, Supermon uses a data protocol based on symbolic expressions (S-expressions) at all levels, from individual nodes to entire clusters. This contributes to Supermon's scalability and allows it to function in a heterogeneous environment. This paper presents the Supermon architecture and discusses initial performance measurements on a cluster of heterogeneous Alpha-processor-based nodes.
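To make the data model concrete, here is a minimal sketch, assuming Linux's sysinfo(2) and invented field names, of a node emitting one monitoring sample as an S-expression; the real Supermon wire format differs in its details.

    /* Illustrative sketch: emit node statistics as an S-expression, the
       self-describing shape Supermon uses at every level. Field names and
       the use of sysinfo(2) are assumptions, not Supermon's protocol. */
    #include <stdio.h>
    #include <sys/sysinfo.h>   /* Linux-specific */

    int main(void) {
        struct sysinfo si;
        if (sysinfo(&si) != 0) { perror("sysinfo"); return 1; }
        printf("(node (uptime %ld) (load1 %.2f) (freeram %lu))\n",
               si.uptime,
               si.loads[0] / 65536.0,        /* fixed-point, scale 2^16 */
               si.freeram * (unsigned long)si.mem_unit);
        return 0;
    }

Because S-expressions nest, a concentrator can wrap many such per-node samples in an outer list and forward the result unchanged, which is one way a single protocol can serve both individual nodes and entire clusters.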


International Journal of Parallel Programming | 2003

A network-failure-tolerant message-passing system for terascale clusters

Richard L. Graham; Sung-Eun Choi; David Daniel; Nehal N. Desai; Ronald Minnich; Craig Edward Rasmussen; L. Dean Risinger; Mitchel W. Sukalski

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.
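As a rough illustration of the failure-tolerance idea, the sketch below checksums a message and retries it across alternate network paths until an acknowledged checksum matches; the xmit stand-in, path count, and retry policy are invented for this example and are not LA-MPI's actual interfaces.

    /* Hedged sketch of retransmission over multiple network paths.
       Path 0 is made to fail so the failover is visible. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPATHS    2
    #define MAX_RETRY 4

    /* Simple additive checksum standing in for a real CRC/checksum. */
    static uint32_t cksum(const uint8_t *p, size_t n) {
        uint32_t s = 0;
        while (n--) s += *p++;
        return s;
    }

    /* Stand-in for a per-path transmit that returns the receiver's
       acknowledged checksum; a real system would drive a NIC here. */
    static int xmit(int path, const uint8_t *buf, size_t n, uint32_t *acked) {
        if (path == 0) return -1;          /* simulated NIC/wire failure */
        *acked = cksum(buf, n);            /* receiver echoes its checksum */
        return 0;
    }

    static int reliable_send(const uint8_t *buf, size_t n) {
        for (int attempt = 0; attempt < MAX_RETRY; attempt++) {
            int path = attempt % NPATHS;   /* rotate across network paths */
            uint32_t acked;
            if (xmit(path, buf, n, &acked) == 0 && acked == cksum(buf, n)) {
                printf("delivered on path %d (attempt %d)\n", path, attempt + 1);
                return 0;
            }
        }
        return -1;                         /* all paths and retries exhausted */
    }

    int main(void) {
        const char msg[] = "rank 0 -> rank 1: halo exchange";
        return reliable_send((const uint8_t *)msg, sizeof msg) ? 1 : 0;
    }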


International Conference on Cluster Computing | 2004

Analysis of microbenchmarks for performance tuning of clusters

Matthew J. Sottile; Ronald Minnich

Microbenchmarks, i.e., very small computational kernels, have become a common quantitative measure of node performance in clusters. For example, one widely used benchmark measures the time required to perform a fixed quantum of work. Unfortunately, this benchmark is one of many that violate well-known rules from sampling theory, leading to erroneous, contradictory, or misleading results. At a minimum, these types of benchmarks cannot be used to identify time-based activities that may interfere with, and hence limit, application performance. Our original and primary goal remains to identify noise in the system due to periodic activities that are not part of user application code. We discuss why the fixed-quantum-of-work benchmark provides data of limited use for analysis, and we present code for, discuss, and analyze results from a microbenchmark that follows good rules of sampling hygiene and hence provides useful data for analysis.
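To sketch the alternative the abstract alludes to, the fragment below fixes the time quantum rather than the work quantum: it counts how much work completes in each equal time slice, so samples fall on a regular grid and interference appears as dips in the counts. The 1 ms slice and busy-loop workload are arbitrary choices for illustration, not the paper's published benchmark.

    /* Count work per fixed time slice; regular sampling makes the
       resulting series amenable to standard signal analysis. */
    #include <stdio.h>
    #include <time.h>

    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        const long long quantum = 1000000;   /* 1 ms slices */
        enum { SAMPLES = 1000 };
        long counts[SAMPLES];

        long long start = now_ns();
        for (int i = 0; i < SAMPLES; i++) {
            long long end = start + (long long)(i + 1) * quantum;
            long work = 0;
            while (now_ns() < end)           /* work until the slice expires */
                work++;
            counts[i] = work;
        }
        /* Dips below the typical count mark slices stolen by daemons,
           kernel threads, or interrupts, i.e. the noise of interest. */
        for (int i = 0; i < SAMPLES; i++)
            printf("%d %ld\n", i, counts[i]);
        return 0;
    }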


Field-Programmable Logic and Applications | 2004

Monte Carlo radiative heat transfer simulation on a reconfigurable computer

Maya Gokhale; Janette Frigo; Christine Ahrens; Justin L. Tripp; Ronald Minnich

Recently, the appearance of very large (3-10M gate) FPGAs with embedded arithmetic units has opened the door to floating-point computation on these devices. While previous researchers have described peak performance or kernel matrix operations, there is as yet relatively little experience with mapping an application-specific floating-point loop onto FPGAs. In this work, we port a supercomputer application benchmark onto Xilinx Virtex II and Virtex II Pro FPGAs and compare performance with three Pentium IV Xeon microprocessors. Our results show that this application-specific pipeline, with 12 multiply, 10 add/subtract, one divide, and two compare modules of single-precision floating-point data type, achieves a speedup of 10.37×. We analyze the trade-offs between hardware and software to characterize the algorithms that will perform well on current and future FPGA architectures.
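For a feel of the operation mix involved, here is a small single-precision Monte Carlo kernel of the same general flavor in software: it estimates the radiative view factor between two parallel unit patches by firing cosine-distributed rays. The geometry, ray count, and random-number choices are assumptions for illustration, not the paper's benchmark code.

    /* Monte Carlo view-factor estimate between two parallel unit squares
       separated by height h; single-precision throughout, like the FPGA
       pipeline's data type. Compile with -lm. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    static float frand(void) { return (float)rand() / (float)RAND_MAX; }

    int main(void) {
        const int rays = 1000000;
        const float h = 1.0f;
        int hits = 0;

        srand(42);
        for (int i = 0; i < rays; i++) {
            float x = frand(), y = frand();          /* emission point */
            float u1 = frand(), u2 = frand();
            /* Cosine-weighted direction for diffuse emission. */
            float sin_t = sqrtf(u1), phi = 2.0f * (float)M_PI * u2;
            float dx = sin_t * cosf(phi), dy = sin_t * sinf(phi);
            float dz = sqrtf(1.0f - u1);
            /* Intersect the plane z = h; test against the upper patch. */
            float t = h / dz;
            float px = x + t * dx, py = y + t * dy;
            if (px >= 0.0f && px <= 1.0f && py >= 0.0f && py <= 1.0f)
                hits++;
        }
        printf("estimated view factor: %f\n", (double)hits / rays);
        return 0;
    }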


International Conference on Cluster Computing | 2006

XCPU: a new, 9p-based, process management system for clusters and grids

Ronald Minnich; Andrey Mirtchovski

Xcpu is a new process management system that is equally at home on clusters and grids. Xcpu provides a process execution service visible to client nodes as a 9p server; it can be presented to users as a file system if that functionality is desired. The Xcpu service builds on our earlier work with the Bproc system. Xcpu differs from traditional remote execution services in several key ways, one of the most important being its use of a push rather than a pull model: binaries are pushed to the nodes by the job starter rather than pulled from a remote file system such as NFS. Bproc used a proprietary protocol, a process migration model, and a set of kernel modifications to achieve its goals. In contrast, Xcpu uses a well-understood protocol, namely 9p; a non-migration model for moving processes to remote nodes; and completely standard kernels on several operating systems, Plan 9 and Linux to start, with MacOS and others in development. In this paper, we describe our clustering model, how Bproc implements it, and how Xcpu implements a similar but not identical model. We describe in some detail the structure of the various Xcpu components. Finally, we close with a discussion of Xcpu performance as measured on several clusters at LANL, including the 1024-node Pink cluster and the 256-node Blue Steel InfiniBand cluster.
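From a client's point of view, the push model can be sketched as plain file I/O, since the service is presented as a file tree. The mount point and file names below (/mnt/xcpu/0/exec, /mnt/xcpu/0/stdout) are invented for illustration and do not reflect the actual Xcpu session layout.

    /* Illustrative only: push a local binary into a hypothetical
       Xcpu-style session directory, then stream the job's output. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int copy_fd(int from, int to) {
        char buf[8192];
        ssize_t n;
        while ((n = read(from, buf, sizeof buf)) > 0)
            if (write(to, buf, (size_t)n) != n)
                return -1;
        return n < 0 ? -1 : 0;
    }

    int main(void) {
        /* Push: the job starter writes the binary to the node, rather
           than the node pulling it from a file system such as NFS. */
        int bin = open("./a.out", O_RDONLY);
        int exe = open("/mnt/xcpu/0/exec", O_WRONLY);
        if (bin < 0 || exe < 0) { perror("open"); return 1; }
        if (copy_fd(bin, exe) != 0) { perror("push"); return 1; }
        close(bin);
        close(exe);   /* a real session would also drive a control file */

        /* Reading the session's stdout file streams the job's output. */
        int out = open("/mnt/xcpu/0/stdout", O_RDONLY);
        if (out < 0) { perror("open stdout"); return 1; }
        copy_fd(out, STDOUT_FILENO);
        close(out);
        return 0;
    }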


Operating Systems Review | 2006

Right-weight kernels: an off-the-shelf alternative to custom light-weight kernels

Ronald Minnich; Matthew J. Sottile; Sung-Eun Choi; Erik Hendriks; Jim McKie

DOE has put considerable effort into light-weight kernels for high-performance computing, yet there has been a lack of acceptance and use, perhaps due to limited support for hardware and software. The arguments for light-weight kernels have been based on the problem of interference, i.e., changes in application performance that occur when the operating system preempts the application. Nevertheless, using existing, well-supported operating systems for HPC systems has been quite successful. The problems with the standard operating systems remain, however, although their impact on applications is still not quantified. At LANL, we have undertaken a research program to determine whether Linux and/or Plan 9 can be used to realize the benefits of light-weight kernels while maintaining the benefits of a full-featured operating system. Specifically, we are evaluating measures that quantify what is good in a light-weight kernel. We are using this knowledge to modify Linux and Plan 9 to make them competitive with custom light-weight operating systems: in essence, a right-weight kernel. This paper presents both a summary of early results and a description of work in progress.
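One plausible first measurement, sketched below for Linux, charges involuntary preemptions against a fixed compute loop by reading /proc/self/status before and after; this is an illustration of the kind of interference metric in question, not the authors' methodology.

    /* Count involuntary context switches suffered by a fixed busy loop. */
    #include <stdio.h>

    static long nonvol_switches(void) {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];
        long n = -1;
        if (!f) return -1;
        while (fgets(line, sizeof line, f))
            if (sscanf(line, "nonvoluntary_ctxt_switches: %ld", &n) == 1)
                break;
        fclose(f);
        return n;
    }

    int main(void) {
        long before = nonvol_switches();
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++)   /* fixed quantum of work */
            x += 1e-9;
        long after = nonvol_switches();
        printf("involuntary preemptions during loop: %ld\n", after - before);
        return 0;
    }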


The Journal of Supercomputing | 2006

How to build a fast and reliable 1024 node cluster with only one disk

Erik Hendriks; Ronald Minnich

In the last year LANL has constructed a 1408-node AMD Opteron cluster, a 1024-node Intel P4 Xeon cluster, a 256-node AMD Opteron cluster and two 128-node Intel P4 Xeon clusters. Each of these clusters is controlled by one front-end node, and each cluster needs only one disk in the front-end node for production operations. In this paper we describe the software architecture that boots and manages these clusters. This software architecture represents a clean break from the way that clusters have been set up for the last 14 years. We show the ways that this architecture has been used to greatly improve the operation of the nodes, with particular emphasis on improvements in boot-time performance, scalability, and reliability.


International Conference on Cluster Computing | 2004

Give your bootstrap the boot: using the operating system to boot the operating system

Ronald Minnich

One of the slowest and most annoying aspects of system management is the simple act of rebooting the system. The sysadmin starts from a known state - the OS is running - and hands the computer over to an untrustworthy piece of software. With enough nodes involved, there is a certain chance that the process will fail on one of them. Bootstrapping is well named: it takes the system down to a low level, from which return is uncertain. It would be much better if we could use the known, trusted OS software to manage the boot process. The OS can apply all its power to the problem of locating, verifying, and loading a new OS image. Error checking and feedback can be far more robust. We discuss five systems for Linux and Plan 9 that allow the OS to boot the OS. These systems allow for the complete elimination of the old-fashioned bootstrap.
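The best-known Linux mechanism in this family today is kexec, which stages a new kernel from within the running one and jumps to it without returning to firmware. The sketch below uses the kexec_file_load(2) system call; the file paths and command line are illustrative, and the call requires root on a kernel built with kexec support (Linux 3.17 or later for this syscall).

    /* Minimal sketch: the running OS locates, loads, and verifies the
       next kernel, then jumps to it, skipping the firmware bootstrap. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/reboot.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/reboot.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        int kfd = open("/boot/vmlinuz", O_RDONLY);     /* illustrative paths */
        int ifd = open("/boot/initrd.img", O_RDONLY);
        const char *cmdline = "root=/dev/sda1 console=ttyS0";
        if (kfd < 0 || ifd < 0) { perror("open"); return 1; }

        /* Stage the new kernel image in memory from the running OS. */
        if (syscall(SYS_kexec_file_load, kfd, ifd,
                    strlen(cmdline) + 1, cmdline, 0UL) < 0) {
            perror("kexec_file_load");
            return 1;
        }
        sync();                                  /* flush disks first */
        reboot(LINUX_REBOOT_CMD_KEXEC);          /* jump to the staged kernel */
        return 0;
    }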


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

Pink: a 1024-node single-system image Linux cluster

G.R. Watson; Matthew J. Sottile; Ronald Minnich; Sung-Eun Choi; E.A. Hendriks

This work describes our experience of designing and building Pink, a 1024-node (2048-processor) Myrinet-based single-system-image Linux cluster that was installed in January 2003 at Los Alamos National Laboratory. At the time of its installation, Pink was the largest single-system-image Linux cluster in the world, and it was based entirely on open-source software, from the BIOS up. Pink was the proof-of-concept prototype for Lightning, a production 1408-node (2816-processor) cluster that began operation at LANL. Lightning is currently number 6 on the Top500 list. In this work we examine the issues that were encountered and the problems that had to be overcome in order to scale a cluster to this size. We also present performance numbers that demonstrate the scalability and manageability of the cluster software suite.


ALS '00: Proceedings of the 4th Annual Linux Showcase & Conference - Volume 4 | 2000

The Linux BIOS

Ronald Minnich; James Hendricks; Dale Webster


Collaboration


Dive into Ronald Minnich's collaboration.

Top Co-Authors

Matthew J. Sottile (Los Alamos National Laboratory)
Sung-Eun Choi (Los Alamos National Laboratory)
Andrey Mirtchovski (Los Alamos National Laboratory)
Christine Ahrens (Los Alamos National Laboratory)
Craig Edward Rasmussen (Los Alamos National Laboratory)
Dale Webster (Los Alamos National Laboratory)
David Daniel (Los Alamos National Laboratory)
E.A. Hendriks (Los Alamos National Laboratory)
G.R. Watson (Los Alamos National Laboratory)