James H. Laros | Researchain

Publication

Featured research published by James H. Laros.

ieee international conference on high performance computing data and analytics | 2011

Evaluating the viability of process replication reliability for exascale systems

Kurt Brian Ferreira; Jon Stearley; James H. Laros; Ron A. Oldfield; Kevin Pedretti; Ronald B. Brightwell; Rolf Riesen; Patrick G. Bridges; Dorian C. Arnold

As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean times to failure, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.

2013 International Green Computing Conference Proceedings | 2013

PowerInsight - A commodity power measurement capability

James H. Laros; Phil Pokorny; David DeBonis

The challenge of balancing between power and performance is now well established. While research in this area is well underway, the ability to measure power and energy in situ has remained an obstacle. This problem is magnified in the field of High Performance Computing (HPC). To meet this challenge, a device called PowerInsight has been designed to accomplish component level power and energy instrumentation of commodity hardware. PowerInsight was designed by Penguin Computing, in close cooperation with Sandia National Laboratories, to further power and energy research in HPC and other areas. This paper documents the design and development of PowerInsight, hardware and software. Validation of the functionality of PowerInsight was done during design and development as well as experimentally after integrating the first PowerInsight devices into a commodity cluster. This paper only begins to show the wide range of impact this level of power and energy instrumentation can have on a range of architectural and application research and analysis topics.

Archive | 2011

rMPI : increasing fault resiliency in a message-passing environment.

Jon Stearley; James H. Laros; Kurt Brian Ferreira; Kevin Pedretti; Ron A. Oldfield; Rolf Riesen; Ronald Brian Brightwell

As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale as well as show the scenarios in which redundant computation makes sense.

international conference on cluster computing | 2009

Topics on measuring real power usage on high performance computing platforms

James H. Laros; Kevin Pedretti; Suzanne M. Kelly; John P. VanDyke; Kurt Brian Ferreira; Mark Swan

Power has recently been recognized as one of the major obstacles in fielding a Peta-FLOPs class system. To reach Exa-FLOPs, the challenge will certainly be compounded. In this paper we will discuss a number of High Performance Computing power related topics. We first describe our implementation of a scalable power measurement framework that has enabled us to examine real power use (current draw). Using this framework, samples were obtained at a per-node (socket) granularity, at frequencies of up to 100 samples per second. Additionally, we describe how we applied this capability to implement power conserving measures on our Catamount Light Weight Kernel, where we achieved an 80% improvement. This ability has enabled us to quantify the amount of energy used by applications and to contrast application energy use between a Light Weight and General Purpose operating system. Finally, we show that application energy use increases proportionally with the increase in run-time due to operating system noise. Areas of future interest will also be discussed.
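As a rough illustration of how fixed-rate power samples such as those described above can be turned into an application energy figure, the sketch below integrates evenly spaced samples with the trapezoidal rule. It is not the framework from the paper; the function names, the supply voltage, and the sample values are all assumptions made for this example.

```python
def power_watts(volts, amps):
    """Instantaneous power from a supply voltage and a measured current draw."""
    return volts * amps


def energy_joules(power_samples_watts, sample_rate_hz):
    """Approximate energy by trapezoidal integration of evenly spaced samples.

    power_samples_watts: power readings in watts, taken at a fixed rate.
    sample_rate_hz: samples per second (for example, 100 for 100 samples/s).
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    if len(power_samples_watts) < 2:
        return 0.0  # a single sample spans no time interval
    dt = 1.0 / sample_rate_hz
    total = 0.0
    # Each adjacent pair of samples bounds one trapezoid of width dt.
    for left, right in zip(power_samples_watts, power_samples_watts[1:]):
        total += 0.5 * (left + right) * dt
    return total


if __name__ == "__main__":
    # Hypothetical node: 12 V rail, current sampled 100 times per second for 1 s.
    currents = [2.5] * 101
    samples = [power_watts(12.0, amps) for amps in currents]
    print(energy_joules(samples, 100.0))
```

Under this simple model, a run that takes longer at the same average power uses proportionally more energy, which is the kind of relationship the abstract reports for run-time increases caused by operating system noise.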

Archive | 2010

Redundant computing for exascale systems.

Jon Stearley; Rolf Riesen; James H. Laros; Kurt Brian Ferreira; Kevin Pedretti; Ron A. Oldfield; Ronald B. Brightwell

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.
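To give a feel for why checkpoint/restart overhead grows with node count, the sketch below uses Young's well-known first-order approximation for the checkpoint interval. This is not the analysis method or the simulator from the paper. It assumes independent, exponentially distributed node failures, ignores restart time, and is only reasonable while the interval is much shorter than the system MTBF. All function names and parameter values are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours, nodes):
    """System MTBF when any single node failure interrupts the application."""
    return node_mtbf_hours / nodes


def young_interval(checkpoint_hours, mtbf_hours):
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def waste_fraction(checkpoint_hours, mtbf_hours, interval_hours):
    """First-order fraction of time lost to checkpoints and redone work.

    checkpoint_hours / interval_hours: time spent writing checkpoints.
    interval_hours / (2 * mtbf_hours): expected lost work per failure,
    assuming a failure lands, on average, halfway through an interval.
    """
    return checkpoint_hours / interval_hours + interval_hours / (2.0 * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical machine: 50,000 h node MTBF, 0.02 h (72 s) checkpoint time.
    for nodes in (1_000, 10_000, 50_000):
        mtbf = system_mtbf(50_000.0, nodes)
        tau = young_interval(0.02, mtbf)
        print(nodes, round(mtbf, 2), round(tau, 3),
              round(waste_fraction(0.02, mtbf, tau), 3))
```

In this model the wasted fraction at the optimal interval scales with the square root of the node count, so it rises steadily as machines grow. Redundant computing changes the picture by raising the effective MTBF, since an interrupt then requires all replicas of some process to fail.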

international conference on cluster computing | 2007

Red storm IO performance analysis

James H. Laros; Lee Ward; Ruth Klundt; Sue Kelly; James L. Tomkins; Brian R. Kellogg

This paper will summarize an IO performance analysis effort performed on Sandia National Laboratories' Red Storm platform. Our goal was to examine the IO system performance and identify problems or bottlenecks in any aspect of the IO sub-system. Our process examined the entire IO path from application to disk both in segments and as a whole. Our final analysis was performed at scale employing parallel IO access methods typically used in high performance computing applications.

Archive | 2013

Energy-Efficient High Performance Computing

James H. Laros; Kevin Pedretti; Suzanne M. Kelly; Wei Shu; Kurt Brian Ferreira; John Van Dyke

There are many motivations driving the desire for increased energy efficiency. While many sectors share similar motivations, the High Performance Computing (HPC) sector must address a different set of challenges in achieving energy efficiency. This chapter will outline some of the motivations of this research along with the approach taken to address these recognized challenges, specifically for large-scale platforms.

Archive | 2016

High Performance Computing - Power Application Programming Interface Specification Version 1.1a

James H. Laros; David DeBonis; Ryan E. Grant; Suzanne M. Kelly; Michael J. Levenhagen; Stephen L. Olivier; Kevin Pedretti

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

Archive | 2013

Energy Delay Product

James H. Laros; Kevin Pedretti; Suzanne M. Kelly; Wei Shu; Kurt Brian Ferreira; John P. VanDyke

In this chapter, data from both the CPU frequency tuning experiments (Chap. 6) and the network bandwidth experiments (Chap. 7) are analyzed using a range of fused metrics based on Energy Delay Product (EDP). The analysis in this chapter demonstrates how multiple metrics can be combined and observed as a single fused metric. Additionally, a form of weighted EDP is used to more highly prioritize, or weight, performance over energy savings.
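As a small worked illustration of a fused metric, the sketch below computes the Energy Delay Product as energy multiplied by run time, and a weighted variant that raises run time to a power so that performance counts more heavily. The exponent form is one common way to weight EDP and is assumed here; it is not necessarily the exact form used in the chapter. The function name and the two configurations are made up for the example.

```python
def edp(energy_joules, runtime_seconds, weight=1.0):
    """Energy Delay Product: energy * runtime ** weight.

    weight=1.0 gives plain EDP. Larger weights penalize longer run times
    more strongly, prioritizing performance over energy savings.
    """
    if energy_joules < 0 or runtime_seconds < 0:
        raise ValueError("energy and runtime must be non-negative")
    return energy_joules * runtime_seconds ** weight


if __name__ == "__main__":
    # Hypothetical measurements: (energy in joules, run time in seconds).
    baseline = (1000.0, 10.0)
    tuned = (800.0, 11.0)  # 20% less energy, 10% longer run time
    for w in (1.0, 2.0, 3.0):
        print(w, edp(*baseline, weight=w), edp(*tuned, weight=w))
```

With these made-up numbers the tuned configuration has the lower (better) value for weights 1 and 2, while the baseline wins at weight 3, which shows how the choice of weight can change which configuration a fused metric favors.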

Archive | 2012

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Matthew L. Curry; Kurt Brian Ferreira; Kevin Pedretti; Vitus J. Leung; Kenneth Moreland; Gerald Fredrick Lofstead; Ann C. Gentile; Ruth Klundt; H. Lee Ward; James H. Laros; Karl Scott Hemmert; Nathan D. Fabian; Michael J. Levenhagen; Ronald B. Brightwell; Richard Frederick Barrett; Kyle Bruce Wheeler; Suzanne M. Kelly; Arun F. Rodrigues; James M. Brandt; David C. Thompson; John P. VanDyke; Ron A. Oldfield; Thomas Tucker

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467: Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.
