Publication


Featured research published by Jim M. Brandt.


International Parallel and Distributed Processing Symposium | 2008

OVIS-2: A robust distributed architecture for scalable RAS

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We describe some of its sophisticated statistical analyses and 3-D visualizations, along with use cases for these. Using this framework and associated tools allows the engineer to explore the behaviors and complex interactions of low-level system elements while simultaneously giving the system administrator their desired level of detail with respect to ongoing system and component health.
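
The run-time statistical characterization described above can be illustrated with a small sketch: compare each node's value of a metric against the distribution over its peers and flag strong outliers. This is a minimal illustration of the idea, not the OVIS-2 implementation; the metric, node names, and threshold are assumptions.

```python
import statistics

def flag_outlier_nodes(samples: dict[str, float], z_threshold: float = 3.0) -> list[str]:
    """Flag nodes whose metric value deviates strongly from the peer distribution.

    samples maps node name -> current value of one metric (e.g. a temperature).
    A node is flagged when its z-score against all peers exceeds z_threshold.
    """
    values = list(samples.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [node for node, v in samples.items() if abs(v - mean) / stdev > z_threshold]

# Hypothetical readings: node042 runs much hotter than its peers.
readings = {f"node{i:03d}": 55.0 + (i % 5) for i in range(64)}
readings["node042"] = 92.0
print(flag_outlier_nodes(readings))  # ['node042']
```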


Computer Science - Research and Development | 2011

Baler: deterministic, lossless log message clustering tool

Narate Taerat; Jim M. Brandt; Ann C. Gentile; Matthew H. Wong; Chokchai Leangsuksun

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability, and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results.
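
As a toy sketch of the general idea behind deterministic log-message clustering (not Baler's actual algorithm or pattern rules): mask tokens that look like variable data, then group messages sharing the resulting pattern. The token-matching rules and sample log lines below are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed rule: numbers, hex values, and hostname-like tokens are "variable" data.
VARIABLE_TOKEN = re.compile(r"^(\d+|0x[0-9a-fA-F]+|[\w-]+\.(?:com|org|gov|local))$")

def to_pattern(message: str) -> str:
    """Map a raw log line to a cluster pattern by masking variable-looking tokens."""
    return " ".join("*" if VARIABLE_TOKEN.match(t) else t for t in message.split())

def cluster(messages: list[str]) -> dict[str, list[str]]:
    """Group messages that reduce to the same pattern."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for msg in messages:
        clusters[to_pattern(msg)].append(msg)
    return dict(clusters)

logs = [
    "kernel: CPU 12 over temperature threshold",
    "kernel: CPU 37 over temperature threshold",
    "sshd: accepted connection from host17.local",
]
for pattern, members in cluster(logs).items():
    print(f"{len(members):3d}  {pattern}")
```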


Proceedings of the 2009 Workshop on Resiliency in High Performance Computing | 2009

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Jim M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and the efficiency of platforms. In this paper we present our findings and the methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would provide advance warning of cluster problems on timescales that would allow application and system administrator response before errors, subsequent system error log reporting, and job failures occur. This is significant because the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.
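
One simple way to turn a monitored metric into an advance-warning signal, in the spirit of the goal above but not the OVIS analysis itself, is to fit a linear trend to recent samples and estimate when the metric will cross an alert threshold. The metric, threshold, and sampling interval below are illustrative assumptions.

```python
import statistics

def time_to_threshold(samples: list[tuple[float, float]], threshold: float) -> float | None:
    """Estimate seconds until a linearly trending metric crosses `threshold`.

    samples is a list of (timestamp_seconds, value) pairs.
    Returns None if the metric is flat or trending away from the threshold.
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    slope, intercept = statistics.linear_regression(times, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(crossing_time - times[-1], 0.0)

# Hypothetical: a correctable-error count creeping upward, sampled once a minute.
history = [(60.0 * i, 10.0 + 0.5 * i) for i in range(30)]
print(time_to_threshold(history, threshold=100.0))  # seconds of advance warning
```

Note that statistics.linear_regression requires Python 3.10 or newer.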


Cluster Computing and the Grid | 2008

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt, because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt versus the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application.
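
A minimal sketch of the selection idea: score each node by how far its monitored behavior sits from the center of the peer distribution, then give the scheduler the least-deviant subset. The metrics and scoring rule are assumptions for illustration, not the probabilistic model from the paper.

```python
import statistics

def select_nodes(node_metrics: dict[str, list[float]], count: int) -> list[str]:
    """Pick `count` nodes whose metric vectors lie closest to the fleet-wide medians.

    node_metrics maps node -> list of metric values (same order for every node).
    The deviation score is the sum of absolute distances from each per-metric median.
    """
    n_metrics = len(next(iter(node_metrics.values())))
    medians = [statistics.median(m[i] for m in node_metrics.values())
               for i in range(n_metrics)]
    def deviation(node: str) -> float:
        return sum(abs(v - med) for v, med in zip(node_metrics[node], medians))
    return sorted(node_metrics, key=deviation)[:count]

# Hypothetical fleet: node03 deviates strongly on its second metric (e.g. fan speed).
fleet = {
    "node01": [55.0, 4800.0], "node02": [56.0, 4750.0],
    "node03": [57.0, 9100.0], "node04": [54.0, 4820.0],
}
print(select_nodes(fleet, count=3))  # node03 is ranked last and left out
```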


Dependable Systems and Networks | 2012

Filtering log data: Finding the needles in the Haystack

Li Yu; Ziming Zheng; Zhiling Lan; Terry Jones; Jim M. Brandt; Ann C. Gentile

Log data is an invaluable asset for troubleshooting in large-scale systems. Nevertheless, due to the ever-growing system scale, the volume of such data becomes overwhelming, bringing enormous burdens on both data storage and data analysis. To address this problem, we present a 2-dimensional online filtering mechanism to remove redundant and noisy data via feature selection and instance selection. The objective of this work is two-fold: (i) to significantly reduce data volume without losing important information, and (ii) to effectively promote data analysis. We evaluate this new filtering mechanism by means of real environmental data from the production supercomputers at Oak Ridge National Laboratory and Sandia National Laboratories. Our preliminary results demonstrate that our method can reduce disk space usage by more than 85%, thereby significantly reducing analysis time. Moreover, it also improves failure prediction and diagnosis by more than 20%, as compared to the conventional predictive approach relying on RAS (Reliability, Availability, and Serviceability) events alone.
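
A toy sketch of the two filtering dimensions under assumed record shapes (this is not the paper's online algorithm): feature selection drops metrics that never vary, and instance selection drops records that simply repeat their predecessor.

```python
def select_features(records: list[dict[str, float]], min_range: float = 1e-6) -> list[str]:
    """Feature selection: keep only metrics that actually vary across the records."""
    return [k for k in records[0]
            if max(r[k] for r in records) - min(r[k] for r in records) > min_range]

def select_instances(records: list[dict[str, float]], keys: list[str]) -> list[dict[str, float]]:
    """Instance selection: drop a record if it repeats the previously kept record."""
    kept: list[dict[str, float]] = []
    for r in records:
        projected = {k: r[k] for k in keys}
        if not kept or projected != kept[-1]:
            kept.append(projected)
    return kept

# Hypothetical telemetry: 'fw_version' never changes and two samples are identical.
raw = [{"load": 0.5, "temp": 60.0, "fw_version": 2.0},
       {"load": 0.5, "temp": 60.0, "fw_version": 2.0},
       {"load": 0.9, "temp": 71.0, "fw_version": 2.0}]
keys = select_features(raw)
print(keys, select_instances(raw, keys))
```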


International Conference on Cluster Computing | 2005

Meaningful Automated Statistical Analysis of Large Computational Clusters

Jim M. Brandt; Ann C. Gentile; Youssef M. Marzouk; Philippe Pierre Pebay

As clusters utilizing commercial off-the-shelf technology have grown from tens to thousands of nodes and typical job sizes have likewise increased, much effort has been devoted to improving the scalability of message-passing fabrics, schedulers, and storage. Largely ignored, however, has been the issue of predicting node failure, which also has a large impact on scalability. In fact, more than ten years into cluster computing, we are still managing this issue on a node-by-node basis even though the available diagnostic data has grown immensely. We have built a tool that uses the statistical similarity of the large number of nodes in a cluster to infer the health of each individual node. In the poster, we first present real data and statistical calculations as foundational material and justification for our claims of similarity. Next we present our methodology and its implications for early notification of deviation from normal behavior, problem diagnosis, automatic code restart via interaction with the scheduler, and airflow distribution monitoring in the machine room. A framework addressing scalability is discussed briefly. Lastly, we present case studies showing how our methodology has been used to detect aberrant nodes whose deviations are still far below the detection level of traditional methods. A summary of the results of the case studies appears below.
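
The point about catching deviations far below traditional detection thresholds can be illustrated with a cumulative-sum (CUSUM) style check on a node's deviation from its peers; the technique choice and parameters here are assumptions for illustration, not necessarily the statistics used in the poster.

```python
def cusum_alert(deviations: list[float], drift: float = 0.5, limit: float = 5.0) -> int | None:
    """Return the sample index at which a persistent small shift becomes significant.

    deviations are one node's per-sample differences from the fleet median,
    in units of the fleet's standard deviation. Each individual value may be far
    too small to trip a per-sample alarm, yet the accumulated evidence still alerts.
    """
    s = 0.0
    for i, d in enumerate(deviations):
        s = max(0.0, s + d - drift)  # accumulate only deviation beyond the allowed drift
        if s > limit:
            return i
    return None

# Hypothetical node running consistently ~1 sigma warm -- invisible to a 3-sigma alarm.
samples = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 1.0, 1.2, 0.9, 1.1, 1.0, 1.2, 1.0]
print(cusum_alert(samples))  # alerts after about ten samples
```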


International Supercomputing Conference | 2017

Diagnosing Performance Variations in HPC Applications Using Machine Learning

Ozan Tuncer; Emre Ates; Yijia Zhang; Ata Turk; Jim M. Brandt; Vitus J. Leung; Manuel Egele; Ayse Kivilcim Coskun

With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures.
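
To make the idea concrete, here is a minimal sketch of classifying anomaly types from per-job statistical features of monitoring data with a random forest; the features, labels, and tiny training set are invented for illustration and this is not the paper's pipeline. Requires scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: per-job feature vectors extracted from monitoring
# time series, labeled with the anomaly present (if any) while the job ran.
X_train = [
    # [mean_cpu_util, mean_mem_bw, cache_miss_rate]  -- assumed feature names
    [0.95, 0.80, 0.02],  # healthy
    [0.40, 0.82, 0.03],  # CPU contention
    [0.93, 0.20, 0.30],  # memory bandwidth contention
    [0.96, 0.78, 0.01],  # healthy
]
y_train = ["healthy", "cpu_contention", "mem_contention", "healthy"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Classify features extracted from a new, unlabeled run.
print(clf.predict([[0.41, 0.79, 0.04]]))  # likely 'cpu_contention'
```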


International Conference on Parallel Processing | 2011

Framework for enabling system understanding

Jim M. Brandt; Frank Xiaoxiao Chen; Ann C. Gentile; Chokchai Leangsuksun; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; Narate Taerat; David C. Thompson; Matthew H. Wong

Building the effective HPC resilience mechanisms required for the viability of next-generation supercomputers will require in-depth understanding of system and component behaviors. Our goal is to build an integrated framework for high-fidelity long-term information storage, historic and run-time analysis, and algorithmic and visual information exploration, in order to enable system understanding, timely failure detection/prediction, and triggering of appropriate responses to failure situations. Since it is unknown what information is relevant, and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and also support the integration of new data, data sources, and analysis capabilities. Further, in order to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of the capabilities mentioned above into our OVIS tool.


International Parallel and Distributed Processing Symposium | 2016

Design and Implementation of a Scalable HPC Monitoring System

S. Sanchez; A. Bonnie; G. Van Heule; C. Robinson; A. DeConinck; K. Kelly; Q. Snead; Jim M. Brandt

Over the past decade, platforms at Los Alamos National Laboratory (LANL) have experienced large increases in complexity and scale to reach computational targets. The changes to the compute platforms have presented new challenges to the production monitoring systems, which must not only cope with larger volumes of monitoring data but also provide new capabilities for the management, distribution, and analysis of this data. This schema must support both real-time analysis for alerting on urgent issues and analysis of historical data for understanding performance issues and trends in system behavior. This paper presents the design of our proposed next-generation monitoring system, as well as implementation details for an initial deployment. This design takes the form of a multi-stage data processing pipeline, including a scalable cluster for data aggregation and early analysis, a message broker for distribution of this data to varied consumers, and an initial selection of consumer services for alerting and analysis. We also present estimates of the capabilities and scale required to monitor two upcoming compute platforms at LANL.
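
A very small in-process sketch of the pipeline shape described above: an aggregation stage does early analysis and publishes records, a broker stage distributes them, and consumers alert or archive. A production deployment would use a real message broker (e.g. RabbitMQ or Kafka); everything below is an assumed stand-in.

```python
import queue
import threading

broker: "queue.Queue[dict | None]" = queue.Queue()  # stand-in for a real message broker

def aggregator(samples: list[dict]) -> None:
    """Aggregation / early-analysis stage: tag each record, then publish it."""
    for s in samples:
        s["urgent"] = s["temp_c"] > 90.0
        broker.put(s)
    broker.put(None)  # end-of-stream marker for this sketch

def consumer() -> None:
    """Consumer services: alert on urgent records, 'archive' the rest."""
    while (record := broker.get()) is not None:
        print("ALERT:" if record["urgent"] else "archive:", record)

t = threading.Thread(target=consumer)
t.start()
aggregator([{"node": "n01", "temp_c": 65.0}, {"node": "n02", "temp_c": 93.0}])
t.join()
```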


International Parallel and Distributed Processing Symposium | 2016

Large-Scale Persistent Numerical Data Source Monitoring System Experiences

Jim M. Brandt; Ann C. Gentile; Michael T. Showerman; Jeremy Enos; Joshi Fullop; Gregory H. Bauer

Issues of High Performance Computer (HPC) system diagnosis, automated system management, and resource-aware computing are all dependent on high-fidelity, system-wide, persistent monitoring. Development and deployment of an effective persistent system-wide monitoring service at large scale presents a number of challenges, particularly when collecting data at the granularities needed to resolve features of interest and obtain early indication of significant events on the system. In this paper we provide experiences from our development of, and two-year deployment of, our Lightweight Distributed Metric Service (LDMS) monitoring system on NCSA's 27,648-node Blue Waters system. We present monitoring-related challenges and issues and their effects on the major functional components of general monitoring infrastructures and deployments: Data Sampling, Data Aggregation, Data Storage, Analysis Support, Operations, and Data Stewardship. Based on these experiences, we provide recommendations for effective development and deployment of HPC monitoring systems.

Collaboration


An overview of Jim M. Brandt's research collaborations.

Top Co-Authors

Ann C. Gentile, Sandia National Laboratories
Matthew H. Wong, Sandia National Laboratories
David C. Thompson, Sandia National Laboratories
Jackson R. Mayo, Sandia National Laboratories
Narate Taerat, Louisiana Tech University
Benjamin A. Allan, Sandia National Laboratories