Adam J. Oliner | Researchain

Publication

Featured research published by Adam J. Oliner.

dependable systems and networks | 2007

What Supercomputers Say: A Study of Five System Logs

Adam J. Oliner; Jon Stearley

If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and direction for future research into the use of such logs. We present details about the systems, methods of log collection, and how alerts were identified; propose a simpler and more effective filtering algorithm; and define operational context to encompass the crucial information that we found to be currently missing from most logs. The machines we consider (and the number of processors) are: Blue Gene/L (131072), Red Storm (10880), Thunderbird (9024), Spirit (1028), and Liberty (512). This is the first study of raw system logs from multiple supercomputers.

international conference on supercomputing | 2006

Cooperative checkpointing: a robust approach to large-scale systems reliability

Adam J. Oliner; Larry Rudolph; Ramendra K. Sahoo

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.

dependable systems and networks | 2010

Using correlated surprise to infer shared influence

Adam J. Oliner; Ashutosh Kulkarni; Alex Aiken

We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions that includes direct communication and resource contention. Our method infers the influences among components in a system by looking for pairs of components with time-correlated anomalous behavior. We summarize the strength and directionality of shared influences using a Structure-of-Influence Graph (SIG). This paper explains how to construct a SIG and use it to isolate system misbehavior, and presents both simulations and in-depth case studies with two autonomous vehicles and a 9024-node production supercomputer.
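The abstract does not spell out how anomaly signals are computed or how the SIG is assembled, so the following is only a minimal sketch of the underlying idea: given two per-component anomaly signals (assumed here to be plain lists of numbers sampled at the same rate), a time-lagged correlation suggests both how strongly the components co-vary and which one tends to lead. The function names and the use of Pearson correlation are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def lagged_correlation(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag] over the overlap.

    A positive lag asks whether anomalies in `a` tend to precede those in `b`.
    Returns 0.0 when the overlap is too short or either window is constant.
    """
    if lag >= 0:
        x, y = a[:len(a) - lag], b[lag:]
    else:
        x, y = a[-lag:], b[:len(b) + lag]
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    if n < 2:
        return 0.0
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    return cov / (sx * sy)


def strongest_lag(a, b, max_lag):
    """Return (lag, correlation) for the lag with the largest |correlation|."""
    best = max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_correlation(a, b, k)))
    return best, lagged_correlation(a, b, best)


# Hypothetical anomaly signals: component B echoes component A two steps later.
signal_a = [0, 0, 5, 0, 0, 0, 3, 0, 0, 0]
signal_b = [0, 0, 0, 0, 5, 0, 0, 0, 3, 0]
lag, strength = strongest_lag(signal_a, signal_b, max_lag=3)
```

In a graph built from such pairwise results, the correlation magnitude could weight an edge and the sign of the best lag could orient it; whether the paper uses exactly this construction is not stated in the abstract.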

dependable systems and networks | 2011

Online detection of multi-component interactions in production systems

Adam J. Oliner; Alex Aiken

We present an online, scalable method for inferring the interactions among the components of large production systems. We validate our approach on more than 1.3 billion lines of log files from eight unmodified production systems, showing that our approach efficiently identifies important relationships among components, handles very large systems with many simultaneous signals in real time, and produces information that is useful to system administrators.

international parallel and distributed processing symposium | 2006

Cooperative checkpointing theory

Adam J. Oliner; Larry Rudolph; Ramendra K. Sahoo

Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. Using results from cooperative checkpointing theory, this paper proves that periodic checkpointing is not expected to be competitive with the offline optimal. By leveraging probabilistic information about the future, cooperative checkpointing gives flexible algorithms that are optimally competitive. The results prove that simulating periodic checkpointing, by performing only every dth checkpoint, is not competitive with the offline optimal in the worst case; a simple modification gives a provably competitive algorithm. Calculations using failure traces from a prototype of IBM's Blue Gene/L show an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing, under realistic conditions. We contribute an approach to providing large-scale system reliability through cooperative checkpointing and techniques for analyzing the approach.

recent advances in intrusion detection | 2010

Community epidemic detection using time-correlated anomalies

Adam J. Oliner; Ashutosh Kulkarni; Alex Aiken

An epidemic is malicious code running on a subset of a community, a homogeneous set of instances of an application. Syzygy is an epidemic detection framework that looks for time-correlated anomalies, i.e., divergence from a model of dynamic behavior. We show mathematically and experimentally that, by leveraging the statistical properties of a large community, Syzygy is able to detect epidemics even under adverse conditions, such as when an exploit employs both mimicry and polymorphism. This work provides a mathematical basis for Syzygy, describes our particular implementation, and tests the approach with a variety of exploits and on commodity server and desktop applications to demonstrate its effectiveness.
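The statistical intuition of pooling anomaly scores across a community can be illustrated with a small sketch. It assumes each community member reports a numeric anomaly score and that healthy scores are roughly independent, so the community average fluctuates far less than any single score. The threshold rule below (mean plus k standard errors) and all names are assumptions for illustration, not Syzygy's actual model.

```python
import math
from statistics import mean, pstdev


def community_threshold(healthy_scores, community_size, k=3.0):
    """Threshold on the community-average score.

    Averaging over `community_size` roughly independent clients shrinks the
    spread of the average by sqrt(community_size), so the threshold can sit
    much closer to the healthy mean than a per-client threshold could.
    """
    mu = mean(healthy_scores)
    sigma = pstdev(healthy_scores)
    return mu + k * sigma / math.sqrt(community_size)


def epidemic_suspected(current_scores, threshold):
    """Flag an epidemic when the average anomaly score exceeds the threshold."""
    return mean(current_scores) > threshold


# Hypothetical data: 100 clients, 20 of which are subtly perturbed by an exploit.
healthy = [0.1, 0.2, 0.3, 0.2] * 25
infected = [s + 0.3 if i < 20 else s for i, s in enumerate(healthy)]
threshold = community_threshold(healthy, community_size=len(healthy))
```

With these made-up numbers, the healthy community stays below the threshold while the partially infected one crosses it, even though the perturbation on a single client is small.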

international parallel and distributed processing symposium | 2006

Evaluating cooperative checkpointing for supercomputing systems

Adam J. Oliner; Ramendra K. Sahoo

Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads.
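The two policies can be pictured as a gatekeeper that sees each checkpoint request and decides whether to perform or skip it. The abstract does not give the exact decision rules, so the comparisons below (unsaved work against checkpoint overhead, optionally scaled by a predicted failure probability) are assumed stand-ins, and every function name is hypothetical.

```python
def work_based_decision(unsaved_work, checkpoint_overhead):
    """Assumed rule: checkpoint only when the work at risk is at least the cost."""
    return unsaved_work >= checkpoint_overhead


def risk_based_decision(unsaved_work, checkpoint_overhead, failure_probability):
    """Assumed rule: checkpoint when the expected loss is at least the cost."""
    return failure_probability * unsaved_work >= checkpoint_overhead


def run_gatekeeper(intervals, decide):
    """Replay checkpoint requests and record which ones are performed.

    `intervals` holds the amount of work done before each request;
    `decide` maps the current unsaved work to True (perform) or False (skip).
    """
    unsaved = 0.0
    decisions = []
    for interval in intervals:
        unsaved += interval
        perform = decide(unsaved)
        decisions.append(perform)
        if perform:
            unsaved = 0.0  # a performed checkpoint saves all accumulated work
    return decisions


# Hypothetical trace: three short intervals, then a long one; overhead of 25 units.
decisions = run_gatekeeper([10, 10, 10, 50],
                           lambda u: work_based_decision(u, 25))
```

Here the first two requests are skipped because too little work is at risk, and the gatekeeper checkpoints once the accumulated unsaved work outweighs the overhead.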

international conference on supercomputing | 2010

A query language for understanding component interactions in production systems

Adam J. Oliner; Alex Aiken

When something unexpected happens in a large production system, administrators must first perform a search to isolate which components and component interactions are likely to be involved. The system may consist of thousands of interacting subsystems, the logging instrumentation may be noisy or incomplete, and the problem description may be vague, so this search is often the most difficult part of understanding the system's behavior. To facilitate the search process, we present a query language and a method for computing these queries that makes minimal assumptions about the available data. We evaluate our method on nearly 1.22 billion lines of system logs from four supercomputers, two autonomous vehicles, and a server cluster.

Proceedings of the 2007 workshop on Experimental computer science | 2007

RA: ResearchAssistant for the computational sciences

Daniel Ramage; Adam J. Oliner

Computational experiments often discard large amounts of valuable data, such as invocation parameters and the lineage of output. Our goal is to identify, manage, capture, and organize this information. These data can be used to make the scientific process simpler and more efficient, and to increase the value of the research by making it more rigorous and reproducible. Research Assistant (RA) is an open source Java programming tool that helps to plug this information leak. RA ensures that all console output is valid XML; saves invocation parameters, the random seed, and code version information; automatically checkpoints intermediate results; creates runnable experiment packages; and keeps meticulous notes. This paper presents the design and implementation of RA, and shows how RA easily scales to make complex experiments repeatable.

international conference on data mining | 2008