
Publication


Featured research published by Susan Coghlan.


International Conference on Cluster Computing | 2006

The Influence of Operating Systems on the Performance of Collective Operations at Extreme Scale

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan

We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
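
The measurement idea described here can be illustrated with a fixed-work timing loop: execute the same short unit of work many times and treat iterations that take far longer than the observed minimum as detours caused by external interruptions. The sketch below is a minimal, hypothetical Python rendition of that idea, not the microbenchmark used in the paper; the work function and threshold are illustrative assumptions.

```python
# Minimal sketch of a fixed-work "noise" probe: time many identical, short
# work quanta and treat iterations that take much longer than the observed
# minimum as detours caused by external interruptions (OS noise).
# Illustrative only; not the benchmark used in the paper.
import time

def quantum(n=10_000):
    # A small, constant unit of work.
    s = 0
    for i in range(n):
        s += i * i
    return s

def measure_noise(iterations=50_000, threshold_factor=3.0):
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        quantum()
        samples.append(time.perf_counter_ns() - t0)

    baseline = min(samples)               # assumed "noise-free" duration
    cutoff = threshold_factor * baseline  # anything slower counts as a detour
    detours = [s - baseline for s in samples if s > cutoff]

    total = sum(samples)
    print(f"baseline quantum : {baseline} ns")
    print(f"detours detected : {len(detours)}")
    if detours:
        print(f"largest detour   : {max(detours)} ns")
    print(f"noise fraction   : {sum(detours) / total:.4%}")

if __name__ == "__main__":
    measure_noise()
```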


Scientific Cloud Computing | 2011

Magellan: experiences from a science cloud

Lavanya Ramakrishnan; Piotr T. Zbiegel; Scott Campbell; Rick Bradshaw; Richard Shane Canon; Susan Coghlan; Iwona Sakrejda; Narayan Desai; Tina Declerck; Anping Liu

Cloud resources promise to be an avenue to address new categories of scientific applications, including data-intensive science applications, on-demand/surge computing, and applications that require customized software environments. However, there is limited understanding of how to operate and use clouds for scientific applications. Magellan, a project funded through the Department of Energy's (DOE) Advanced Scientific Computing Research (ASCR) program, is investigating the use of cloud computing for science at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Facility (NERSC). In this paper, we detail the experiences to date at both sites and identify the gaps and open challenges from both a resource provider and an application perspective.


Cluster Computing | 2008

Benchmarking the effects of operating system interference on extreme-scale parallel machines

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan; Aroon Nataraj

We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
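
The observation that synchronizing noise reduces its impact can be seen with a toy model: a collective completes only when the slowest process arrives, so independent random interruptions almost always delay someone, while interruptions that strike every process at the same moment are paid only once. The simulation below is a hypothetical illustration of that reasoning, not the experiment from the paper; process counts and noise parameters are invented.

```python
# Toy model: an allreduce-style collective finishes when the slowest of P
# processes finishes its local work.  Compare random (unsynchronized) noise
# with noise that strikes all processes at the same time (synchronized).
# Numbers are illustrative assumptions, not measurements from the paper.
import random

def collective_time(num_procs, work_us, noise_prob, noise_us, synchronized):
    if synchronized:
        # One coin flip for the whole machine: everyone pays the same delay.
        delay = noise_us if random.random() < noise_prob else 0.0
        return work_us + delay
    # Independent coin flips: the collective waits for the unluckiest process.
    return max(
        work_us + (noise_us if random.random() < noise_prob else 0.0)
        for _ in range(num_procs)
    )

def average(trials, **kwargs):
    return sum(collective_time(**kwargs) for _ in range(trials)) / trials

if __name__ == "__main__":
    random.seed(0)
    params = dict(num_procs=10_000, work_us=100.0, noise_prob=0.001, noise_us=500.0)
    unsync = average(500, synchronized=False, **params)
    sync = average(500, synchronized=True, **params)
    print(f"unsynchronized noise: {unsync:.1f} us per collective")
    print(f"synchronized noise  : {sync:.1f} us per collective")
```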


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems

Xu Yang; Zhou Zhou; Sean Wallace; Zhiling Lan; Wei Tang; Susan Coghlan; Michael E. Papka

The research literature to date has mainly aimed at reducing energy consumption in HPC environments. In this paper we propose a job power aware scheduling mechanism to reduce the electricity bill of HPC systems without degrading system utilization. The novelty of our job scheduling mechanism is its ability to take the variation of electricity price into consideration as a means to make better decisions about when to schedule jobs with diverse power profiles. We verified the effectiveness of our design by conducting trace-based experiments on an IBM Blue Gene/P and a cluster system, as well as a case study on Argonne's 48-rack IBM Blue Gene/Q system. Our preliminary results show that our power aware algorithm can reduce the electricity bill of HPC systems by as much as 23%.
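
The core idea, steering power-hungry jobs into cheaper electricity periods while keeping the machine busy, can be sketched as a simple greedy heuristic. The example below is a hypothetical illustration under assumed tariff and job data; it is not the scheduling algorithm evaluated in the paper.

```python
# Greedy sketch of power-aware scheduling under time-of-use electricity
# pricing: give the highest-power jobs the cheapest available start times.
# Prices, jobs, and the heuristic itself are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    avg_kw: float       # job power profile

# Hypothetical on-peak / off-peak tariff ($ per kWh) by hour of day.
def price_at(hour: int) -> float:
    return 0.12 if 8 <= hour < 20 else 0.05

def schedule(jobs, hours):
    """Assign each job a start hour, cheapest hours going to power-hungry jobs."""
    order = sorted(jobs, key=lambda j: j.avg_kw, reverse=True)
    hours_by_price = sorted(hours, key=price_at)
    plan, cost = [], 0.0
    for job, hour in zip(order, hours_by_price):
        plan.append((job.name, hour))
        cost += job.avg_kw * price_at(hour)   # assume 1-hour slots for simplicity
    return plan, cost

if __name__ == "__main__":
    jobs = [Job("cfd", 90.0), Job("md", 40.0), Job("post", 10.0)]
    plan, cost = schedule(jobs, hours=[2, 10, 14])
    for name, hour in plan:
        print(f"{name:5s} -> start at hour {hour:2d} (price {price_at(hour):.2f} $/kWh)")
    print(f"estimated electricity cost: ${cost:.2f}")
```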


Dependable Systems and Networks | 2010

A practical failure prediction with location and lead time for Blue Gene/P

Ziming Zheng; Zhiling Lan; Rinku Gupta; Susan Coghlan; Peter H. Beckman

Analyzing, understanding and predicting failure is of paramount importance to achieve effective fault management. While various fault prediction methods have been studied in the past, many of them are not practical for use in real systems. In particular, they fail to address two crucial issues: one is to provide location information (i.e., the components on which the failure is expected to occur) and the other is to provide sufficient lead time (i.e., the time interval preceding the failure occurrence). In this paper, we first refine the widely used metrics for evaluating prediction accuracy by including location as well as lead time. We then present a practical failure prediction mechanism for IBM Blue Gene systems. A genetic-algorithm-based method is exploited, which takes into consideration the location and the lead time for failure prediction. We demonstrate the effectiveness of this mechanism by means of real failure logs and job logs collected from the IBM Blue Gene/P system at Argonne National Laboratory. Our experiments show that the presented method can significantly improve fault management (e.g., reducing service unit loss by up to 52.4%) by incorporating location and lead time information in the prediction.
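
The refinement of the usual precision/recall metrics described above can be made concrete: a predicted failure counts as a true positive only if it names the right component (location) and arrives at least some minimum interval before the failure (lead time). The function below is a hypothetical sketch of such a scoring rule under assumed data layouts, not the evaluation code from the paper.

```python
# Precision/recall that credit a prediction only when it matches the failing
# component (location) and precedes the failure by at least `min_lead` seconds.
# Data layout, matching rule, and thresholds are illustrative assumptions.

def score(predictions, failures, min_lead=300.0, horizon=3600.0):
    """predictions/failures: lists of (timestamp_s, component_id)."""
    matched_failures = set()
    true_pos = 0
    for p_time, p_loc in predictions:
        hit = False
        for idx, (f_time, f_loc) in enumerate(failures):
            lead = f_time - p_time
            if (idx not in matched_failures and p_loc == f_loc
                    and min_lead <= lead <= horizon):
                matched_failures.add(idx)
                hit = True
                break
        true_pos += hit
    precision = true_pos / len(predictions) if predictions else 0.0
    recall = len(matched_failures) / len(failures) if failures else 0.0
    return precision, recall

if __name__ == "__main__":
    preds = [(100.0, "R23-M0"), (500.0, "R17-M1"), (900.0, "R05-M0")]
    fails = [(800.0, "R23-M0"), (950.0, "R05-M0")]
    p, r = score(preds, fails)
    print(f"precision={p:.2f} recall={r:.2f}")
```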


Operating Systems Review | 2006

Operating system issues for petascale systems

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan

Petascale supercomputers will be available by 2008. The largest of these complex leadership-class machines will probably have nearly 250K CPUs. Such massively parallel systems pose a number of challenging operating system issues. In this paper, we focus on the issues most important for the system that will first breach the petaflop barrier: synchronization and collective operations, parallel I/O, and fault tolerance.


Journal of Parallel and Distributed Computing | 2010

A study of dynamic meta-learning for failure prediction in large-scale systems

Zhiling Lan; Jiexing Gu; Ziming Zheng; Rajeev Thakur; Susan Coghlan

Despite years of study on failure prediction, it remains an open problem, especially in large-scale systems composed of a vast number of components. In this paper, we present a dynamic meta-learning framework for failure prediction. It aims not only to provide reasonable prediction accuracy but also to be of practical use in realistic environments. Two key techniques are developed to address the technical challenges of failure prediction. One is meta-learning, which boosts prediction accuracy by combining the benefits of multiple predictive techniques. The other is a dynamic approach that obtains failure patterns from a changing training set and extracts effective rules by actively monitoring prediction accuracy at runtime. We demonstrate the effectiveness and practical use of this framework by means of real system logs collected from the production Blue Gene/L systems at Argonne National Laboratory and San Diego Supercomputer Center. Our case studies indicate that the proposed mechanism can provide reasonable prediction accuracy by forecasting up to 82% of the failures, with a runtime overhead of less than 1.0 minute.
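
The two techniques named in the abstract, combining several base predictors and revising their influence as accuracy is monitored at runtime, can be sketched as a weighted ensemble whose weights are updated after each observed outcome. This is a hypothetical illustration of the general meta-learning idea only; the base predictors and the update rule are invented and do not reproduce the framework from the paper.

```python
# Sketch of a dynamic meta-learner: several base predictors vote on whether a
# failure is imminent; votes are weighted by accuracy tracked at runtime, so
# predictors that have been right recently count for more.
# Base predictors, features, and the update rule are illustrative assumptions.

class MetaPredictor:
    def __init__(self, base_predictors):
        self.bases = list(base_predictors)
        self.weights = [1.0] * len(self.bases)

    def predict(self, window):
        """window: recent event features handed to every base predictor."""
        votes = [base(window) for base in self.bases]        # True/False votes
        weight_for = sum(w for w, v in zip(self.weights, votes) if v)
        return weight_for >= 0.5 * sum(self.weights), votes

    def observe(self, votes, failure_occurred, reward=1.1, penalty=0.9):
        # Runtime accuracy monitoring: boost predictors that were right,
        # decay the ones that were wrong.
        for i, vote in enumerate(votes):
            self.weights[i] *= reward if vote == failure_occurred else penalty

if __name__ == "__main__":
    # Toy base predictors keyed on simple features of the event window.
    bases = [
        lambda w: w["fatal_events"] > 0,
        lambda w: w["warning_rate"] > 5.0,
        lambda w: w["temp_alarms"] > 2,
    ]
    meta = MetaPredictor(bases)
    window = {"fatal_events": 1, "warning_rate": 2.0, "temp_alarms": 0}
    decision, votes = meta.predict(window)
    print("predict failure:", decision)
    meta.observe(votes, failure_occurred=True)
    print("updated weights:", [round(w, 2) for w in meta.weights])
```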


Dependable Systems and Networks | 2011

Practical online failure prediction for Blue Gene/P: Period-based vs event-driven

Li Yu; Ziming Zheng; Zhiling Lan; Susan Coghlan

To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction accuracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.
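
The distinction between the two approaches compared here is when the predictor is invoked: a period-based predictor evaluates evidence at fixed intervals, while an event-driven one evaluates it whenever a new event arrives. The sketch below contrasts the two triggers around a shared, toy Bayesian-style scorer; the event types, likelihoods, and log format are invented for illustration and are not taken from the paper.

```python
# Contrast of period-based vs. event-driven invocation of a failure predictor.
# The scorer is a toy likelihood-ratio check over evidence in an observation
# window; all probabilities, events, and windows are illustrative assumptions.
import math

# Hypothetical per-event likelihood ratios: P(event | failure soon) / P(event | healthy).
LIKELIHOOD_RATIO = {"ECC_WARN": 4.0, "LINK_RETRY": 2.5, "FAN_SLOW": 1.5, "INFO": 0.8}
PRIOR_LOG_ODDS = math.log(0.05 / 0.95)      # assumed base rate of imminent failure

def predict(events_in_window, threshold_log_odds=0.0):
    log_odds = PRIOR_LOG_ODDS
    for ev in events_in_window:
        log_odds += math.log(LIKELIHOOD_RATIO.get(ev, 1.0))
    return log_odds > threshold_log_odds

def period_based(log, window_s=600, period_s=600):
    """Evaluate at fixed times, using the events of the preceding window."""
    end = max(t for t, _ in log)
    alarms, t = [], period_s
    while t <= end + period_s:
        window = [ev for ts, ev in log if t - window_s <= ts < t]
        if predict(window):
            alarms.append(t)
        t += period_s
    return alarms

def event_driven(log, window_s=600):
    """Evaluate whenever a new event arrives, over the trailing window."""
    alarms = []
    for ts, _ in log:
        window = [ev for t, ev in log if ts - window_s < t <= ts]
        if predict(window):
            alarms.append(ts)
    return alarms

if __name__ == "__main__":
    log = [(30, "INFO"), (200, "ECC_WARN"), (340, "LINK_RETRY"),
           (410, "ECC_WARN"), (900, "INFO")]
    print("period-based alarms at:", period_based(log))
    print("event-driven alarms at:", event_driven(log))
```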


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Measuring Power Consumption on IBM Blue Gene/Q

Sean Wallace; Venkatram Vishwanath; Susan Coghlan; Zhiling Lan; Michael E. Papka

In addition to pushing what is possible computationally, state-of-the-art supercomputers are also pushing what is acceptable in terms of power consumption. Despite hardware manufacturers researching and developing efficient system components (e.g., processor, memory, etc.), the power consumption of a complete system remains an understudied research area. Because of the complexity and unpredictable workloads of these systems, estimating the power consumption of a full system is a nontrivial task. In this paper, we provide system-level power usage and temperature analysis from early access to Argonne's latest generation of IBM Blue Gene supercomputers, the Mira Blue Gene/Q system. The analysis is provided from the point of view of jobs running on the system. We describe the important implications these system-level measurements have, as well as the challenges they present. Using profiling code on benchmarks, we also look at the new tools this latest generation of supercomputer provides and gauge their usefulness and how well they match up against the environmental data.
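
One way to attribute system power data to jobs, as described here, is to integrate sampled power readings over the interval a job occupied its allocated hardware. The sketch below is a generic, hypothetical illustration of that bookkeeping; it does not use the actual Blue Gene/Q environmental database or its tools, and the sample format and numbers are invented.

```python
# Sketch of attributing sampled power data to jobs: integrate power samples
# (trapezoidal rule) over the interval a job occupied its partition, yielding
# an energy and average-power estimate per job.  Sample format, job records,
# and sampling interval are illustrative assumptions.

def job_energy(samples, start, end):
    """samples: sorted (timestamp_s, kw) readings for the job's partition."""
    inside = [(t, kw) for t, kw in samples if start <= t <= end]
    if len(inside) < 2:
        return 0.0, 0.0
    kwh = 0.0
    for (t0, p0), (t1, p1) in zip(inside, inside[1:]):
        kwh += 0.5 * (p0 + p1) * (t1 - t0) / 3600.0   # trapezoidal rule
    avg_kw = kwh * 3600.0 / (inside[-1][0] - inside[0][0])
    return kwh, avg_kw

if __name__ == "__main__":
    # Hypothetical 4-minute job with power sampled every 60 seconds.
    samples = [(0, 48.0), (60, 52.0), (120, 61.0), (180, 60.0), (240, 55.0)]
    kwh, avg_kw = job_energy(samples, start=0, end=240)
    print(f"energy   : {kwh:.3f} kWh")
    print(f"avg power: {avg_kw:.1f} kW")
```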


International Conference on Performance Engineering | 2014

A power-measurement methodology for large-scale, high-performance computing

Thomas R. W. Scogland; Craig P. Steffen; Torsten Wilde; Florent Parent; Susan Coghlan; Natalie J. Bates; Wu-chun Feng; Erich Strohmaier

Improvement in the energy efficiency of supercomputers can be accelerated by improving the quality and comparability of efficiency measurements. The ability to generate accurate measurements at extreme scale is just now emerging. The realization of system-level measurement capabilities can be accelerated by a commonly adopted, high-quality measurement methodology for use while running a workload, typically a benchmark. This paper describes a methodology that has been developed collaboratively through the Energy Efficient HPC Working Group to support architectural analysis and comparative measurements for rankings such as the Top500 and Green500. To support measurements with varying amounts of effort and equipment, we present three distinct levels of measurement, which provide increasing levels of accuracy. Level 1 is similar to the Green500 run rules today: a single average power measurement extrapolated from a subset of a machine. Level 2 is more comprehensive but still widely achievable. Level 3 is the most rigorous of the three but is possible only at a few sites; however, it generates a high-quality result that exposes details the other methodologies may miss. In addition, we present case studies from the Leibniz Supercomputing Centre (LRZ), Argonne National Laboratory (ANL), and Calcul Québec Université Laval that explore the benefits and difficulties of gathering high-quality, system-level measurements on large-scale machines.
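
The difference between the measurement levels can be illustrated numerically: a Level-1-style measurement averages power over part of the machine and extrapolates, whereas a more rigorous measurement averages whole-system samples over the full core phase of the workload. The calculation below is a hypothetical worked example with invented numbers; it is not the formal definition of the levels from the methodology.

```python
# Worked example: extrapolated-subset average power vs. full-system average.
# All sample values, the subset size, and the rack count are invented to
# illustrate why the more complete measurement can differ from the extrapolation.

def average(xs):
    return sum(xs) / len(xs)

if __name__ == "__main__":
    # Hypothetical per-sample power of one instrumented rack (kW) during the run,
    # for a machine assumed to have 48 identical racks.
    rack_kw = [52.0, 55.0, 57.0, 56.0, 53.0]
    racks_total = 48
    level1_like = average(rack_kw) * racks_total          # extrapolate from a subset

    # Hypothetical whole-system samples (kW) over the same core phase, which also
    # capture shared infrastructure and rack-to-rack variation.
    system_kw = [2610.0, 2675.0, 2720.0, 2690.0, 2640.0]
    full_system = average(system_kw)

    print(f"extrapolated from one rack : {level1_like:.0f} kW")
    print(f"measured for whole system  : {full_system:.0f} kW")
    print(f"difference                 : {full_system - level1_like:+.0f} kW")
```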

Collaboration


Dive into Susan Coghlan's collaborations.

Top Co-Authors

Zhiling Lan
Illinois Institute of Technology

Michael E. Papka
Northern Illinois University

Peter H. Beckman
Argonne National Laboratory

Rajeev Thakur
Argonne National Laboratory

Sean Wallace
Illinois Institute of Technology

Kamil Iskra
Argonne National Laboratory

Kazutomo Yoshii
Argonne National Laboratory