
Publications


Featured research published by James M. Brandt.


International Parallel and Distributed Processing Symposium | 2009

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

James M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model offers the lure of savings in ownership, operation, and maintenance costs, and thus looks attractive to organizations that currently invest millions to hundreds of millions of dollars annually in High Performance Computing (HPC) platforms to support large-scale scientific simulation codes. Given the interconnect bandwidth and topologies currently utilized in these commercial offerings, however, the only viable HPC market at present would be small-memory-footprint, embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, which are crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they make it possible to dynamically provide, and respond to, information about platform and application state, enabling more appropriate, efficient, and flexible use of the resources that are key to enabling HPC. Additionally, such tools could benefit non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general.


International Conference on Cluster Computing | 2014

Demonstrating improved application performance using dynamic monitoring and task mapping

James M. Brandt; Karen Dragon Devine; Ann C. Gentile; Kevin Pedretti

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stall metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which suitable metrics and route knowledge can be obtained.
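
The remapping step can be sketched generically. The following is a minimal, hypothetical illustration, not the authors' Cray XE/XK implementation: it assumes a per-node congestion score (e.g., a normalized credit-stall rate) obtained from system-wide monitoring and a per-rank communication volume, and greedily places the heaviest-communicating ranks on the least congested nodes.

```python
# Minimal, hypothetical sketch of congestion-aware task-to-node mapping.
# Metric names and data are illustrative; the paper's framework uses
# Cray XE/XK route knowledge and credit-stall counters gathered at run time.

def remap_tasks(comm_volume_per_rank, congestion_per_node):
    """Return a rank -> node assignment that places the heaviest
    communicators on the least congested nodes (greedy heuristic)."""
    ranks = sorted(range(len(comm_volume_per_rank)),
                   key=lambda r: comm_volume_per_rank[r], reverse=True)
    nodes = sorted(range(len(congestion_per_node)),
                   key=lambda n: congestion_per_node[n])
    if len(ranks) > len(nodes):
        raise ValueError("sketch assumes one rank per node")
    return {rank: node for rank, node in zip(ranks, nodes)}

if __name__ == "__main__":
    comm_volume = [5.0, 1.0, 9.0, 3.0]          # bytes sent per rank (illustrative)
    congestion = [0.2, 0.9, 0.1, 0.5]           # e.g., normalized credit-stall rate per node
    print(remap_tasks(comm_volume, congestion))  # {2: 2, 0: 0, 3: 3, 1: 1}
```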


Archive | 2012

Report of experiments and evidence for ASC L2 milestone 4467: demonstration of a legacy application's path to exascale.

Matthew L. Curry; Kurt Brian Ferreira; Kevin Pedretti; Vitus J. Leung; Kenneth Moreland; Gerald Fredrick Lofstead; Ann C. Gentile; Ruth Klundt; H. Lee Ward; James H. Laros; Karl Scott Hemmert; Nathan D. Fabian; Michael J. Levenhagen; Ronald B. Brightwell; Richard Frederick Barrett; Kyle Bruce Wheeler; Suzanne M. Kelly; Arun F. Rodrigues; James M. Brandt; David C. Thompson; John P. VanDyke; Ron A. Oldfield; Thomas Tucker

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented at lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward-looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on exascale-class hardware. The purpose of this report is to demonstrate that the team has completed milestone 4467, "Demonstration of a Legacy Application's Path to Exascale." Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Regarding the task of determining where the breaking point is for an existing highly scalable application, Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, combined with knowledge of system software and application source code.


International Conference on Cluster Computing | 2015

Toward Rapid Understanding of Production HPC Applications and Systems

Anthony Michael Agelastos; Benjamin A. Allan; James M. Brandt; Ann C. Gentile; Sophia Lefantzi; Stephen Todd Monk; Jeffrey Brandon Ogden; Mahesh Rajan; Joel O. Stevenson

A detailed understanding of HPC applications' resource needs and their complex interactions with each other and with HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.


Grid Computing | 2010

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

James M. Brandt; Frank Xiaoxiao Chen; Vincent De Sapio; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Accurate failure prediction, in conjunction with efficient process migration facilities (including some Cloud constructs), can enable failure avoidance in large-scale high performance computing (HPC) platforms. In this work we demonstrate a prototype system that integrates our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole-system approach to failure avoidance. This work utilizes a failure scenario based on a real-world HPC case study.
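
As a rough, hypothetical illustration of the prediction-driven avoidance idea (not the prototype itself), the sketch below assumes the predictor emits a per-node failure probability and triggers migration only when acting on the prediction is expected to be cheaper than losing work to a failure; the cost figures and the migrate() hook are assumptions.

```python
# Hypothetical sketch of prediction-driven failure avoidance.
# The probabilities, costs, and migrate() hook are illustrative; the paper's
# prototype couples a probabilistic predictor with virtualization-based migration.

MIGRATION_COST_S = 60.0      # assumed cost of migrating a job off a node
LOST_WORK_COST_S = 3600.0    # assumed work lost if the node fails mid-run

# Migrate when the expected loss avoided exceeds the migration cost:
# p * LOST_WORK_COST_S > MIGRATION_COST_S  =>  p > threshold
THRESHOLD = MIGRATION_COST_S / LOST_WORK_COST_S

def maybe_migrate(node, failure_probability, migrate):
    """Trigger the supplied migrate() action if the predicted failure
    probability for `node` makes migration the cheaper option."""
    if failure_probability > THRESHOLD:
        migrate(node)
        return True
    return False

if __name__ == "__main__":
    maybe_migrate("node042", 0.05,
                  migrate=lambda n: print(f"migrating workload off {n}"))
```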


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010

Combining virtualization, resource characterization, and resource management to enable efficient high performance compute platforms through intelligent dynamic resource allocation

James M. Brandt; Frank Xiaoxiao Chen; Vincent De Sapio; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Improved resource utilization and fault tolerance of large-scale HPC systems can be achieved through fine-grained, intelligent, and dynamic resource (re)allocation. We explore components and enabling technologies applicable to creating a system to provide this capability, specifically: 1) scalable fine-grained monitoring and analysis to inform resource allocation decisions; 2) virtualization to enable dynamic reconfiguration; 3) resource management for the combined physical and virtual resources; and 4) orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. We discuss both general and HPC-centric issues that impact the design of such a system. Finally, we present our prototype system, giving both design details and examples of its application in real-world scenarios.
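
The abstract enumerates components rather than an algorithm, but the overall orchestration can be sketched schematically, as shown below. The collect/analyze/apply hooks and the example metrics are placeholders standing in for the monitoring, analysis, and virtualization-backed resource-management components; this is not the prototype's design.

```python
# Schematic orchestration loop (illustrative only): the collect/analyze/apply
# hooks stand in for the monitoring, analysis, and virtualization-backed
# resource-management components described above.
import time

def orchestrate(collect_metrics, plan_reallocation, apply_plan,
                interval_s=10.0, iterations=3):
    """Periodically evaluate platform state and apply resource changes."""
    for _ in range(iterations):
        metrics = collect_metrics()          # fine-grained monitoring data
        plan = plan_reallocation(metrics)    # analysis -> (re)allocation decisions
        if plan:
            apply_plan(plan)                 # e.g., migrate or resize virtual resources
        time.sleep(interval_s)

if __name__ == "__main__":
    orchestrate(
        collect_metrics=lambda: {"node1_load": 0.95, "node2_load": 0.10},
        plan_reallocation=lambda m: (["move a VM from node1 to node2"]
                                     if m["node1_load"] > 0.9 else []),
        apply_plan=print,
        interval_s=0.1, iterations=2)
```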


Dependable Systems and Networks | 2010

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

James M. Brandt; Frank Xiaoxiao Chen; Vincent De Sapio; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors.
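
As a generic illustration of the kind of cost-benefit arithmetic such quantification involves (not the paper's context-relevant methodology), the sketch below estimates the net compute time a predictor saves from an assumed failure rate, recall, false-alarm rate, lost-work cost, and response cost.

```python
# Illustrative cost-benefit estimate for a failure predictor (not the paper's
# specific methodology). All rates and costs below are assumed values.

def predictor_net_benefit(failures_per_week, recall, false_alarms_per_week,
                          lost_work_per_failure_s, response_cost_s):
    """Net node-seconds saved per week by acting on predictions.

    recall: fraction of real failures predicted in time to respond
    response_cost_s: cost of one migration/checkpoint triggered by a prediction
    """
    saved = failures_per_week * recall * lost_work_per_failure_s
    spent = (failures_per_week * recall + false_alarms_per_week) * response_cost_s
    return saved - spent

if __name__ == "__main__":
    # Assumed example: 4 failures/week, 70% recall, 20 false alarms/week,
    # 2 hours of work lost per unmitigated failure, 1 minute per response.
    print(predictor_net_benefit(4, 0.7, 20, 2 * 3600, 60))  # positive => worth deploying
```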


Archive | 2005

Meaningful statistical analysis of large computational clusters.

Ann C. Gentile; Youssef M. Marzouk; James M. Brandt; Philippe Pierre Pebay

Effective monitoring of large computational clusters demands the analysis of a vast amount of raw data from a large number of machines. The fundamental interactions of the system are not, however, well-defined, making it difficult to draw meaningful conclusions from this data, even if one were able to efficiently handle and process it. In this paper we show that computational clusters, because they are composed of a large number of identical machines, behave in a statistically meaningful fashion. We can therefore employ normal statistical methods to derive information about individual systems and their environment and to detect problems sooner than with traditional mechanisms. We discuss design details necessary to use these methods on a large system in a timely and low-impact fashion.
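
The key observation, that a cluster of nominally identical machines forms a natural statistical population, can be illustrated with a simple peer-comparison outlier test. The sketch below is only illustrative of that idea; the metric, threshold, and data are assumed, and the report's actual methods are not reproduced here.

```python
# Illustrative peer-comparison outlier check across nominally identical nodes.
# Metric values are made up; the report's actual statistical machinery differs.
import statistics

def flag_outlier_nodes(metric_by_node, z_threshold=3.0):
    """Return nodes whose metric deviates from the cluster mean by more
    than z_threshold standard deviations."""
    values = list(metric_by_node.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [node for node, v in metric_by_node.items()
            if abs(v - mean) / stdev > z_threshold]

if __name__ == "__main__":
    cpu_temp_c = {f"node{i:03d}": 55.0 + (i % 5) for i in range(100)}
    cpu_temp_c["node042"] = 92.0            # one anomalously hot node
    print(flag_outlier_nodes(cpu_temp_c))   # ['node042']
```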


International Conference on Cluster Computing | 2015

Extending LDMS to Enable Performance Monitoring in Multi-core Applications

Steven D. Feldman; Deli Zhang; Damian Dechev; James M. Brandt

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions, and each method may exhibit different behavior in different use cases. These use cases may vary with regard to the workload being executed, the number of parallel tasks, dependencies between these tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge of how these algorithms utilize the hardware. In our experience, we have found a lack of adequate tools to gain such knowledge. To address this, we have enhanced and implemented additional data sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior, including contention for node-level shared resources, of multi-core applications for a diverse set of use cases. We are currently exploring how the values reported are affected by the level of concurrency, the synchronization methodologies, and progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.
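
The LDMS sampler modules themselves are not reproduced here, but the kind of raw data they collect comes from standard Linux interfaces. The sketch below, which is independent of the LDMS API, reads a package energy counter from the RAPL powercap sysfs interface to derive average power over an interval; the path and its readability depend on the hardware, kernel, and permissions.

```python
# Illustrative direct read of an Intel RAPL energy counter via sysfs.
# This is NOT the LDMS sampler API; it only shows one underlying data source.
# Requires a RAPL-capable CPU, a powercap-enabled kernel, and read permission.
import time

RAPL_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0

def read_energy_uj(path=RAPL_ENERGY_FILE):
    with open(path) as f:
        return int(f.read().strip())

def average_power_watts(interval_s=1.0):
    """Sample the cumulative energy counter twice and convert to watts.
    (Counter wraparound is ignored in this sketch.)"""
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    return (e1 - e0) / 1e6 / interval_s   # microjoules -> joules -> watts

if __name__ == "__main__":
    try:
        print(f"package power: {average_power_watts():.1f} W")
    except (FileNotFoundError, PermissionError) as exc:
        print(f"RAPL counter not readable on this system: {exc}")
```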


Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization | 2015

Infrastructure for In Situ System Monitoring and Application Data Analysis

James M. Brandt; Karen Dragon Devine; Ann C. Gentile

We present an architecture for high-performance computers that integrates in situ analysis of hardware and system monitoring data with application-specific data to reduce application runtimes and improve overall platform utilization. Large-scale high-performance computing systems typically use monitoring as a tool unrelated to application execution. Monitoring data flows from sampling points to a centralized off-system machine for storage and post-processing when root-cause analysis is required. Along the way, it may also be used for instantaneous threshold-based error detection. Applications can know their own state and possibly the state of their allocated resources, but typically they have no insight into globally shared resource state that may affect their execution. By analyzing performance data in situ rather than off-line, we enable applications to make real-time decisions about their resource utilization. We address the particular case of in situ network congestion analysis and its potential to improve task placement and data partitioning. We present several design and analysis considerations.
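
As a toy illustration of the query path in such an architecture (not the paper's actual interface or transport), the sketch below keeps the most recent per-node congestion sample in a small in-memory store that an application could query at run time before making placement or partitioning decisions; all names and values are assumptions.

```python
# Toy in-situ state store and application-side query (illustrative only;
# the paper's architecture, transport, and metric names are not reproduced).

class InSituStore:
    """Holds the most recent congestion sample per node."""
    def __init__(self):
        self.latest = {}

    def push(self, node, congestion):
        # Called from the monitoring/analysis side as samples arrive.
        self.latest[node] = congestion

    def query_congested(self, nodes, threshold=0.8):
        # Called from the application side before placement/partitioning.
        return [n for n in nodes if self.latest.get(n, 0.0) >= threshold]

if __name__ == "__main__":
    store = InSituStore()
    store.push("node001", 0.95)
    store.push("node002", 0.20)
    print(store.query_congested(["node001", "node002", "node003"]))  # ['node001']
```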

Collaboration


Dive into James M. Brandt's collaborations.

Top Co-Authors

Ann C. Gentile, Sandia National Laboratories
David C. Thompson, University of Texas at Austin
Matthew H. Wong, Sandia National Laboratories
Diana C. Roe, Sandia National Laboratories
Jackson R. Mayo, Sandia National Laboratories
Vincent De Sapio, Sandia National Laboratories
Frank Xiaoxiao Chen, Sandia National Laboratories
Narate Taerat, Louisiana Tech University
Benjamin A. Allan, Sandia National Laboratories