Jackson R. Mayo | Researchain


Publication

Featured research published by Jackson R. Mayo.

International Journal of Distributed Systems and Technologies | 2010

A Simulator for Large-Scale Parallel Computer Architectures

Helgi Adalsteinsson; Scott Cranford; David A. Evensky; Joseph P. Kenny; Jackson R. Mayo; Ali Pinar; Curtis L. Janssen

Efficient design of hardware and software for large-scale parallel execution requires detailed understanding of the interactions between the application, computer, and network. The authors have developed a macro-scale simulator (SST/macro) that permits the coarse-grained study of distributed-memory applications. In the presented work, applications using the Message Passing Interface (MPI) are simulated; however, the simulator is designed to allow inclusion of other programming models. The simulator is driven from either a trace file or a skeleton application. Trace files can be either a standard format (Open Trace Format) or a more detailed custom format (DUMPI). The simulator architecture is modular, allowing it to easily be extended with additional network models, trace file formats, and more detailed processor models. This paper describes the design of the simulator, provides performance results, and presents studies showing how application performance is affected by machine characteristics.

International Parallel and Distributed Processing Symposium | 2009

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

James M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model has the lure of savings of ownership, operation, and maintenance costs, and thus sounds like an attractive solution for people who currently invest millions to hundreds of millions of dollars annually on High Performance Computing (HPC) platforms in order to support large-scale scientific simulation codes. Given the current interconnect bandwidth and topologies utilized in these commercial offerings, however, the only current viable market in HPC would be small-memory-footprint embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they bring the ability to dynamically provide and respond to information about the platform and application state and would enable more appropriate, efficient, and flexible use of the resources key to enabling HPC. Additionally such tools could be of benefit to non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general.

International Parallel and Distributed Processing Symposium | 2008

Ovis-2: A robust distributed architecture for scalable RAS

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We describe some of the sophisticated statistical analysis, 3-D visualization, and use cases for these. Using this framework and associated tools allows the engineer to explore the behaviors and complex interactions of low level system elements while simultaneously giving the system administrator their desired level of detail with respect to ongoing system and component health.

Proceedings of the 2009 workshop on Resiliency in high performance | 2009

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Jim M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advance warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.

Cluster Computing and the Grid | 2008

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close to optimal set of processing elements given the available pool and the reliability requirements of the application.
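The selection idea in this abstract can be illustrated with a small sketch. It is not the authors' method: it assumes a single monitored metric per element, uses a plain z-score against the pool mean as the "position in the distribution", and all names and readings (`rank_elements`, `select_elements`, `node01`, the temperature-like values) are hypothetical.

```python
from statistics import mean, pstdev


def rank_elements(readings):
    """Order elements from most typical to most atypical.

    `readings` maps an element name to one monitored value. Each element is
    scored by its absolute distance from the pool mean, in standard deviations.
    """
    mu = mean(readings.values())
    sigma = pstdev(readings.values())
    if sigma == 0:
        # Every element reads the same, so no element stands out.
        return sorted(readings)
    score = {name: abs(value - mu) / sigma for name, value in readings.items()}
    # Ties are broken by name so the ordering is deterministic.
    return sorted(readings, key=lambda name: (score[name], name))


def select_elements(readings, count):
    """Return the `count` elements closest to the centre of the distribution."""
    return rank_elements(readings)[:count]


# Hypothetical readings for five nodes; node04 is the outlier.
readings = {
    "node01": 41.0,
    "node02": 42.0,
    "node03": 40.0,
    "node04": 58.0,
    "node05": 41.5,
}
selected = select_elements(readings, 3)
print(selected)
```

A scheduler could pass a larger or smaller `count` depending on the job size, and a stricter reliability requirement could be expressed as a cutoff on the score instead of a fixed count.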

IEEE International Conference on High Performance Computing Data and Analytics | 2015

Local recovery and failure masking for stencil-based applications at extreme scales

Marc Gamell; Keita Teranishi; Michael A. Heroux; Jackson R. Mayo; Hemanth Kolla; Jacqueline H. Chen; Manish Parashar

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.

High Performance Distributed Computing | 2015

Exploring Failure Recovery for Stencil-based Applications at Extreme Scales

Marc Gamell; Keita Teranishi; Michael A. Heroux; Jackson R. Mayo; Hemanth Kolla; Jacqueline H. Chen; Manish Parashar

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.

Physical Review Letters | 2011

Influence and Dynamic Behavior in Random Boolean Networks

C. Seshadhri; Yevgeniy Vorobeychik; Jackson R. Mayo; Robert C. Armstrong; Joseph R. Ruthruff

We present a rigorous mathematical framework for analyzing dynamics of a broad class of boolean network models. We use this framework to provide the first formal proof of many of the standard critical transition results in boolean network analysis, and offer analogous characterizations for novel classes of random boolean networks. We show that some of the assumptions traditionally made in the more common mean-field analysis of boolean networks do not hold in general. For example, we offer evidence that imbalance (internal inhomogeneity) of transfer functions is a crucial feature that tends to drive quiescent behavior far more strongly than previously observed.
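For readers unfamiliar with the model class, the following is a minimal sketch of a random Boolean network with synchronous updates. It is an illustration under stated assumptions, not the framework from the paper: each node has `k` randomly chosen inputs and a random truth table whose entries are 1 with probability `bias`, and the function names (`make_network`, `step`, `hamming_after`) are hypothetical. The final helper measures how far a single flipped bit spreads, which is the usual way to distinguish quiescent from chaotic behavior.

```python
import random


def make_network(n, k, bias, rng):
    """Build a random Boolean network with n nodes and k inputs per node.

    Each truth-table entry is 1 with probability `bias`, so a bias far from
    0.5 gives imbalanced transfer functions.
    """
    inputs = [[rng.randrange(n) for _ in range(k)] for _ in range(n)]
    tables = [
        [1 if rng.random() < bias else 0 for _ in range(2 ** k)]
        for _ in range(n)
    ]
    return inputs, tables


def step(state, network):
    """Apply one synchronous update to every node."""
    inputs, tables = network
    new_state = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for source in node_inputs:
            # Pack the input bits into an index into the truth table.
            index = (index << 1) | state[source]
        new_state.append(table[index])
    return new_state


def hamming_after(network, state, flipped_node, steps):
    """Flip one bit, run both copies for `steps` updates, count differing nodes."""
    perturbed = list(state)
    perturbed[flipped_node] ^= 1
    a, b = list(state), perturbed
    for _ in range(steps):
        a, b = step(a, network), step(b, network)
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
network = make_network(n=50, k=2, bias=0.5, rng=rng)
start = [rng.randrange(2) for _ in range(50)]
print(hamming_after(network, start, flipped_node=0, steps=10))
```

Repeating the measurement over many networks and values of `bias` gives an empirical view of the transition that mean-field analysis places at 2k·bias·(1 − bias) = 1.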

Cyber Security and Information Intelligence Research Workshop | 2009

Leveraging complexity in software for cybersecurity

Robert C. Armstrong; Jackson R. Mayo

A method for assessing statistically quantifiable improvements in security for software vulnerabilities is presented. Drawing on concepts in complexity theory, undecidability, and previous work in high-reliability systems, we show that ensembles of similar implementations have statistical value even though each by itself is inscrutable. Research questions are identified that may allow practical application of these concepts.

Grid Computing | 2010

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

James M. Brandt; Frank Xiaoxiao Chen; Vincent De Sapio; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Accurate failure prediction in conjunction with efficient process migration facilities including some Cloud constructs can enable failure avoidance in large-scale high performance computing (HPC) platforms. In this work we demonstrate a prototype system that incorporates our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole system approach to failure avoidance. This work utilizes a failure scenario based on a real-world HPC case study.
