Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Dorian C. Arnold is active.

Publication


Featured research published by Dorian C. Arnold.


Conference on High Performance Computing (Supercomputing) | 2003

MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools

Philip C. Roth; Dorian C. Arnold; Barton P. Miller

We present MRNet, a software-based multicast/reduction network for building scalable performance and system administration tools. MRNet supports multiple simultaneous, asynchronous collective communication operations. MRNet is flexible, allowing tool builders to tailor its process network topology to suit their tools' requirements and the underlying system's capabilities. MRNet is extensible, allowing tool builders to incorporate custom data reductions to augment its collection of built-in reductions. We evaluated MRNet in a simple test tool and also integrated it into an existing, real-world performance tool with up to 512 tool back-ends. In the real-world tool, we used MRNet not only for multicast and simple data reductions but also with custom histogram and clock skew detection reductions. In our experiments, the MRNet-based tools showed significantly better performance than the tools without MRNet for average message latency and throughput, overall tool start-up latency, and performance data processing throughput.
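To illustrate the core idea of a tree-based overlay with pluggable reductions, here is a minimal Python sketch: leaf back-ends hold data, and each internal node applies a reduction filter as data flows toward the front end. The class and filter names are illustrative only and do not reflect MRNet's actual C++ API.

```python
class TreeNode:
    """Internal node of the overlay tree; leaves are plain data values."""
    def __init__(self, children, reduction):
        self.children = children    # child TreeNodes or leaf data values
        self.reduction = reduction  # filter applied to the children's results

    def gather(self):
        """Recursively collect data from the leaves, reducing at each internal node."""
        values = [c.gather() if isinstance(c, TreeNode) else c for c in self.children]
        return self.reduction(values)

# A "custom" reduction analogous to a histogram filter: merge per-back-end histograms.
def merge_histograms(hists):
    merged = {}
    for h in hists:
        for bucket, count in h.items():
            merged[bucket] = merged.get(bucket, 0) + count
    return merged

# Four back-ends report latency histograms; two intermediate nodes reduce them
# before the front end sees a single merged result.
backends = [{"0-10ms": 3, "10-20ms": 1}, {"0-10ms": 2}, {"10-20ms": 5}, {"20-30ms": 1}]
left = TreeNode(backends[:2], merge_histograms)
right = TreeNode(backends[2:], merge_histograms)
front_end = TreeNode([left, right], merge_histograms)
print(front_end.gather())   # {'0-10ms': 5, '10-20ms': 6, '20-30ms': 1}
```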


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

Evaluating the viability of process replication reliability for exascale systems

Kurt Brian Ferreira; Jon Stearley; James H. Laros; Ron A. Oldfield; Kevin Pedretti; Ronald B. Brightwell; Rolf Riesen; Patrick G. Bridges; Dorian C. Arnold

As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
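The trade-off the paper studies can be illustrated with a back-of-the-envelope model: checkpoint/restart efficiency shrinks as the system MTBF drops with node count, while dual redundancy caps efficiency near 50% but masks most failures. The Young approximation for the optimal checkpoint interval and all parameter values below are generic illustrative assumptions, not the paper's actual model or data.

```python
import math

def ckpt_restart_efficiency(ckpt_cost_s, system_mtbf_s):
    """Approximate machine efficiency under checkpoint/restart alone
    (Young's interval; overhead = commit cost + expected rework after a failure)."""
    t_opt = math.sqrt(2 * ckpt_cost_s * system_mtbf_s)
    overhead = ckpt_cost_s / t_opt + t_opt / (2 * system_mtbf_s)
    return max(0.0, 1.0 - overhead)

def replication_efficiency():
    """Dual redundancy: half the nodes do redundant work (<= 50% efficiency),
    but most failures are masked, so checkpoint overhead becomes negligible."""
    return 0.5

NODE_MTBF_S = 5 * 365 * 24 * 3600   # assume a 5-year MTBF per node
CKPT_COST_S = 5 * 60                # assume a 5-minute checkpoint commit

for nodes in (10_000, 100_000, 1_000_000):
    system_mtbf_s = NODE_MTBF_S / nodes
    cr = ckpt_restart_efficiency(CKPT_COST_S, system_mtbf_s)
    winner = "replication" if replication_efficiency() > cr else "checkpoint/restart"
    print(f"{nodes:>9} nodes: C/R efficiency ~ {cr:.2f} -> {winner}")
```

Even with these rough assumptions, the sketch reproduces the qualitative crossover: checkpoint/restart wins at moderate scale, while replication becomes attractive once the system MTBF falls below a few checkpoint periods.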


International Parallel and Distributed Processing Symposium | 2007

Stack Trace Analysis for Large Scale Debugging

Dorian C. Arnold; Dong H. Ahn; B.R. de Supinski; Gregory L. Lee; Barton P. Miller; Martin Schulz

We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce problem exploration spaces from thousands of processes to a few by sampling stack traces to form process equivalence classes, groups of processes exhibiting similar behavior. We can then use full-featured debuggers on representatives from these behavior classes for root cause analysis. STAT scalably collects stack traces over a sampling period to assemble a profile of the application's behavior. STAT routines process the samples to form a call graph prefix tree that encodes common behavior classes over the program's process space and time. STAT leverages MRNet, an infrastructure for tool control and data analyses, to overcome scalability barriers faced by heavy-weight debuggers. We present STAT's design and an evaluation that shows STAT gathers informative process traces from thousands of processes with sub-second latencies, a significant improvement over existing tools. Our case studies of production codes verify that STAT supports the quick identification of errors that were previously difficult to locate.
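The core data structures can be sketched as follows: identical stack traces collapse into equivalence classes, and all traces merge into a call-prefix tree keyed by function names, with each node recording the ranks that reached it. This is a minimal Python illustration under a hypothetical hang scenario, not STAT's implementation.

```python
from collections import defaultdict

def equivalence_classes(traces):
    """Processes (MPI ranks) with identical stack traces form one behavior class."""
    classes = defaultdict(set)
    for rank, frames in traces.items():
        classes[tuple(frames)].add(rank)
    return classes

def build_prefix_tree(traces):
    """Merge traces into a call-prefix tree; each node records the ranks that reached it."""
    tree = {"ranks": set(traces), "children": {}}
    for rank, frames in traces.items():
        node = tree
        for frame in frames:
            node = node["children"].setdefault(frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return tree

# Hypothetical hang: ranks 0-2 are blocked in MPI_Barrier, rank 3 is stuck in compute().
traces = {
    0: ["main", "solver", "MPI_Barrier"],
    1: ["main", "solver", "MPI_Barrier"],
    2: ["main", "solver", "MPI_Barrier"],
    3: ["main", "solver", "compute"],
}

for frames, ranks in equivalence_classes(traces).items():
    print(sorted(ranks), "->", " > ".join(frames))
# [0, 1, 2] -> main > solver > MPI_Barrier
# [3] -> main > solver > compute

tree = build_prefix_tree(traces)
solver = tree["children"]["main"]["children"]["solver"]
print(sorted(solver["children"]))   # ['MPI_Barrier', 'compute'] -- where behavior diverges
```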


European Conference on Parallel Processing | 2000

Request Sequencing: Optimizing Communication for the Grid

Dorian C. Arnold; Dieter Bachmann; Jack J. Dongarra

As we work to make the use of Computational Grids seamless, the allocation of resources in these dynamic environments is proving to be very unwieldy. In this paper, we introduce, describe and evaluate a technique we call request sequencing. Request sequencing groups together requests for Grid services to exploit common characteristics of these requests and minimize network traffic. The purpose of this work is to develop and validate this approach. We show how request sequencing can affect scheduling policies and enable more expedient resource allocation methods. We also discuss some of the reasons for our design, offer initial results and discuss issues that remain outstanding for future research.
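The idea can be sketched as a simple transfer-planning pass over a sequence of requests: any argument produced by an earlier request in the sequence can stay on the server instead of round-tripping through the client. The service names and helper function below are hypothetical, used only to illustrate the dependency analysis.

```python
def plan_sequence(requests, results_wanted):
    """requests: ordered list of (service, inputs, outputs) tuples.
    Returns which objects the client must upload, which can stay server-side,
    and which must be shipped back to the client."""
    produced, uploads, server_resident = set(), set(), set()
    for service, inputs, outputs in requests:
        for arg in inputs:
            (server_resident if arg in produced else uploads).add(arg)
        produced.update(outputs)
    downloads = produced & set(results_wanted)
    return uploads, server_resident, downloads

# Hypothetical sequence: C = matmul(A, B), then D = matadd(C, E); the client only needs D.
requests = [
    ("matmul", ["A", "B"], ["C"]),
    ("matadd", ["C", "E"], ["D"]),
]
uploads, resident, downloads = plan_sequence(requests, results_wanted=["D"])
print("ship to server:", sorted(uploads))     # ['A', 'B', 'E']
print("stays on server:", sorted(resident))   # ['C'] -- never round-trips to the client
print("ship to client:", sorted(downloads))   # ['D']
```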


IEEE International Conference on High Performance Computing, Data and Analytics | 2013

Using simulation to explore distributed key-value stores for extreme-scale system services

Ke Wang; Abhishek Kulkarni; Michael Lang; Dorian C. Arnold; Ioan Raicu

Owing to the significantly higher rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive and self-healing. A majority of HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads to support extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and presenting a KVS-based distributed job launch prototype.
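One design point such a simulation can explore is a key-value store partitioned by consistent hashing with N-way replication; the short sketch below shows how keys map to replica sets. The server names and replication factor are illustrative assumptions, not tied to any specific HPC service.

```python
import bisect
import hashlib

class ConsistentHashKVS:
    """Toy key-value store: servers placed on a hash ring, keys replicated clockwise."""
    def __init__(self, servers, replicas=3):
        self.replicas = replicas
        self.ring = sorted((self._hash(s), s) for s in servers)
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owners(self, key):
        """Return the `replicas` servers responsible for a key."""
        start = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.replicas)]

kvs = ConsistentHashKVS([f"server{i}" for i in range(8)], replicas=3)
print(kvs.owners("job:1042"))   # the three servers holding this (hypothetical) job record
```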


Concurrency and Computation: Practice and Experience | 2002

Innovations of the NetSolve Grid Computing System

Dorian C. Arnold; Henri Casanova; Jack J. Dongarra

The NetSolve Grid Computing System was first developed in the mid 1990s to provide users with seamless access to remote computational hardware and software resources. Since then, the system has benefited from many enhancements like security services, data management facilities and distributed storage infrastructures. This article is meant to provide the reader with details regarding the present state of the project, describing the current architecture of the system, its latest innovations and other systems that make use of the NetSolve infrastructure.


EuroMPI'11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011

libhashckpt: hash-based incremental checkpointing using GPU's

Kurt Brian Ferreira; Rolf Riesen; Ron Brightwell; Patrick G. Bridges; Dorian C. Arnold

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
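The underlying mechanism can be sketched in a few lines: hash each page of application state, compare against the hashes recorded at the previous checkpoint, and persist only the pages whose hashes changed. The CPU-side sketch below only illustrates the idea; libhashckpt itself combines page protection with hashing offloaded to GPUs, and the page size and hash function here are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity for this illustration

def page_hashes(data: bytes):
    """Hash each fixed-size page of the application state."""
    return [hashlib.sha1(data[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(data), PAGE_SIZE)]

def incremental_checkpoint(data: bytes, prev_hashes):
    """Return the dirty pages (index, contents) and the new hash table."""
    new_hashes = page_hashes(data)
    dirty = [(i, data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
             for i, h in enumerate(new_hashes)
             if i >= len(prev_hashes) or h != prev_hashes[i]]
    return dirty, new_hashes

# First checkpoint writes everything; after touching one page, only that page is rewritten.
state = bytearray(8 * PAGE_SIZE)
dirty, hashes = incremental_checkpoint(bytes(state), [])
print("pages written at checkpoint 1:", len(dirty))               # 8
state[5 * PAGE_SIZE] = 0xFF
dirty, hashes = incremental_checkpoint(bytes(state), hashes)
print("pages written at checkpoint 2:", [i for i, _ in dirty])    # [5]
```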


IEEE International Conference on High Performance Computing, Data and Analytics | 2008

Lessons learned at 208K: towards debugging millions of cores

Gregory L. Lee; Dong H. Ahn; Dorian C. Arnold; Bronis R. de Supinski; Matthew P. LeGendre; Barton P. Miller; Martin Schulz; Ben Liblit

Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.


International Conference on Parallel Processing | 2012

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

Dewan Ibtesham; Dorian C. Arnold; Patrick G. Bridges; Kurt Brian Ferreira; Ron Brightwell

The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme scale systems, (2) checkpoint compression viability scales with checkpoint size, (3) the choice of user-level versus system-level checkpointing has little impact on checkpoint compression viability, and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future generation extreme scale systems.
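The intuition behind such a viability model can be captured with a simple timing inequality: compression pays off when compressing plus writing the smaller checkpoint beats writing the full checkpoint. The formulation and the bandwidth and compression numbers below are illustrative assumptions, not the paper's exact model or measurements.

```python
def compression_viable(ckpt_gb, io_bw_gbps, compress_gbps, compression_factor):
    """Compare checkpoint commit time with and without compression."""
    t_uncompressed = ckpt_gb / io_bw_gbps
    t_compressed = ckpt_gb / compress_gbps + (ckpt_gb / compression_factor) / io_bw_gbps
    return t_compressed < t_uncompressed, t_uncompressed, t_compressed

# Assumed numbers: 1 TB of checkpoint state, 10 GB/s aggregate I/O bandwidth,
# 40 GB/s aggregate compression throughput, and 2x compressible checkpoint data.
viable, t_plain, t_comp = compression_viable(1024, 10, 40, 2.0)
print(f"uncompressed: {t_plain:.0f}s  compressed: {t_comp:.0f}s  viable: {viable}")
```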


International Workshop on Parallel Processing | 2000

The NetSolve environment: progressing towards the seamless grid

Dorian C. Arnold; Jack J. Dongarra

The NetSolve software project has matured into a robust system that reliably manages and interconnects disparate computational resources. Resource management and allocation policies, heterogeneity, fault-tolerance and security are some of the issues that need to be resolved in such environments. This article is meant to introduce the reader to the NetSolve system and offers a discussion of some of the key developments that have taken place in the project during recent months. To make the article coherent and somewhat self-contained, brief insight is given into some of the more fundamental aspects of NetSolve; the newer features and enhancements are given a more detailed discussion. For completeness, we also discuss successful uses and integrations of the NetSolve system.

Collaboration


Dive into Dorian C. Arnold's collaborations.

Top Co-Authors

Kurt Brian Ferreira (Sandia National Laboratories)
Barton P. Miller (University of Wisconsin-Madison)
Dong H. Ahn (Lawrence Livermore National Laboratory)
Gregory L. Lee (Lawrence Livermore National Laboratory)
Scott Levy (University of New Mexico)
Dewan Ibtesham (University of New Mexico)
Martin Schulz (Lawrence Livermore National Laboratory)