Publication


Featured research published by Christopher D. Carothers.


IEEE Conference on Mass Storage Systems and Technologies | 2012

On the role of burst buffers in leadership-class storage systems

Ning Liu; Jason Cope; Philip H. Carns; Christopher D. Carothers; Robert B. Ross; Gary Grider; Adam Crume; Carlos Maltzahn

The largest-scale high-performance computing (HPC) systems are stretching parallel file systems to their limits in terms of aggregate bandwidth and numbers of clients. To further sustain the scalability of these file systems, researchers and HPC storage architects are exploring various storage system designs. One proposed design integrates a tier of solid-state burst buffers into the storage system to absorb application I/O requests. In this paper, we simulate and explore this storage system design for use by large-scale HPC systems. First, we examine application I/O patterns on an existing large-scale HPC system to identify common burst patterns. Next, we describe enhancements to the CODES storage system simulator to enable our burst buffer simulations. These enhancements include the integration of a burst buffer model into the I/O forwarding layer of the simulator, the development of an I/O kernel description language and interpreter, the development of a suite of I/O kernels derived from observed I/O patterns, and fidelity improvements to the CODES models. We evaluate the I/O performance for a set of multi-application I/O workloads and burst buffer configurations. We show that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived throughput goal.
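
The trade-off the abstract describes can be seen with a back-of-the-envelope model: the buffer tier absorbs a write burst at SSD speed and drains it to external storage during the following compute phase. Below is a minimal sketch of that reasoning in Python; it is not the CODES simulator, and every parameter value is hypothetical.

```python
# Minimal sketch of the burst-buffer trade-off described above.
# All numbers are hypothetical, chosen only to illustrate the idea;
# the CODES simulator models this in far greater detail.

def perceived_write_time(burst_gib, buffer_bw_gibs):
    """Time the application waits: it only sees the burst-buffer tier."""
    return burst_gib / buffer_bw_gibs

def drain_time(burst_gib, external_bw_gibs):
    """Time to drain the burst from the buffer to external storage."""
    return burst_gib / external_bw_gibs

burst = 64.0            # GiB written in one checkpoint burst (hypothetical)
buffer_bw = 10.0        # GiB/s into the SSD burst-buffer tier (hypothetical)
external_bw = 1.0       # GiB/s to the parallel file system (hypothetical)
compute_phase = 120.0   # seconds between bursts (hypothetical)

t_app = perceived_write_time(burst, buffer_bw)    # 6.4 s seen by the app
t_drain = drain_time(burst, external_bw)          # 64 s, hidden if < compute_phase

print(f"application-perceived write time: {t_app:.1f} s")
print(f"drain time: {t_drain:.1f} s "
      f"({'hidden by' if t_drain < compute_phase else 'exceeds'} the compute phase)")
```

As long as the drain completes before the next burst arrives, the application sees only the buffer tier's bandwidth, which is why a modest external bandwidth can still meet an aggressive perceived-throughput goal.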


Workshop on Parallel and Distributed Simulation | 2000

ROSS: a high-performance, low memory, modular time warp system

Christopher D. Carothers; David W. Bauer; Shawn Pearce

We introduce a new Time Warp system called ROSS: Rensselaer's Optimistic Simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events per second when simulating a wireless telephone network model (PCS) on a quad-processor PC server. In a head-to-head comparison, we observe that ROSS outperforms the Georgia Tech Time Warp (GTW) system on the same computing platform by up to 180%. ROSS requires only a small, constant number of memory buffers beyond what the sequential simulation needs for a constant number of processors. The driving force behind these high-performance, low-memory-utilization results is the coupling of an efficient pointer-based implementation framework, Fujimoto's (1989) fast GVT algorithm for shared-memory multiprocessors, reverse computation, and the introduction of kernel processes (KPs). KPs lower fossil collection overheads by aggregating processed event lists. This allows fossil collection to be done with greater frequency, thus lowering the overall memory necessary to sustain stable, efficient parallel execution.


Journal of Parallel and Distributed Computing | 2002

ROSS: A high-performance, low-memory, modular Time Warp system

Christopher D. Carothers; David W. Bauer; Shawn Pearce

In this paper, we introduce a new Time Warp system called ROSS: Rensselaer's Optimistic Simulation System. ROSS is an extremely modular kernel that is capable of achieving event rates as high as 1,250,000 events per second when simulating a wireless telephone network model (PCS) on a quad-processor PC server. In a head-to-head comparison, we observe that ROSS outperforms the Georgia Tech Time Warp (GTW) system by up to 180% on a quad-processor PC server and up to 200% on the SGI Origin 2000. ROSS requires only a small, constant number of memory buffers beyond what the sequential simulation needs for a constant number of processors. ROSS demonstrates for the first time that stable, highly efficient execution using little memory above what the sequential model would require is possible for low-event-granularity simulation models. The driving force behind these high-performance, low-memory-utilization results is the coupling of an efficient pointer-based implementation framework, Fujimoto's fast GVT algorithm for shared-memory multiprocessors, reverse computation, and the introduction of kernel processes (KPs). KPs lower fossil collection overheads by aggregating processed event lists. This allows fossil collection to be done with greater frequency, thus lowering the overall memory necessary to sustain stable, efficient parallel execution. These characteristics make ROSS an ideal system for use in large-scale networking simulation models. The principal conclusion drawn from this study is that the performance of an optimistic simulator is largely determined by its memory usage.
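
A minimal sketch of the kernel-process idea described above, assuming a simple event representation: several LPs share one processed-event list, so fossil collection after a GVT advance walks one aggregated list per KP instead of one list per LP. This illustrates the concept only; it is not ROSS's actual C implementation.

```python
# Hedged sketch of kernel processes (KPs). Events are (timestamp, lp_id)
# tuples, committed in timestamp order for simplicity.

from collections import deque

class KernelProcess:
    def __init__(self, lp_ids):
        self.lp_ids = lp_ids
        self.processed = deque()   # events of ALL member LPs, in commit order

    def commit_event(self, event):
        self.processed.append(event)

    def fossil_collect(self, gvt):
        """Reclaim events older than GVT; they can never be rolled back."""
        reclaimed = 0
        while self.processed and self.processed[0][0] < gvt:
            self.processed.popleft()
            reclaimed += 1
        return reclaimed

# Example: one KP aggregating three LPs. A single pass reclaims the
# fossils of every member LP, instead of one pass per LP.
kp = KernelProcess(lp_ids=[0, 1, 2])
for ts, lp in [(1.0, 0), (1.5, 2), (2.0, 1), (3.5, 0)]:
    kp.commit_event((ts, lp))
print(kp.fossil_collect(gvt=2.5))  # -> 3 events reclaimed in one pass
```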


Workshop on Parallel and Distributed Simulation | 1995

A case study in simulating PCS networks using Time Warp

Christopher D. Carothers; Richard M. Fujimoto; Yi-Bing Lin

There has been rapid growth in the demand for mobile communications over the past few years. This has led to intensive research and development efforts for complex PCS (personal communication service) networks. Capacity planning and performance modeling are necessary to maintain a high quality of service to the mobile subscriber while minimizing cost to the PCS provider. The need for flexible analysis tools and the high computational requirements of large PCS network simulations make them an excellent candidate for parallel simulation. Here, we describe our experiences in developing two PCS simulation models on a general-purpose distributed simulation platform based on the Time Warp mechanism. These models utilize two widely used approaches to simulating PCS networks: (i) the call-initiated and (ii) the portable-initiated models. We discuss design decisions that were made in mapping these models to the Time Warp executive, and characterize the workloads resulting from these models in terms of factors such as communication locality and computation granularity. We then present performance measurements for their execution on a network of workstations. These measurements indicate that the call-initiated model generally outperforms the portable-initiated model, but is not able to capture phenomena such as the "busy line" effect. Moreover, these studies indicate that the high locality in large-scale PCS network simulations makes them well suited for execution on general-purpose parallel and distributed simulation platforms.
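
The difference between the two modelling styles can be sketched as follows. The code is illustrative only, with hypothetical names and rates; it is not the paper's Time Warp implementation.

```python
# Hedged sketch contrasting the two PCS modelling styles named above.
# Call-initiated: calls are generated independently at each cell, so no
# per-portable state exists and a "busy line" (a call arriving while the
# portable is already on a call) cannot be represented.
# Portable-initiated: each portable is a persistent entity, so the busy
# line effect falls out naturally.

import random

def call_initiated_arrivals(cell_rate, horizon):
    """Poisson call arrivals at one cell; no per-portable state kept."""
    t, calls = 0.0, []
    while True:
        t += random.expovariate(cell_rate)
        if t > horizon:
            return calls
        calls.append(t)

class Portable:
    """Portable-initiated style: per-mobile state enables busy-line detection."""
    def __init__(self):
        self.busy_until = 0.0

    def try_call(self, t, duration):
        if t < self.busy_until:
            return "busy line"       # the effect the call-initiated model misses
        self.busy_until = t + duration
        return "connected"

p = Portable()
print(p.try_call(0.0, 3.0), "/", p.try_call(1.0, 3.0))  # connected / busy line
```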


Principles of Advanced Discrete Simulation | 2013

Warp speed: executing time warp on 1,966,080 cores

Peter D. Barnes; Christopher D. Carothers; David R. Jefferson; Justin M. LaPre

Time Warp is an optimistic synchronization protocol for parallel discrete event simulation that coordinates the available parallelism through its rollback and antimessage mechanisms. In this paper we present the results of a strong-scaling study of the ROSS simulator running Time Warp with reverse computation and executing the well-known PHOLD benchmark on Lawrence Livermore National Laboratory's Sequoia Blue Gene/Q supercomputer. The benchmark has 251 million PHOLD logical processes and was executed in several configurations up to a peak of 7.86 million MPI tasks running on 1,966,080 cores. At the largest scale it processed 33 trillion events in 65 seconds, yielding a sustained speed of 504 billion events/second using 120 racks of Sequoia. This is by far the highest event rate reported by any parallel discrete event simulation to date, whether running PHOLD or any other benchmark. Additionally, we believe it is likely to be the largest number of MPI tasks ever used in any computation of any kind to date. ROSS exhibited a super-linear speedup throughout the strong-scaling study, with more than a 97x speed improvement from scaling the number of cores by only 60x (from 32,768 to 1,966,080). We attribute this to significant cache-related performance acceleration as we moved to higher scales with fewer LPs per core. Prompted by historical performance results, we propose a new long-term performance metric called Warp Speed that grows logarithmically with the PHOLD event rate. As we define it, our maximum speed of 504 billion PHOLD events/sec corresponds to Warp 2.7. We suggest that the results described here are significant because they demonstrate that direct simulation of planetary-scale discrete event models is now, in principle at least, within reach.
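
The Warp Speed figure can be sanity-checked from the numbers quoted in the abstract. A definition consistent with them is W = log10(R) - 9, where R is the sustained PHOLD event rate in events per second; treat that exact formula as an assumption here rather than the paper's verbatim definition.

```python
# Sanity check of the Warp Speed figure quoted above. The formula below
# is an assumption consistent with the abstract's numbers: Warp 1.0 at
# 10 billion events/s, +1 per factor-of-10 increase in event rate.

import math

def warp_speed(events_per_second):
    return math.log10(events_per_second) - 9.0

print(f"Warp {warp_speed(504e9):.2f}")   # 504 billion events/s -> Warp 2.70
```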


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 1994

Distributed simulation of large-scale PCS networks

Christopher D. Carothers; Richard M. Fujimoto; Yi-Bing Lin; Paul England

There has been rapid growth in demand for mobile communications over the past few years, which has led to intensive research and development of complex PCS (personal communication service) networks. Capacity planning and performance modeling are necessary to maintain a high quality of service to the mobile subscriber while minimizing cost. Simulation is widely used in such studies; however, because these models are extremely time consuming to execute, only small-scale PCS networks have previously been simulated. In this paper, we examine the use of the Time Warp distributed simulation mechanism in simulating large-scale (1,024 or more cells) PCS networks. An object-oriented, distributed, discrete event simulator using Time Warp has been developed, and initial performance measurements completed. Speedups in the range of 2.8 to 7.8 using 8 Unix workstations have been obtained, enabling simulation runs that require 20 hours on a single workstation to be completed in only 3.5 hours.


Workshop on Parallel and Distributed Simulation | 1999

Efficient optimistic parallel simulations using reverse computation

Christopher D. Carothers; Kalyan S. Perumalla; Richard M. Fujimoto

In optimistic parallel simulations, state-saving techniques have been traditionally used to realize rollback. We propose reverse computation as an alternative approach, and compare its execution performance against that of state-saving. Using compiler techniques, we describe an approach to automatically generate reversible computations, and to optimize them to transparently reap the performance benefits of reverse computation. For certain fine-grain models, such as queuing network models, we show that reverse computation can yield significant improvement in execution speed coupled with significant reduction in memory utilization, as compared to traditional state-saving. On sample models using reverse computation, we observe as much as three-fold improvement in execution speed over traditional state-saving.
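
A minimal sketch of the contrast between state-saving and reverse computation, using a toy queue model: the reverse handler algebraically undoes the forward event, so rollback needs no state snapshot. Illustrative only; ROSS implements this in C with compiler assistance.

```python
# Hedged sketch of reverse computation versus state-saving for rollback.

class Queue:
    def __init__(self):
        self.length = 0

# --- state-saving: snapshot before each event, restore on rollback ---
def arrive_with_state_saving(q, saved):
    saved.append(q.length)   # copy state (costly in memory for large states)
    q.length += 1

def rollback_state_saving(q, saved):
    q.length = saved.pop()   # restore the snapshot

# --- reverse computation: store nothing, run the inverse on rollback ---
def arrive(q):
    q.length += 1            # forward event handler

def reverse_arrive(q):
    q.length -= 1            # exact inverse of the forward handler

q = Queue()
arrive(q); arrive(q)         # optimistically process two events
reverse_arrive(q)            # a straggler forces one rollback: no snapshot needed
print(q.length)              # -> 1
```

For fine-grain models like this, the inverse is a few arithmetic operations, which is why reverse computation can beat state-saving on both speed and memory.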


ACM Special Interest Group on Data Communication | 2003

Large-scale network simulation techniques: examples of TCP and OSPF models

Garrett R. Yaun; David W. Bauer; Harshad L. Bhutada; Christopher D. Carothers; Murat Yuksel; Shivkumar Kalyanaraman

Simulation of large-scale networks remains a challenge, even though various network simulators are in place. In this paper, we identify fundamental issues for large-scale network simulation and propose new techniques that address them. First, we exploit optimistic parallel simulation techniques to enable fast execution on inexpensive hyper-threaded, multiprocessor systems. Second, we provide a compact, lightweight implementation framework that greatly reduces the amount of state required to simulate large-scale network models. Based on the proposed techniques, we provide sample simulation models for two networking protocols: TCP and OSPF. We implement these models in ROSSNet, a simulation environment that extends the previously developed optimistic simulator ROSS. We perform validation experiments for TCP and OSPF and present performance results of our techniques by simulating OSPF and TCP on a large and realistic topology, such as AT&T's US network derived from Rocketfuel data. The end result of these innovations is that we are able to simulate million-node network topologies using inexpensive commercial off-the-shelf hyper-threaded multiprocessor systems while consuming less than 1.4 GB of RAM in total.
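
The memory claim implies a tight per-node budget, which a quick arithmetic check makes concrete:

```python
# Arithmetic behind the memory claim above: a million-node topology in
# under 1.4 GB of RAM leaves roughly 1.5 KB of simulator state per node,
# which is why the compact implementation framework matters.

nodes = 1_000_000
total_bytes = 1.4 * 2**30          # 1.4 GB of RAM in total
print(f"{total_bytes / nodes:.0f} bytes per simulated node")  # ~1503 bytes
```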


Journal of Parallel and Distributed Computing | 2009

An analysis of clustered failures on large supercomputing systems

Thomas J. Hacker; Fabian Romero; Christopher D. Carothers

Large supercomputers are built today using thousands of commodity components, and suffer from poor reliability due to frequent component failures. The characteristics of failure observed on large-scale systems differ from smaller scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures for large-scale systems is a critical step in building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing rates of failure on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.
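
As a hedged illustration of the kind of model the abstract mentions, the sketch below computes failure probabilities from a fitted time-between-failures distribution. A Weibull model is a common choice in such analyses; the paper's actual fitted model is not reproduced here, and all parameter values below are hypothetical.

```python
# Hedged sketch: predicting node-failure probability from a fitted
# time-between-failures distribution. Weibull is a common choice; the
# shape and scale values here are hypothetical, not the paper's fit.

import math

def p_fail_within(t_hours, shape, scale):
    """Weibull CDF: probability that a node fails within t_hours."""
    return 1.0 - math.exp(-((t_hours / scale) ** shape))

shape, scale = 0.7, 5000.0      # hypothetical fit (shape < 1: failures cluster early)
job_hours, job_nodes = 12.0, 4096

p_node = p_fail_within(job_hours, shape, scale)
p_job = 1.0 - (1.0 - p_node) ** job_nodes   # job fails if ANY node fails
print(f"per-node: {p_node:.4f}, {job_nodes}-node job: {p_job:.3f}")
# Even a ~1% per-node probability makes failure of a large job nearly
# certain, which is why per-node rates matter so much at scale.
```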


Workshop on Parallel and Distributed Simulation | 1996

Background execution of time warp programs

Christopher D. Carothers; Richard M. Fujimoto

A load distribution system is proposed to enable a single Time Warp program to execute in background, spreading over a collection of possibly heterogeneous workstations (including multiprocessor hosts), utilizing whatever otherwise unused CPU cycles are available. The system uses a simple processor allocation policy to dynamically add or delete hosts from the set of processors utilized by the Time Warp program during its execution. A load balancing algorithm is used that allocates logical processes (LPs) to processors, taking into account other computations executing on the host from the system or other user applications. A clustering mechanism is used to group collections of logical processes together, reducing process migration overheads and helping to retain locality of communication for simulations containing large numbers of LPs. An initial, prototype implementation of the load distribution system is described that executes on a homogeneous network of Silicon Graphics workstations. Initial experiments indicate this approach shows promise in enabling efficient execution of Time Warp programs "in background" on distributed computing platforms.
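
A minimal sketch of the placement idea described above, under simplifying assumptions: LPs are grouped into clusters to preserve communication locality and reduce migration overhead, and each cluster is placed on the host with the most spare capacity. The greedy policy and all names below are hypothetical, not the paper's actual algorithm.

```python
# Hedged sketch: greedy placement of LP clusters onto workstations that
# already carry background load from other users and system processes.

def place_clusters(clusters, hosts):
    """Place the heaviest cluster first, onto the least-loaded host.

    clusters: {cluster_id: workload}, hosts: {host_name: background_load}
    Returns {cluster_id: host_name}.
    """
    load = dict(hosts)                       # start from external background load
    placement = {}
    for cid, work in sorted(clusters.items(), key=lambda kv: -kv[1]):
        host = min(load, key=load.get)       # least-loaded host right now
        placement[cid] = host
        load[host] += work
    return placement

# Example: three LP clusters across two workstations, one already busy.
print(place_clusters({"A": 5.0, "B": 3.0, "C": 2.0},
                     {"ws1": 0.0, "ws2": 4.0}))
# -> {'A': 'ws1', 'B': 'ws2', 'C': 'ws1'}  (roughly balances total load)
```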

Collaboration


Dive into Christopher D. Carothers's collaboration network.

Top Co-Authors

Robert B. Ross (Argonne National Laboratory)
Misbah Mubarak (Argonne National Laboratory)
Richard M. Fujimoto (Georgia Institute of Technology)
Philip H. Carns (Argonne National Laboratory)
David W. Bauer (Rensselaer Polytechnic Institute)
Boleslaw K. Szymanski (Rensselaer Polytechnic Institute)
Garrett R. Yaun (Rensselaer Polytechnic Institute)
Mark S. Shephard (Rensselaer Polytechnic Institute)
Justin M. LaPre (Rensselaer Polytechnic Institute)
Shivkumar Kalyanaraman (Rensselaer Polytechnic Institute)