Philip H. Carns
Argonne National Laboratory
Publications
Featured research published by Philip H. Carns.
IEEE Conference on Mass Storage Systems and Technologies | 2012
Ning Liu; Jason Cope; Philip H. Carns; Christopher D. Carothers; Robert B. Ross; Gary Grider; Adam Crume; Carlos Maltzahn
The largest-scale high-performance computing (HPC) systems are stretching parallel file systems to their limits in terms of aggregate bandwidth and numbers of clients. To further sustain the scalability of these file systems, researchers and HPC storage architects are exploring various storage system designs. One proposed design integrates a tier of solid-state burst buffers into the storage system to absorb application I/O requests. In this paper, we simulate and explore this storage system design for use by large-scale HPC systems. First, we examine application I/O patterns on an existing large-scale HPC system to identify common burst patterns. Next, we describe enhancements to the CODES storage system simulator to enable our burst buffer simulations. These enhancements include the integration of a burst buffer model into the I/O forwarding layer of the simulator, the development of an I/O kernel description language and interpreter, the development of a suite of I/O kernels derived from observed I/O patterns, and fidelity improvements to the CODES models. We evaluate the I/O performance for a set of multiapplication I/O workloads and burst buffer configurations. We show that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived throughput goal.
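The abstract does not reproduce the CODES I/O kernel description language itself, so the sketch below is only a rough illustration of the general idea: a tiny interpreter that replays a hypothetical two-verb trace ("write <bytes>", "compute <seconds>") against a scratch file, standing in for the kind of replayable I/O kernel the paper derives from observed burst patterns. The trace format and all names are assumptions, not the actual CODES language.

```c
/*
 * Minimal sketch of an I/O "kernel" replayer, loosely in the spirit of the
 * I/O kernel description language mentioned in the abstract.  The two-verb
 * trace format ("write <bytes>", "compute <seconds>") is a hypothetical
 * illustration, not the actual CODES language.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <kernel.txt> <output-file>\n", argv[0]);
        return 1;
    }

    FILE *kernel = fopen(argv[1], "r");
    FILE *out    = fopen(argv[2], "w");
    if (!kernel || !out) {
        perror("fopen");
        return 1;
    }

    char verb[32];
    long arg;
    while (fscanf(kernel, "%31s %ld", verb, &arg) == 2) {
        if (strcmp(verb, "write") == 0) {
            /* emit a burst of <arg> bytes, standing in for an I/O burst */
            char *buf = calloc(1, (size_t)arg);
            fwrite(buf, 1, (size_t)arg, out);
            fflush(out);
            free(buf);
        } else if (strcmp(verb, "compute") == 0) {
            /* idle for <arg> seconds, standing in for a compute phase */
            sleep((unsigned)arg);
        } else {
            fprintf(stderr, "unknown verb: %s\n", verb);
        }
    }

    fclose(kernel);
    fclose(out);
    return 0;
}
```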
International Conference on Cluster Computing | 2009
Nawab Ali; Philip H. Carns; Kamil Iskra; Dries Kimpe; Samuel Lang; Robert Latham; Robert B. Ross; Lee Ward; P. Sadayappan
Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. While Moore's law ensures that the computational power of high-performance computing systems increases with every generation, the same is not true for their I/O subsystems. The scalability challenges faced by existing parallel file systems with respect to the increasing number of clients, coupled with the minimalistic compute node kernels running on these machines, call for a new I/O paradigm to meet the requirements of data-intensive scientific applications. I/O forwarding is a technique that attempts to bridge the increasing performance and scalability gap between the compute and I/O components of leadership-class machines by shipping I/O calls from compute nodes to dedicated I/O nodes. The I/O nodes perform operations on behalf of the compute nodes and can reduce file system traffic by aggregating, rescheduling, and caching I/O requests. This paper presents an open, scalable I/O forwarding framework for high-performance computing systems. We describe an I/O protocol and API for shipping function calls from compute nodes to I/O nodes, and we present a quantitative analysis of the overhead associated with I/O forwarding.
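As a rough illustration of the function-shipping idea described above, the sketch below packs a write call into a request message of the kind an I/O forwarding layer might send from a compute node to an I/O node. The message layout, field names, and opcodes are assumptions for illustration, not the wire format of the framework presented in the paper.

```c
/*
 * Sketch of the kind of request a function-shipping I/O protocol might carry
 * from a compute node to an I/O node.  The message layout and field names are
 * illustrative assumptions, not the protocol described in the paper.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum io_op { IOFWD_OPEN = 1, IOFWD_WRITE = 2, IOFWD_CLOSE = 3 };

struct iofwd_request {
    uint32_t op;        /* which I/O call is being shipped          */
    uint64_t handle;    /* file handle previously returned by OPEN  */
    uint64_t offset;    /* file offset for READ/WRITE               */
    uint64_t length;    /* payload length that follows this header  */
};

/* Pack a write call into a contiguous buffer: header first, then data. */
static size_t pack_write(char *buf, uint64_t handle, uint64_t offset,
                         const void *data, uint64_t length)
{
    struct iofwd_request req = { IOFWD_WRITE, handle, offset, length };
    memcpy(buf, &req, sizeof(req));
    memcpy(buf + sizeof(req), data, length);
    return sizeof(req) + length;
}

int main(void)
{
    char msg[256];
    const char payload[] = "hello, I/O node";

    size_t n = pack_write(msg, 42, 0, payload, sizeof(payload));

    /* On a real system this buffer would be sent to the I/O node, which would
     * unpack the header and issue the write against the parallel file system
     * on the compute node's behalf. */
    struct iofwd_request hdr;
    memcpy(&hdr, msg, sizeof(hdr));
    printf("shipped op=%u handle=%llu len=%llu (total %zu bytes)\n",
           (unsigned)hdr.op, (unsigned long long)hdr.handle,
           (unsigned long long)hdr.length, n);
    return 0;
}
```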
IEEE International Conference on High Performance Computing, Data, and Analytics | 2009
Samuel Lang; Philip H. Carns; Robert Latham; Robert B. Ross; Kevin Harms; William E. Allcock
Today's top high-performance computing systems run applications with hundreds of thousands of processes, contain hundreds of storage nodes, and must meet massive I/O requirements for capacity and performance. These leadership-class systems face daunting challenges in deploying scalable I/O systems. In this paper we present a case study of the I/O challenges to performance and scalability on Intrepid, the IBM Blue Gene/P system at the Argonne Leadership Computing Facility. Listed among the top five fastest supercomputers of 2008, Intrepid runs computational science applications with intensive demands on the I/O system. We show that Intrepid's file and storage systems sustain high performance under varying workloads as the applications scale with the number of processes.
International Conference on Cluster Computing | 2009
Philip H. Carns; Robert Latham; Robert B. Ross; Kamil Iskra; Samuel Lang; Katherine Riley
Developing and tuning computational science applications to run on extreme-scale systems are increasingly complicated processes. Challenges such as managing memory access and tuning message-passing behavior are made easier by tools designed specifically to aid in these processes. Tools that can help users better understand the behavior of their application with respect to I/O have not yet reached the level of utility necessary to play a central role in application development and tuning. This deficiency in the tool set means that we have a poor understanding of how specific applications interact with storage. Worse, the community has little knowledge of what sorts of access patterns are common in today's applications, leading to confusion in the storage research community as to the pressing needs of the computational science community. This paper describes the Darshan I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with the minimum possible overhead. This characterization can shed important light on the I/O behavior of applications at extreme scale. Darshan can also enable researchers to gain greater insight into the overall patterns of access exhibited by such applications, helping the storage community to understand how to best serve current computational science applications and better predict the needs of future applications. In this work we demonstrate Darshan's ability to characterize the I/O behavior of four scientific applications and show that it induces negligible overhead for I/O-intensive jobs with as many as 65,536 processes.
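Darshan's actual instrumentation and counter set are not reproduced here; the sketch below only illustrates the general counter-based interception approach, assuming a GNU toolchain and link-time wrapping of write(). The counter names and the end-of-run report are hypothetical.

```c
/*
 * Sketch of counter-based I/O interception: wrap write() at link time and
 * accumulate per-process statistics, deferring reporting until the process
 * exits so the runtime overhead stays small.  Counter names are illustrative,
 * not Darshan's actual counter set.  Assumes a GNU toolchain; build with:
 *
 *     gcc -Wl,--wrap=write app.c wrap_write.c
 */
#include <stdio.h>
#include <sys/types.h>

ssize_t __real_write(int fd, const void *buf, size_t count);

static unsigned long write_calls;       /* number of write() calls observed */
static unsigned long long write_bytes;  /* total bytes passed to write()    */
static size_t max_write;                /* largest single access size seen  */

ssize_t __wrap_write(int fd, const void *buf, size_t count)
{
    write_calls++;
    write_bytes += count;
    if (count > max_write)
        max_write = count;
    return __real_write(fd, buf, count);
}

/* Dump the counters once, at process exit, rather than on every call. */
__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "writes=%lu bytes=%llu max_access=%zu\n",
            write_calls, write_bytes, max_write);
}
```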
ACM Transactions on Storage | 2011
Philip H. Carns; Kevin Harms; William E. Allcock; Charles Bacon; Samuel Lang; Robert Latham; Robert B. Ross
Computational science applications are driving a demand for increasingly powerful storage systems. While many techniques are available for capturing the I/O behavior of individual application trial runs and specific components of the storage system, continuous characterization of a production system remains a daunting challenge for systems with hundreds of thousands of compute cores and multiple petabytes of storage. As a result, these storage systems are often designed without a clear understanding of the diverse computational science workloads they will support.
International Parallel and Distributed Processing Symposium | 2009
Philip H. Carns; Samuel Lang; Robert B. Ross; Murali Vilayannur; Julian M. Kunkel; Thomas Ludwig
Today's computational science demands have resulted in ever-larger parallel computers, and storage systems have grown to match these demands. Parallel file systems used in this environment are increasingly specialized to extract the highest possible performance for large I/O operations, at the expense of other potential workloads. While some applications have adapted to I/O best practices and can obtain good performance on these systems, the natural I/O patterns of many applications result in generation of many small files. These applications are not well served by current parallel file systems at very large scale. This paper describes five techniques for optimizing small-file access in parallel file systems for very large scale systems. These five techniques are all implemented in a single parallel file system (PVFS) and then systematically assessed on two test platforms. A microbenchmark and the mdtest benchmark are used to evaluate the optimizations at an unprecedented scale. We observe as much as a 905% improvement in small-file create rates, 1,106% improvement in small-file stat rates, and 727% improvement in small-file removal rates, compared to a baseline PVFS configuration on a leadership computing platform using 16,384 cores.
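The paper's microbenchmark and mdtest runs are not reproduced here; as a simplified stand-in, the MPI sketch below times create, stat, and remove operations on many small files per rank and reports aggregate rates, which is the style of measurement behind the improvement figures quoted above. File naming and counts are illustrative assumptions.

```c
/*
 * Simplified, mdtest-like microbenchmark: each MPI rank creates, stats, and
 * removes a set of small files in the working directory, and aggregate rates
 * are reported.  A stand-in for illustration only, not the paper's benchmark.
 */
#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define FILES_PER_RANK 100

static double timed_phase(int rank, int nfiles, int phase)
{
    char path[256];
    struct stat sb;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "bench.%d.%d", rank, i);
        if (phase == 0)
            close(open(path, O_CREAT | O_WRONLY, 0644)); /* create */
        else if (phase == 1)
            stat(path, &sb);                             /* stat   */
        else
            unlink(path);                                /* remove */
    }

    MPI_Barrier(MPI_COMM_WORLD);
    return MPI_Wtime() - start;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const char *names[] = { "create", "stat", "remove" };
    for (int phase = 0; phase < 3; phase++) {
        double t = timed_phase(rank, FILES_PER_RANK, phase);
        if (rank == 0)
            printf("%s: %.0f ops/s\n", names[phase],
                   (double)nprocs * FILES_PER_RANK / t);
    }

    MPI_Finalize();
    return 0;
}
```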
International Conference on Big Data | 2014
Dongfang Zhao; Zhao Zhang; Xiaobing Zhou; Tonglin Li; Ke Wang; Dries Kimpe; Philip H. Carns; Robert B. Ross; Ioan Raicu
The state-of-the-art, yet decades-old, architecture of high-performance computing systems keeps its compute and storage resources separated. It is therefore limited for modern data-intensive scientific applications, because every I/O operation must be transferred over the network between the compute and storage resources. In this paper we propose an architecture that has a distributed storage layer local to the compute nodes. This layer is responsible for most of the I/O operations and avoids a large amount of data movement between compute and storage resources. We have designed and implemented a system prototype of this architecture - which we call the FusionFS distributed file system - to support metadata-intensive and write-intensive operations, both of which are critical to the I/O performance of scientific applications. FusionFS has been deployed and evaluated on up to 16K compute nodes of an IBM Blue Gene/P supercomputer, showing more than an order of magnitude performance improvement over other popular file systems such as GPFS, PVFS, and HDFS.
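FusionFS internals are not described in the abstract, so the toy sketch below only illustrates the general "write locally, resolve globally" idea of a compute-node-local storage layer: data lands in node-local storage while a metadata table maps logical names to owning nodes. The in-memory table stands in for a distributed metadata service, and every name and structure here is an assumption, not FusionFS code.

```c
/*
 * Toy illustration of a compute-node-local storage layer: writes land in
 * node-local storage, and a metadata table records which node owns each
 * logical file.  The in-memory array stands in for a distributed metadata
 * service; all paths and structures are assumptions for illustration.
 */
#include <stdio.h>
#include <string.h>

#define MAX_FILES 128

struct meta_entry {
    char logical_name[64];  /* global, user-visible file name          */
    int  owner_node;        /* node whose local storage holds the data */
};

static struct meta_entry table[MAX_FILES];
static int nfiles;

/* Write data to this node's local storage and record ownership. */
static int write_local(int my_node, const char *name,
                       const void *buf, size_t len)
{
    char local_path[128];
    snprintf(local_path, sizeof(local_path), "/tmp/node%d.%s", my_node, name);

    FILE *f = fopen(local_path, "wb");
    if (!f)
        return -1;
    fwrite(buf, 1, len, f);
    fclose(f);

    snprintf(table[nfiles].logical_name,
             sizeof(table[nfiles].logical_name), "%s", name);
    table[nfiles].owner_node = my_node;
    nfiles++;
    return 0;
}

/* Resolve which node holds a logical file (a distributed lookup in practice). */
static int lookup_owner(const char *name)
{
    for (int i = 0; i < nfiles; i++)
        if (strcmp(table[i].logical_name, name) == 0)
            return table[i].owner_node;
    return -1;
}

int main(void)
{
    const char data[] = "checkpoint block";
    write_local(3, "ckpt.0001", data, sizeof(data));
    printf("ckpt.0001 is held by node %d\n", lookup_owner("ckpt.0001"));
    return 0;
}
```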
IEEE Conference on Mass Storage Systems and Technologies | 2010
Seung Woo Son; Samuel Lang; Philip H. Carns; Robert B. Ross; Rajeev Thakur; Berkin Özisikyilmaz; Prabhat Kumar; Wei-keng Liao; Alok N. Choudhary
As data sizes continue to increase, the concept of active storage is well suited to many data analysis kernels. Nevertheless, while this concept has been investigated and deployed in a number of forms, enabling it from the parallel I/O software stack has been largely unexplored. In this paper, we propose and evaluate an active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface. In our proposed scheme, common analysis kernels are embedded in parallel file systems. We expose the semantics of these kernels to parallel file systems through an enhanced runtime interface so that execution of embedded kernels is possible on the server. In order to allow complete server-side operations without file format or layout manipulation, our scheme adjusts the file I/O buffer to the computational unit boundary on the fly. Our scheme also uses server-side collective communication primitives for reduction and aggregation using interserver communication. We have implemented a prototype of our active storage system and demonstrate its benefits using four data analysis benchmarks. Our experimental results show that our proposed system improves the overall performance of all four benchmarks by 50.9% on average and that the compute-intensive portion of the k-means clustering kernel can be improved by 58.4% through GPU offloading when executed with a larger computational load. We also show that our scheme consistently outperforms the traditional storage model with a wide variety of input dataset sizes, numbers of nodes, and computational loads.
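As a rough sketch of the server-side execution idea, the code below runs a small reduction kernel directly over a file region and returns only a scalar, so raw data never crosses the network; the request is aligned to whole elements, echoing the paper's adjustment of I/O buffers to computational unit boundaries. The function name and interface are assumptions, not the enhanced runtime interface described in the paper.

```c
/*
 * Sketch of a server-side analysis kernel in the spirit of active storage:
 * the storage server reduces a file region locally and ships back only the
 * result.  The interface is an illustrative assumption.
 */
#include <stdio.h>
#include <stdlib.h>

/* Sum `count` doubles starting at element index `first` of a data file. */
static int server_sum(const char *path, size_t first, size_t count,
                      double *result)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    /* Align the read to whole elements (the "computational unit"). */
    if (fseek(f, (long)(first * sizeof(double)), SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }

    double sum = 0.0, x;
    size_t read_ok = 0;
    while (read_ok < count && fread(&x, sizeof(x), 1, f) == 1) {
        sum += x;
        read_ok++;
    }
    fclose(f);

    *result = sum;   /* only this scalar would cross the network */
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <first> <count>\n", argv[0]);
        return 1;
    }
    double sum;
    if (server_sum(argv[1], strtoul(argv[2], NULL, 10),
                   strtoul(argv[3], NULL, 10), &sum) == 0)
        printf("sum = %g\n", sum);
    return 0;
}
```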
IEEE Conference on Mass Storage Systems and Technologies | 2011
Philip H. Carns; Kevin Harms; William E. Allcock; Charles Bacon; Samuel Lang; Robert Latham; Robert B. Ross
Computational science applications are driving a demand for increasingly powerful storage systems. While many techniques are available for capturing the I/O behavior of individual application trial runs and specific components of the storage system, continuous characterization of a production system remains a daunting challenge for systems with hundreds of thousands of compute cores and multiple petabytes of storage. As a result, these storage systems are often designed without a clear understanding of the diverse computational science workloads they will support.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Misbah Mubarak; Christopher D. Carothers; Robert B. Ross; Philip H. Carns
A low-latency and low-diameter interconnection network will be an important component of future exascale architectures. The dragonfly network topology, a two-level directly connected network, is a candidate for exascale architectures because of its low diameter and reduced latency. To date, small-scale simulations with a few thousand nodes have been carried out to examine the dragonfly topology. However, future exascale machines will have millions of cores and up to 1 million nodes. In this paper, we focus on the modeling and simulation of large-scale dragonfly networks using the Rensselaer Optimistic Simulation System (ROSS). We validate the results of our model against the cycle-accurate simulator “booksim”. We also compare the performance of booksim and ROSS for the dragonfly network model at modest scales. We demonstrate the performance of ROSS on both the Blue Gene/P and Blue Gene/Q systems on a dragonfly model with up to 50 million nodes, showing a peak event rate of 1.33 billion events/second and a total of 872 billion committed events. The dragonfly network model for million-node configurations strongly scales when going from 1,024 to 65,536 MPI tasks on IBM Blue Gene/P and IBM Blue Gene/Q systems. We also explore a variety of ROSS tuning parameters to get optimal results with the dragonfly network model.
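For readers unfamiliar with the topology, the sketch below sizes a balanced dragonfly using the standard relations from the original topology proposal (p terminals per router, a = 2p routers per group, h = p global links per router, and a*h + 1 groups). The parameter values are illustrative only; they are not the configurations simulated with ROSS in the paper.

```c
/*
 * Dragonfly sizing helper: given the per-router parameters of a balanced
 * dragonfly (p terminals, a routers per group, h global links per router,
 * with a = 2p = 2h), compute how many groups, routers, and terminals the
 * network contains.  Example values are illustrative.
 */
#include <stdio.h>

int main(void)
{
    /* terminals per router; a and h follow from the balanced condition */
    long p_values[] = { 2, 4, 8, 16 };

    printf("%6s %6s %6s %8s %10s %12s\n",
           "p", "a", "h", "groups", "routers", "terminals");

    for (int i = 0; i < 4; i++) {
        long p = p_values[i];
        long a = 2 * p;          /* routers per group                     */
        long h = p;              /* global links per router               */
        long groups = a * h + 1; /* each group reaches every other group  */
        long routers = groups * a;
        long terminals = routers * p;
        printf("%6ld %6ld %6ld %8ld %10ld %12ld\n",
               p, a, h, groups, routers, terminals);
    }
    return 0;
}
```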