Jeremy Logan
Oak Ridge National Laboratory
Publication
Featured research published by Jeremy Logan.
Concurrency and Computation: Practice and Experience | 2014
Qing Liu; Jeremy Logan; Yuan Tian; Hasan Abbasi; Norbert Podhorszki; Jong Youl Choi; Scott Klasky; Roselyne Tchoua; Jay F. Lofstead; Ron A. Oldfield; Manish Parashar; Nagiza F. Samatova; Karsten Schwan; Arie Shoshani; Matthew Wolf; Kesheng Wu; Weikuan Yu
Applications running on leadership platforms are increasingly bottlenecked by storage input/output (I/O). In an effort to combat the increasing disparity between I/O throughput and compute capability, we created the Adaptable IO System (ADIOS) in 2005. Focusing on putting users first with a service-oriented architecture, we combined cutting-edge research into new I/O techniques with a design effort to create near-optimal I/O methods. As a result, ADIOS provides the highest level of synchronous I/O performance for a number of mission-critical applications at various Department of Energy Leadership Computing Facilities. Meanwhile, ADIOS is leading the push for next-generation techniques, including staging and data processing pipelines. In this paper, we describe the startling observations we have made in the last half decade of I/O research and development, and elaborate on the lessons we have learned along this journey. We also detail some of the challenges that remain as we look toward the coming exascale era.
high performance distributed computing | 2009
Phillip M. Dickens; Jeremy Logan
It is widely known that MPI-IO performs poorly in a Lustre file system environment, although the reasons for such performance are currently not well understood. The research presented in this paper strongly supports our hypothesis that MPI-IO performs poorly in this environment because of the fundamental assumptions upon which most parallel I/O optimizations are based. In particular, it is almost universally believed that parallel I/O performance is optimized when aggregator processes perform large, contiguous I/O operations in parallel. Our research shows that this approach generally provides the worst performance in a Lustre environment, and that the best performance is often obtained when the aggregator processes perform a large number of small, non-contiguous I/O operations. In this paper, we first demonstrate and explain these non-intuitive results. We then present a user-level library, termed Y-lib, which redistributes data in a way that conforms much more closely with the Lustre storage architecture than does the data redistribution pattern employed by MPI-IO. We then provide experimental results showing that Y-lib can increase performance between 300% and 1000% depending on the number of aggregator processes and file size. Finally, we cause MPI-IO itself to use our data redistribution scheme, and show that doing so results in an increase in performance of a similar magnitude when compared to the current MPI-IO data redistribution algorithms.
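As a rough illustration of why the layout of I/O requests matters here, the sketch below assumes Lustre's round-robin striping, in which consecutive stripes of a file are placed on consecutive object storage targets (OSTs). It counts how many OSTs each aggregator touches under two partitionings: one large contiguous block per aggregator, and a stripe-aligned assignment in which each aggregator handles only the stripes that live on a single OST. The function names and both partitioning rules are hypothetical and are not taken from Y-lib or MPI-IO.

```python
def ost_for_offset(offset, stripe_size, stripe_count):
    """OST index holding a byte offset, assuming round-robin striping."""
    return (offset // stripe_size) % stripe_count


def osts_touched_contiguous(file_size, n_aggregators, stripe_size, stripe_count):
    """Each aggregator owns one large contiguous block of the file."""
    block = file_size // n_aggregators
    touched = []
    for agg in range(n_aggregators):
        start, end = agg * block, (agg + 1) * block
        # Sample one offset per stripe-sized step inside the block.
        touched.append({
            ost_for_offset(off, stripe_size, stripe_count)
            for off in range(start, end, stripe_size)
        })
    return touched


def osts_touched_stripe_aligned(file_size, n_aggregators, stripe_size, stripe_count):
    """Each aggregator owns many small, non-contiguous stripes from the same OST(s)."""
    touched = [set() for _ in range(n_aggregators)]
    for stripe in range(file_size // stripe_size):
        ost = stripe % stripe_count
        # Hypothetical rule: pin every OST to a single aggregator.
        touched[ost % n_aggregators].add(ost)
    return touched


MIB = 1 << 20
contiguous = osts_touched_contiguous(64 * MIB, 4, MIB, 4)
aligned = osts_touched_stripe_aligned(64 * MIB, 4, MIB, 4)
print([len(s) for s in contiguous])  # OSTs contacted per aggregator
print([len(s) for s in aligned])
```

Under these assumptions, the contiguous partitioning makes every aggregator contact every OST, while the stripe-aligned one gives each aggregator a single OST at the cost of many small, non-contiguous requests. The sketch says nothing about actual throughput, which the paper measures experimentally.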
international conference on parallel processing | 2012
Jeremy Logan; Scott Klasky; Hasan Abbasi; Qing Liu; George Ostrouchov; Manish Parashar; Norbert Podhorszki; Yuan Tian; Matthew Wolf
We address the difficulty involved in obtaining meaningful measurements of I/O performance in HPC applications, as well as the further challenge of understanding the causes of I/O bottlenecks in these applications. The need for I/O optimization is critical given the difficulty in scaling I/O to ever-increasing numbers of processing cores. To address this need, we have pioneered a new approach to the analysis of I/O performance using automatic generation of I/O benchmark codes given a high-level description of an application's I/O pattern. By combining this with low-level characterization of the performance of the various components of the underlying I/O method, we are able to produce a complete picture of the I/O behavior of an application. We compare the performance measurements obtained using Skel, the tool that implements our approach, with those of an instrumented version of the original application to show that our approach is accurate. We demonstrate the use of Skel to compare the performance of several I/O methods. Finally, we show that the detailed breakdown of timing information produced by Skel provides better understanding of the reasons for the performance differences between the examined I/O methods. We conclude that our approach facilitates faster, more accurate and more meaningful I/O performance testing, allowing application I/O performance to be predicted, and new systems and I/O methods to be evaluated.
OTM '08 Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. Part I on On the Move to Meaningful Internet Systems | 2008
Phillip M. Dickens; Jeremy Logan
Lustre is becoming an increasingly important file system for large-scale computing clusters. The problem is that many data-intensive applications use MPI-IO for their I/O requirements, and it has been well documented that MPI-IO performs poorly in a Lustre file system environment. However, the reasons for such poor performance are not currently well understood. We believe that the primary reason for poor performance is that the assumptions underpinning most of the parallel I/O optimizations implemented in MPI-IO do not hold in a Lustre environment. Perhaps the most important assumption that appears to be incorrect is that optimal performance is obtained by performing large, contiguous I/O operations. Our research suggests that this is often the worst approach to take in a Lustre file system. In fact, we found that the best performance is sometimes achieved when each process performs a series of smaller, non-contiguous I/O requests. In this paper, we provide experimental results showing that such assumptions do not apply in Lustre, and explore new approaches that appear to provide significantly better performance.
ieee conference on mass storage systems and technologies | 2013
Yuan Tian; Zhuo Liu; Scott Klasky; Bin Wang; Hasan Abbasi; Shujia Zhou; Norbert Podhorszki; Tom Clune; Jeremy Logan; Weikuan Yu
In the era of petascale computing, more scientific applications are being deployed on leadership-scale computing platforms to enhance scientific productivity. Many I/O techniques have been designed to address the growing I/O bottleneck on large-scale systems by handling massive scientific data in a holistic manner. While such techniques have been leveraged in a wide range of applications, they have not been shown to be adequate for many mission-critical applications, particularly in the data post-processing stage. For example, some scientific applications generate datasets composed of a vast number of small data elements that are organized along many spatial and temporal dimensions but require sophisticated data analytics on one or more dimensions. Including such dimensional knowledge in the data organization can benefit the efficiency of data post-processing, but this knowledge is often missing from existing I/O techniques. In this study, we propose a novel I/O scheme named STAR (Spatial and Temporal AggRegation) to enable high-performance data queries for scientific analytics. STAR is able to dive into the massive data, identify the spatial and temporal relationships among data variables, and accordingly organize them into an optimized multi-dimensional data structure before writing them to storage. This technique not only facilitates the common access patterns of data analytics, but also further reduces the application turnaround time. In particular, STAR is able to enable efficient data queries along the time dimension, a practice common in scientific analytics but not yet supported by existing I/O techniques. In our case study with GEOS-5, a critical climate modeling application, the experimental results on the Jaguar supercomputer demonstrate an improvement of up to 73 times in read performance compared to the original I/O method.
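To make the point about time-dimension queries concrete, the following sketch compares two flat layouts of the same data: one written timestep by timestep, and one in which all timesteps of a spatial point are stored together. It counts how many contiguous runs of elements must be read to fetch the full time series of one point. This is only an illustration of why data organization affects such queries; the layouts and function names are hypothetical and do not reproduce STAR's actual data structure.

```python
def contiguous_runs(offsets):
    """Number of contiguous element runs needed to read the given offsets."""
    if not offsets:
        return 0
    ordered = sorted(offsets)
    runs = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


def offsets_time_major(point, n_steps, n_points):
    """Layout [t][x]: each timestep is written as one block of all points."""
    return [t * n_points + point for t in range(n_steps)]


def offsets_point_major(point, n_steps, n_points):
    """Layout [x][t]: all timesteps of a point are aggregated together."""
    return [point * n_steps + t for t in range(n_steps)]


n_steps, n_points, point = 100, 1000, 7
print(contiguous_runs(offsets_time_major(point, n_steps, n_points)))
print(contiguous_runs(offsets_point_major(point, n_steps, n_points)))
```

In the time-major layout, a time series query touches one isolated element per timestep, whereas the aggregated layout serves it from a single contiguous region.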
international conference on cluster computing | 2008
Jeremy Logan; Phillip M. Dickens
Lustre is becoming an increasingly important file system for large-scale computing clusters. The problem, however, is that many data-intensive applications use MPI-IO for their I/O requirements, and MPI-IO performs poorly in a Lustre file system environment. While this poor performance has been well documented, the reasons for such performance are currently not well understood. Our research suggests that the primary performance issues have to do with the assumptions underpinning most of the parallel I/O optimizations implemented in MPI-IO, which do not appear to hold in a Lustre environment. Perhaps the most important assumption is that optimal performance is obtained by performing large, contiguous I/O operations. However, the research results presented in this poster show that this is often the worst approach to take in a Lustre file system. In fact, we found that the best performance is often achieved when each process performs a series of smaller, non-contiguous I/O requests. In this poster, we provide experimental results supporting these non-intuitive ideas, and provide alternative approaches that significantly enhance the performance of MPI-IO in a Lustre file system.
computational science and engineering | 2013
Stephen Herbein; M. Matheny; Matthew Wezowicz; Jaron T. Krogel; Jeremy Logan; Jeongnim Kim; Scott Klasky
Traditional petascale applications, such as QMCPack, can scale their computations to completely utilize modern supercomputers like Titan, but they cannot scale their I/O. To preserve scalability, scientists cannot save data at the granularity needed to enable scientific discovery and are forced to use large intervals between two checkpoint calls. In this paper, we work to increase the granularity of the I/O in QMCPack simulations without increasing the I/O-associated overhead or compromising the scalability of the simulations. Our solution redesigns the I/O algorithms used by QMCPack to gather finer-grained data at high frequencies and integrates the ADIOS API to select effective I/O methods without major code changes. The extension of a tool such as Skel to mimic the variable I/O in QMCPack allows us to predict the I/O performance of the code when using ADIOS methods at the petascale. We show how I/O libraries like ADIOS allow us to increase the amount of scientific data extracted from QMCPack simulations at the granularity desired by the scientists while keeping the I/O overhead below 10%. We also show how the impact of checkpoint I/O for the QMCPack code using ADIOS is below 5% when using preventive tactics for checkpointing at the petascale and beyond.
IEEE Design & Test of Computers | 2014
Samuel Schlachter; Stephen Herbein; Shuching Ou; Jeremy Logan; Sandeep Patel
To address the challenge of pursuing coordinated trajectory progression and efficient resource utilization of GPU-enabled molecular dynamics (MD) simulations on nondedicated high-end clusters, our work aims to supplement, rather than rewrite, existing workflow and resource managers. To this end, we propose a companion module that complements workflow managers and a wrapper module that supports resource managers. We model the maximum utilization of our approach in comparison to the traditional approach for two molecular simulations: a sodium dodecyl sulfate (SDS) system with dynamically variable job runtimes and a carbon nanotube system with hardware and application failures. With our solution, we estimate increased utilization in both simulations. These findings implicitly assure a more coordinated progression of long trajectories across GPUs on a cluster's nodes.
international conference on distributed computing systems | 2017
Scott Klasky; E. Suchyta; Mark Ainsworth; Qing Liu; Ben Whitney; Matthew Wolf; Jong Youl Choi; Ian T. Foster; Mark Kim; Jeremy Logan; Kshitij Mehta; Todd S. Munson; George Ostrouchov; Manish Parashar; Norbert Podhorszki; David Pugmire; Lipeng Wan
As we continue toward exascale, scientific data volume is continuing to scale and becoming more burdensome to manage. In this paper, we lay out opportunities to enhance state-of-the-art data management techniques. We emphasize well-principled data compression and its use to achieve progressive refinement. This can both accelerate I/O and afford the user increased flexibility when she interacts with the data. The formulation naturally maps onto partitioning the progressively improving-quality representations of a data quantity into different media-type destinations, to keep the highest-priority information as close as possible to the computation, and to take advantage of deepening memory/storage hierarchies in ways not previously possible. Careful monitoring is requisite to our vision, not only to verify that compression has not eliminated salient features in the data, but also to better understand the performance of massively parallel scientific applications. Increased mathematical rigor would be ideal, to help put compression on a better-understood theoretical footing, closer to the relevant scientific theory, more aware of constraints imposed by the science, and more tightly error-controlled. Throughout, we highlight pathfinding research we have begun exploring related to these topics, and comment on future work that will be needed.
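The idea of progressive refinement can be illustrated with a deliberately simple decomposition: a coarse subsample that could sit in fast storage, plus a residual that restores the full-resolution data when it is fetched. The sketch below assumes a one-dimensional integer signal and a nearest-sample approximation. The scheme and the function names are hypothetical and are far simpler than the error-controlled compression the paper has in mind.

```python
def decompose(values, stride):
    """Split values into a coarse level and a residual against that level."""
    coarse = values[::stride]
    approx = [coarse[i // stride] for i in range(len(values))]
    residual = [v - a for v, a in zip(values, approx)]
    return coarse, residual


def reconstruct(coarse, stride, length, residual=None):
    """Rebuild from the coarse level alone, or refine it with the residual."""
    approx = [coarse[i // stride] for i in range(length)]
    if residual is None:
        return approx
    return [a + r for a, r in zip(approx, residual)]


values = [10, 11, 13, 12, 20, 22, 21, 19]
coarse, residual = decompose(values, stride=4)

preview = reconstruct(coarse, 4, len(values))           # coarse level only
exact = reconstruct(coarse, 4, len(values), residual)   # coarse + residual

print(max(abs(v - p) for v, p in zip(values, preview)))  # error of the preview
print(exact == values)
```

A reader of the data can start from the small coarse level and request the residual only when the preview's error is too large for the analysis at hand.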
intelligent data acquisition and advanced computing systems: technology and applications | 2009
Jeremy Logan; Phillip M. Dickens
Our research has been investigating a new approach to parallel I/O based on what we term objects. The premise of this research is that the primary obstacle to scalable I/O is the legacy view of a file as a linear sequence of bytes. The problem is that applications rarely access their data in a way that conforms to this data model, using instead what may be termed an object model, where each process accesses a (perhaps disjoint) collection of objects. We have developed an object-based caching system that provides an interface between MPI applications and a more powerful object file model, and have demonstrated significant performance gains based on this new approach. In this paper, we further explore the advantages that can be gained from using object-based I/O. In particular, we demonstrate that parallel I/O based on objects (termed parallel object I/O) can be dynamically remapped. That is, one application can output an object stream based on one object set, this can be captured and translated into a different object set that is more appropriate for another application. We demonstrate how such remapping can be accomplished, and provide an example application showing that using this technique can significantly improve I/O performance.
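The remapping described above can be sketched with a small two-dimensional dataset: a producer emits objects that are blocks of rows, and a consumer prefers objects that are blocks of columns. The code below captures the row-block objects and translates them into column-block objects. The object representation (a dictionary keyed by object id) and the function names are hypothetical and are not the interface of the object-based caching system.

```python
def split_rows(matrix, rows_per_object):
    """Producer side: emit objects that each hold a block of whole rows."""
    return {
        start // rows_per_object: matrix[start:start + rows_per_object]
        for start in range(0, len(matrix), rows_per_object)
    }


def remap_to_columns(row_objects, cols_per_object):
    """Translate row-block objects into column-block objects."""
    # Reassemble the rows in object-id order.
    rows = [row for key in sorted(row_objects) for row in row_objects[key]]
    n_cols = len(rows[0])
    return {
        start // cols_per_object: [row[start:start + cols_per_object] for row in rows]
        for start in range(0, n_cols, cols_per_object)
    }


matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
row_objects = split_rows(matrix, rows_per_object=2)
col_objects = remap_to_columns(row_objects, cols_per_object=2)
print(col_objects[0])
print(col_objects[1])
```

For simplicity this sketch gathers all rows before re-splitting; a streaming remapper would instead forward each incoming object's pieces to the target objects as they arrive.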