
Publication


Featured research published by Scott Klasky.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

Output Performance Study on a Production Petascale Filesystem

Bing Xie; Jeffrey S. Chase; David A. Dillow; Scott Klasky; Jay F. Lofstead; H. Sarp Oral; Norbert Podhorszki

This paper reports our observations from Titan, a top-tier supercomputer, and its Lustre parallel file stores under production load. In summary, we find that supercomputer file systems are highly variable across the machine at fine time scales. This variability has two major implications. First, stragglers lessen the benefit of coupled I/O parallelism (striping). Peak median output bandwidths are obtained with parallel writes to many independent files, with no striping or write-sharing of files across clients (compute nodes). I/O parallelism is most effective when the application—or its I/O middleware system—distributes the I/O load so that each client writes separate files on multiple targets, and each target stores files for multiple clients, in a balanced way. Second, our results suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify “good spots” in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we do not observe regular diurnal load patterns.
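
The balanced, non-striped layout the study recommends can be made concrete with a small sketch (a toy illustration; the function and the node/OST names are invented, not from the paper):

def balanced_placement(clients, targets):
    """Round-robin assignment of one independent file per client to
    storage targets, so each target serves files from multiple clients
    and no target is oversubscribed (no striping or write-sharing)."""
    return {c: targets[i % len(targets)] for i, c in enumerate(clients)}

# 8 compute nodes writing independent files across 4 OSTs:
layout = balanced_placement([f"node{i}" for i in range(8)],
                            [f"ost{j}" for j in range(4)])
print(layout)  # each OST stores files for exactly two clients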


SIAM Journal on Scientific Computing | 2017

Compression Using Lossless Decimation: Analysis and Application

Mark Ainsworth; Scott Klasky; Ben Whitney

A crude but commonly used technique for compressing ordered scientific data consists of simply retaining every s-th datum (with a value of s = 10 generally the default) and discarding the remainder. Should the value of a discarded datum be required afterwards, an approximation is generated by linear interpolation of the two nearest retained values. Despite the widespread use of this and similar techniques, there is little by way of theoretical analysis of their expected performance. First, we quantify the accuracy achieved by linear interpolation when approximating values discarded by decimation, obtaining both deterministic bounds in terms of appropriate smoothness measures of the data and probabilistic bounds in terms of statistics of the data. Second, we investigate the efficiency of the lossless compression scheme consisting of decimation coupled with encoding of the interpolation errors. In particular, we bound the expected compression ratio in terms of the appropriate measures of the data. Finally,...
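
To make the decimation scheme concrete, here is a minimal sketch (my own illustration, assuming NumPy; the function names are not from the paper):

import numpy as np

def decimate(data, s=10):
    """Keep every s-th datum; discard the rest."""
    return data[::s]

def reconstruct(retained, s, n):
    """Approximate the original n values by linear interpolation of
    the retained samples at indices 0, s, 2s, ... (values past the
    last retained index are held constant by np.interp)."""
    kept_idx = np.arange(0, n, s)
    return np.interp(np.arange(n), kept_idx, retained)

# Example: compress a smooth signal and measure interpolation error.
x = np.linspace(0, 2 * np.pi, 1000)
data = np.sin(x)
retained = decimate(data, s=10)
approx = reconstruct(retained, s=10, n=len(data))
print("max interpolation error:", np.abs(data - approx).max())

Encoding the residuals data - approx losslessly, rather than discarding them, yields the lossless scheme whose compression ratio the paper bounds.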


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2017

SELF: A High Performance and Bandwidth Efficient Approach to Exploiting Die-Stacked DRAM as Part of Memory

Yuhua Guo; Qing Liu; Weijun Xiao; Ping Huang; Norbert Podhorszki; Scott Klasky; Xubin He

Die-stacked DRAM (a.k.a., on-chip DRAM) provides much higher bandwidth and lower latency than off-chip DRAM. It is a promising technology to break the memory wall. Die-stacked DRAM can be used either as a cache (i.e., DRAM cache) or as a part of memory (PoM). A DRAM cache design would suffer from more page faults than a PoM design as the DRAM cache cannot contribute towards capacity of main memory. At the same time, obtaining high performance requires PoM systems to swap requested data to the die-stacked DRAM. Existing PoM designs fall into two categories – line-based and page-based. The former ensures low off-chip bandwidth utilization but suffers from a low hit ratio of on-chip memory due to limited temporal locality. In contrast, page-based designs achieve a high hit ratio of on-chip memory albeit at the cost of moving large amounts of data between on-chip and off-chip memories, leading to increased off-chip bandwidth utilization and significant system performance degradation. To achieve a similar high hit ratio of on-chip memory as page-based designs, and eliminate excessive off-chip traffic involved, we propose SELF, a high performance and bandwidth efficient approach. The key idea is to SElectively swap Lines in a requested page that are likely to be accessed according to page Footprint, instead of blindly swapping an entire page. In doing so, SELF allows incoming requests to be serviced from the on-chip memory as much as possible, while avoiding swapping unused lines to reduce memory bandwidth consumption. We evaluate a memory system which consists of 4GB on-chip DRAM and 12GB off-chip DRAM. Compared to a baseline system that has the same total capacity of 16GB off-chip DRAM, SELF improves the performance in terms of instructions per cycle by 26.9%, and reduces the energy consumption per memory access by 47.9% on average. In contrast, state-of-the-art line-based and page-based PoM designs can only improve the performance by 9.5% and 9.9%, respectively, against the same baseline system.
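
A minimal sketch of the footprint idea (a software toy, not the SELF hardware design; the class, line granularity, and names are invented for illustration):

LINES_PER_PAGE = 32  # e.g., a 2KB page of 64B cache lines

class PageFootprint:
    """Tracks which lines of a page were touched during its last
    residency, as a bit vector."""
    def __init__(self):
        self.bits = 0

    def record_access(self, line):
        self.bits |= 1 << line

    def hot_lines(self):
        return [i for i in range(LINES_PER_PAGE) if self.bits >> i & 1]

def swap_in(page_id, footprints, fetch_line):
    """On a page swap, move only the lines the footprint predicts will
    be accessed, instead of blindly copying the whole page."""
    fp = footprints.setdefault(page_id, PageFootprint())
    lines = fp.hot_lines() or [0]  # cold page: fetch the demanded line only
    for line in lines:
        fetch_line(page_id, line)  # copy line from off-chip to on-chip DRAM
    return lines

# Usage: record accesses while a page is resident, then swap selectively.
moved, table = [], {}
swap_in(7, table, lambda pid, ln: moved.append((pid, ln)))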


International Conference on Cluster Computing | 2017

TGE: Machine Learning Based Task Graph Embedding for Large-Scale Topology Mapping

Jong Youl Choi; Jeremy Logan; Matthew Wolf; George Ostrouchov; Tahsin M. Kurç; Qing Liu; Norbert Podhorszki; Scott Klasky; Melissa Romanus; Qian Sun; Manish Parashar; R.M. Churchill; Choong-Seock Chang

Task mapping is an important problem in parallel and distributed computing. The goal in task mapping is to find an optimal layout of the processes of an application (or a task) onto a given network topology. We target this problem in the context of staging applications. A staging application consists of two or more parallel applications (also referred to as staging tasks) which run concurrently and exchange data over the course of computation. Task mapping becomes a more challenging problem in staging applications, because not only is data exchanged between the staging tasks, but the processes within a staging task may also exchange data with each other. We propose a novel method, called Task Graph Embedding (TGE), that harnesses the observable graph structures of parallel applications and network topologies. TGE employs a machine learning based algorithm to find the best representation of a graph, called an embedding, onto a space in which the task-to-processor mapping problem can be solved. We evaluate and demonstrate the effectiveness of TGE experimentally with the communication patterns extracted from runs of XGC, a large-scale fusion simulation code, on Titan.
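
As a rough stand-in for the learned embedding (TGE's actual algorithm is ML-based; this sketch substitutes a simple spectral embedding and a greedy nearest-neighbor assignment, with all names invented):

import numpy as np

def spectral_embedding(adj, dim=2):
    """Embed a graph into R^dim using the smallest nontrivial
    eigenvectors of its Laplacian."""
    lap = np.diag(adj.sum(axis=1)) - adj
    vals, vecs = np.linalg.eigh(lap)
    return vecs[:, 1:dim + 1]  # skip the constant eigenvector

def map_tasks(task_adj, topo_adj):
    """Greedily assign each task to the nearest unused processor in
    the shared embedding space."""
    t_emb = spectral_embedding(task_adj)
    p_emb = spectral_embedding(topo_adj)
    free, mapping = set(range(len(p_emb))), {}
    for task in range(len(t_emb)):
        best = min(free, key=lambda p: np.linalg.norm(t_emb[task] - p_emb[p]))
        mapping[task] = best
        free.remove(best)
    return mapping

# 4 tasks in a ring mapped onto 4 processors in a line:
ring = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
line = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
print(map_tasks(ring, line))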


International Conference on Cluster Computing | 2017

Canopus: A Paradigm Shift Towards Elastic Extreme-Scale Data Analytics on HPC Storage

Tao Lu; E. Suchyta; David Pugmire; Jong Choi; Scott Klasky; Qing Liu; Norbert Podhorszki; Mark Ainsworth; Matthew Wolf

Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops Canopus, a progressive, JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs the data decimation, compression, and data storage, taking the hardware characteristics of each storage tier into consideration. With reasonably low overhead, our approach refactors simulation data into a much smaller, reduced-accuracy base dataset, and a series of deltas that are used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, Canopus provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on-the-fly. We evaluate the impact of Canopus on unstructured triangular meshes, a pervasive data model used by scientific modeling and simulations. In particular, we demonstrate the progressive data exploration of Canopus using the “blob detection” use case on the fusion simulation data.
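
A minimal sketch of the base-plus-delta refactoring idea (my own toy version using linear interpolation on a 1D array; Canopus itself co-designs decimation, compression, and tiered storage):

import numpy as np

def refactor(data, s=4):
    """Split 'data' into a reduced-accuracy base (every s-th value)
    and a delta that restores full accuracy when added back."""
    base = data[::s]
    approx = np.interp(np.arange(len(data)), np.arange(0, len(data), s), base)
    delta = data - approx  # written to a slower/larger tier
    return base, delta

def restore(base, delta, s, full_accuracy=True):
    """Analytics that tolerate reduced accuracy read only the base;
    full accuracy is recovered by fetching and adding the delta."""
    approx = np.interp(np.arange(len(delta)), np.arange(0, len(delta), s), base)
    return approx + delta if full_accuracy else approx

data = np.random.rand(1 << 10)
base, delta = refactor(data, s=4)
assert np.allclose(restore(base, delta, 4), data)  # exact with the delta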


International Conference on Cluster Computing | 2017

Extending Skel to Support the Development and Optimization of Next Generation I/O Systems

Jeremy Logan; Jong Youl Choi; Matthew Wolf; George Ostrouchov; Lipeng Wan; Norbert Podhorszki; William Godoy; Scott Klasky; Erich Lohrmann; Greg Eisenhauer; Chad Wood; Kevin A. Huck

As the memory and storage hierarchy gets deeper and more complex, it is important to have new benchmarks and evaluation tools that allow us to explore the emerging middleware solutions to use this hierarchy. Skel is a tool aimed at automating and refining this process of studying HPC I/O performance. It works by generating application I/O kernels/benchmarks as determined by a domain-specific model. This paper provides some techniques for extending Skel to address new situations and to answer new research questions. For example, we document use cases as diverse as using Skel to troubleshoot I/O performance issues for remote users, refining an I/O system model, and facilitating the development and testing of a mechanism for runtime monitoring and performance analytics. We also discuss data oriented extensions to Skel to support the study of compression techniques for Exascale scientific data management.
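
A toy model-driven generator in the spirit of Skel (the model fields and the emitted code are invented for illustration; Skel's real models are domain-specific):

def generate_kernel(model):
    """Emit a minimal write benchmark from a declarative I/O model."""
    lines = ["import numpy as np", ""]
    for var in model["variables"]:
        lines.append(f"{var['name']} = np.zeros({var['count']}, dtype='{var['dtype']}')")
    lines.append(f"with open('{model['output']}', 'wb') as f:")
    for var in model["variables"]:
        lines.append(f"    f.write({var['name']}.tobytes())")
    return "\n".join(lines)

model = {"output": "checkpoint.bin",
         "variables": [{"name": "pressure", "count": 1_000_000, "dtype": "float64"}]}
print(generate_kernel(model))  # prints a runnable I/O kernel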


International Parallel and Distributed Processing Symposium | 2018

Understanding and Modeling Lossy Compression Schemes on HPC Scientific Data

Tao Lu; Qing Liu; Xubin He; Huizhang Luo; E. Suchyta; Jong Choi; Norbert Podhorszki; Scott Klasky; Matthew Wolf; Tong Liu; Zhenbo Qiao



Eurographics Workshop on Parallel Graphics and Visualization | 2018

Dense Texture Flow Visualization using Data-Parallel Primitives

Mark Kim; Scott Klasky; David Pugmire



IEEE Transactions on Parallel and Distributed Systems | 2018

Harnessing Data Movement in Virtual Clusters for In-Situ Execution

Dan Huang; Qing Liu; Scott Klasky; Jun Wang; Jong Choi; Jeremy Logan; Norbert Podhorszki



High Performance Computing and Communications | 2017

Analysis and Modeling of the End-to-End I/O Performance on OLCF's Titan Supercomputer

Lipeng Wan; Matthew Wolf; Feiyi Wang; Jong Youl Choi; George Ostrouchov; Scott Klasky


Collaboration


Dive into Scott Klasky's collaborations.

Top Co-Authors

Norbert Podhorszki, Oak Ridge National Laboratory
Matthew Wolf, Georgia Institute of Technology
Qing Liu, Oak Ridge National Laboratory
George Ostrouchov, Oak Ridge National Laboratory
Jeremy Logan, Oak Ridge National Laboratory
Jong Choi, Oak Ridge National Laboratory
Jong Youl Choi, Oak Ridge National Laboratory
David Pugmire, Oak Ridge National Laboratory
E. Suchyta, Oak Ridge National Laboratory
Lipeng Wan, Oak Ridge National Laboratory