Jason J Hill
Oak Ridge National Laboratory
Publications
Featured research published by Jason J Hill.
IEEE International Conference on High Performance Computing, Data and Analytics | 2014
Sarp Oral; James A Simmons; Jason J Hill; Dustin B Leverman; Feiyi Wang; Matt Ezell; Ross Miller; Douglas Fuller; Raghul Gunasekaran; Young-Jae Kim; Saurabh Gupta; Devesh Tiwari; Sudharshan S. Vazhkudai; James H. Rogers; David A Dillow; Galen M. Shipman; Arthur S. Bland
The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.
Petascale Data Storage Workshop | 2015
Raghul Gunasekaran; Sarp Oral; Jason J Hill; Ross Miller; Feiyi Wang; Dustin B Leverman
The Oak Ridge Leadership Computing Facility (OLCF) is a leader in large-scale parallel file system development, design, deployment, and continuous operation. Over the last decade, the OLCF has designed and deployed two large center-wide parallel file systems. The first instantiation, Spider 1, served the Jaguar supercomputer and its predecessor; Spider 2 now serves the Titan supercomputer, among many other OLCF computational resources. The OLCF has been rigorously collecting file and storage system statistics from these Spider systems since their transition to production. In this paper we present the collected I/O workload statistics from the Spider 2 system and compare them to the Spider 1 data. Our analysis shows that the Spider 2 workload is more write-heavy than that of Spider 1 (75% vs. 60% writes, respectively). The data also show that OLCF storage policies, such as periodic purges, are effectively managing the capacity of Spider 2. Furthermore, due to improvements in the tdm_multipath and ib_srp software, we are utilizing the Spider 2 system's bandwidth and latency resources more effectively. The Spider 2 bandwidth usage statistics show that the system is operating within its design specifications. However, it is also evident that our scientific applications could be more effectively served by a burst buffer storage layer. All the data have been collected by monitoring tools developed for the Spider ecosystem. We believe the observed data set and insights will help us better design the next-generation Spider file and storage system, and will also be helpful to the larger community for building more effective large-scale file and storage systems.
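The read/write ratios quoted above come from server-side counters gathered by OLCF's monitoring tools. As a minimal sketch only, the write fraction of a workload can be computed from per-OST read/write byte counters; the CSV layout and field names below are hypothetical and are not the paper's actual tooling or data format.

```python
import csv
from collections import defaultdict

def write_fraction(stats_csv):
    """Return the fraction of I/O bytes that are writes.

    Assumes a hypothetical CSV of per-OST counters with columns:
    ost, read_bytes, write_bytes. The OLCF monitoring tools described
    in the paper use their own collection format.
    """
    totals = defaultdict(int)
    with open(stats_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals["read"] += int(row["read_bytes"])
            totals["write"] += int(row["write_bytes"])
    all_bytes = totals["read"] + totals["write"]
    return totals["write"] / all_bytes if all_bytes else 0.0

# A Spider-2-like workload would yield roughly 0.75; Spider 1, roughly 0.60.
# print(write_fraction("ost_counters.csv"))
```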
Petascale Data Storage Workshop | 2013
Feiyi Wang; Mark Nelson; Sarp Oral; Scott Atchley; Sage A. Weil; Bradley W. Settlemyer; Blake A Caldwell; Jason J Hill
Ceph is an emerging open-source parallel distributed file and storage system. By design, Ceph leverages unreliable commodity storage and network hardware and provides reliability and fault tolerance via controlled object placement and data replication. This paper presents our file and block I/O performance and scalability evaluation of Ceph for scientific high-performance computing (HPC) environments. Our work makes two unique contributions. First, our evaluation is performed under a realistic setup for a large-scale capability HPC environment using a commercial high-end storage system. Second, our path of investigation, tuning efforts, and findings made direct contributions to Ceph's development and improved its code quality, scalability, and performance. These changes should benefit both Ceph and the HPC community at large.
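The paper evaluates Ceph's file and block I/O paths with standard HPC benchmarks. Purely as a flavor of how Ceph's object layer can be probed directly, here is a minimal sketch using the python-rados bindings; the pool name, config path, and object sizes are placeholder assumptions, and this is not the benchmark setup used in the paper.

```python
import time
import rados  # python-rados bindings shipped with Ceph

def time_object_writes(pool="testpool", count=64, size=4 << 20):
    """Write `count` objects of `size` bytes and return aggregate MB/s.

    Pool name and config path are placeholders for a real deployment;
    a meaningful evaluation would use many clients and larger data sets.
    """
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    payload = b"\0" * size
    start = time.time()
    for i in range(count):
        ioctx.write_full(f"bench-obj-{i}", payload)
    elapsed = time.time() - start
    ioctx.close()
    cluster.shutdown()
    return (count * size) / elapsed / 1e6  # MB/s

# print(f"{time_object_writes():.1f} MB/s")
```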
Network Aware Data Management | 2013
Hai Ah Nam; Jason J Hill; Suzanne T Parete-Koon
The importance of computing facilities is heralded every six months with the announcement of the new Top500 list, showcasing the world's fastest supercomputers. Unfortunately, great computing capability does not come with great long-term data storage capacity. Users must therefore move their data to their local site archive, to remote sites where they may perform future computation or analysis, or back to their home institution, or else face the dreaded data purges that most HPC centers employ to keep utilization of large parallel file systems low and to manage performance and capacity. At HPC centers, data transfer is crucial to the scientific workflow and will only grow in importance as computing systems grow in size. The Energy Sciences Network (ESnet) recently launched its fifth-generation network, a 100 Gbps high-performance, unclassified national network connecting more than 40 DOE research sites to support scientific research and collaboration. Despite the tenfold increase in bandwidth to DOE research sites, which is amenable to multiple data transfer streams and high throughput, researchers in practice often under-utilize the network and resort to painfully slow single-stream transfer methods such as scp to avoid the complexity of multi-stream tools such as GridFTP and bbcp, and they contend with frustration from the inconsistency of available tools between sites. In this study we survey and assess the data transfer methods provided at several DOE-supported computing facilities, including both leadership computing facilities, connected through ESnet. We present observed transfer rates and suggested optimizations, and discuss the obstacles these tools must overcome to gain widespread adoption over scp.
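One reason multi-stream tools outperform a single scp is simply that several concurrent transfers can fill bandwidth that one TCP stream leaves idle. A minimal sketch of that idea, running several scp processes in parallel over a list of files; the host, destination path, and stream count are placeholders, and this is not one of the tools surveyed in the paper.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parallel_scp(files, dest="user@dtn.example.gov:/scratch/incoming/", streams=4):
    """Copy files with several concurrent scp processes.

    Host, path, and stream count are hypothetical; the point is only to
    illustrate why multiple streams recover bandwidth a single scp wastes.
    """
    def copy(path):
        return subprocess.run(["scp", "-q", path, dest], check=False).returncode

    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(copy, files))

# parallel_scp(["run1.h5", "run2.h5", "run3.h5", "run4.h5"])
```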
Archive | 2016
Pietro Cicotti; Sarp Oral; Gokcen Kestor; Roberto Gioiosa; Shawn Strande; James H. Rogers; Hasan Abbasi; Jason J Hill; Laura Carrington
The cost of executing a floating-point operation has been decreasing for decades at a much higher rate than that of moving data. Bandwidth and latency, the two key metrics that determine the cost of moving data, have degraded significantly relative to processor cycle time and execution rate. Despite the limitations of sub-micron processor technology and the end of Dennard scaling, this trend will continue in the short term, making data movement a performance-limiting factor and an energy/power efficiency concern, even more so in the context of large-scale and data-intensive systems and workloads. This chapter gives an overview of the aspects of moving data across a system, from the storage system to the computing system down to the node and processor level, with case studies and contributions from researchers at the San Diego Supercomputer Center, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and the University of Delaware.
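The bandwidth/latency trade-off described above is often summarized with a first-order cost model, t ≈ L + n/B, for moving n bytes over a link with startup latency L and bandwidth B. A minimal sketch follows; the parameter values are illustrative assumptions, not figures from the chapter.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """First-order data-movement cost: startup latency plus streaming time.

    Illustrative model only; the example parameters below are made up.
    """
    return latency_s + n_bytes / bandwidth_bps

# Small messages are dominated by latency, large ones by bandwidth:
# transfer_time(8, 1e-6, 10e9)        # ~1.0e-6 s (latency-bound)
# transfer_time(1 << 30, 1e-6, 10e9)  # ~0.11 s   (bandwidth-bound)
```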
Archive | 2010
Ross Miller; Jason J Hill; David A Dillow; Raghul Gunasekaran; Don Maxwell
Archive | 2010
Galen M. Shipman; David A Dillow; Sarp Oral; Feiyi Wang; Douglas Fuller; Jason J Hill; Zhe Zhang
Archive | 2012
David A Dillow; Douglas Fuller; Raghul Gunasekaran; Young-Jae Kim; H Sarp Oral; Doug M Reitz; James A Simmons; Feiyi Wang; Galen M. Shipman; Jason J Hill
Archive | 2014
Matthew A Ezell; David A Dillow; H Sarp Oral; Feiyi Wang; Devesh Tiwari; Don Maxwell; Dustin B Leverman; Jason J Hill
Archive | 2010
Raghul Gunasekaran; David A Dillow; Galen M. Shipman; Don Maxwell; Jason J Hill; Byung H. Park; Al Geist