Network

Latest external collaborations at the country level.

Hotspot

Research topics where Bill Kramer is active.
Publication


Featured research published by Bill Kramer.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

The International Exascale Software Project roadmap

Jack J. Dongarra; Pete Beckman; Terry Moore; Patrick Aerts; Giovanni Aloisio; Jean Claude Andre; David Barkai; Jean Yves Berthou; Taisuke Boku; Bertrand Braunschweig; Franck Cappello; Barbara M. Chapman; Xuebin Chi; Alok N. Choudhary; Sudip S. Dosanjh; Thom H. Dunning; Sandro Fiore; Al Geist; Bill Gropp; Robert J. Harrison; Mark Hereld; Michael A. Heroux; Adolfy Hoisie; Koh Hotta; Zhong Jin; Yutaka Ishikawa; Fred Johnson; Sanjay Kale; R.D. Kenway; David E. Keyes

Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining its efforts in the coordinated International Exascale Software Project.


Journal of Physics: Conference Series | 2007

The Open Science Grid

R. Pordes; D. Petravick; Bill Kramer; Doug Olson; Miron Livny; Alain Roy; P. Avery; K. Blackburn; Torre Wenaus; F. Würthwein; Ian T. Foster; Robert Gardner; Michael Wilde; Alan Blatecky; John McGee; Rob Quick

The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. OSG provides support for and evolution of the infrastructure through activities that cover operations, security, software, troubleshooting, the addition of new capabilities, support for existing communities, and engagement with new ones. The OSG SciDAC-2 project provides specific activities to manage and evolve the distributed infrastructure and support its use. The innovative aspects of the project are the maintenance and performance of a collaborative (shared and common) petascale national facility over tens of autonomous computing sites, for many hundreds of users, transferring terabytes of data a day, executing tens of thousands of jobs a day, and providing robust and usable resources for scientific groups of all types and sizes. More information can be found at the OSG web site: www.opensciencegrid.org.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Toward Exascale Resilience

Franck Cappello; Al Geist; Bill Gropp; Laxmikant V. Kalé; Bill Kramer; Marc Snir

Over the past few years, resilience has become a major issue for high-performance computing (HPC) systems, particularly in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. From current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach to resilience, which relies on automatic or application-level checkpoint/restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the HPC fault-tolerance community with a difficult challenge: finding new approaches, possibly radically disruptive, to run applications until their normal termination despite the essentially unstable nature of exascale systems. Yet the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations, and research issues considered determinant by several complementary experts in HPC applications, programming models, distributed systems, and system management.
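
The claim that checkpoint/restart breaks down once checkpoint time approaches the mean time between failures can be made concrete with Young's classic first-order formula for the optimal checkpoint interval, tau ≈ sqrt(2 · C · MTBF). The sketch below uses hypothetical numbers for illustration; it is not taken from the paper.

    import math

    def young_optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
        """Young's first-order approximation of the optimal checkpoint
        interval: tau ~= sqrt(2 * C * MTBF), with C the checkpoint cost."""
        return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

    # Hypothetical exascale-era numbers: a 30-minute checkpoint on a
    # system whose full-system MTBF has shrunk to one hour.
    C = 30 * 60      # checkpoint cost: 1800 s
    mtbf = 60 * 60   # mean time between failures: 3600 s

    tau = young_optimal_interval(C, mtbf)
    print(f"optimal checkpoint interval ~ {tau / 60:.0f} min")  # ~60 min

    # Time lost to checkpointing alone is C / (tau + C): here roughly a
    # third of the wall clock, before counting the work recomputed after
    # each failure -- the regime in which the current approach stops working.
    print(f"checkpoint overhead ~ {C / (tau + C):.0%}")         # ~33%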


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Modeling and tolerating heterogeneous failures in large parallel systems

Eric Martin Heien; Derrick Kondo; Ana Gainaru; Dan Lapine; Bill Kramer; Franck Cappello

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processor, memory, storage, and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
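
As a rough illustration of the application-centric idea, assume each component fails independently at a constant (exponential) rate fitted from logs; an application's effective failure rate is then the usage-weighted sum of the rates of the components it touches. The rates, usage vectors, and helper below are hypothetical sketches, not the paper's actual model.

    # Hypothetical per-component failure rates (failures per hour), as
    # might be fitted from system logs.
    component_rates = {"cpu": 1e-4, "memory": 5e-5, "disk": 2e-4, "network": 8e-5}

    def app_failure_rate(usage: dict) -> float:
        """Effective failure rate of an application: each component's rate
        weighted by how heavily the application uses it. Assumes
        independent, exponentially distributed component failures."""
        return sum(component_rates[c] * w for c, w in usage.items())

    # Hypothetical usage profiles: an I/O-heavy code vs. a compute-bound one.
    io_app      = {"cpu": 0.3, "memory": 0.5, "disk": 1.0, "network": 0.9}
    compute_app = {"cpu": 1.0, "memory": 1.0, "disk": 0.1, "network": 0.2}

    for name, usage in [("io-heavy", io_app), ("compute-bound", compute_app)]:
        mtbf_hours = 1.0 / app_failure_rate(usage)
        print(f"{name}: application MTBF ~ {mtbf_hours:,.0f} hours")

Feeding such per-application MTBFs into a checkpoint-interval formula like the one sketched earlier is the kind of tuning the abstract refers to: the two profiles above yield noticeably different MTBFs, hence different optimal checkpoint intervals.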


International Conference on Parallel Processing | 2011

Event log mining tool for large scale HPC systems

Ana Gainaru; Franck Cappello; Stefan Trausan-Matu; Bill Kramer

Event log files are the most common source of information for characterizing events in large-scale systems. However, the sheer size of these files makes manual analysis of log messages difficult and error prone, which is why recent research has focused on algorithms for automatically analysing log files. In this paper we present a novel methodology for extracting templates that describe event formats from large datasets, presenting intuitive and user-friendly output to system administrators. Our algorithm keeps up with rapidly changing environments by adapting its clusters to the incoming stream of events. To test our tool, we chose 5 log files with different formats that challenge different aspects of the clustering task. The experiments show that our tool outperforms all other algorithms in all tested scenarios, achieving an average precision and recall of 0.9, increasing the correct number of groups by a factor of 1.5, and decreasing the number of false positives and negatives by an average factor of 4.
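
A minimal sketch of the template-extraction idea: tokenize messages, group lines of similar shape, and replace tokens that vary within a group with a wildcard. The grouping heuristic and names below (template_of, the digit-masking key) are hypothetical illustrations; the paper's algorithm additionally adapts its clusters online as events stream in, which this sketch does not attempt.

    import re
    from collections import defaultdict

    def template_of(messages):
        """Collapse a group of same-shape messages into one template:
        token positions where all messages agree are kept, positions
        that vary become the wildcard '*'."""
        token_lists = [m.split() for m in messages]
        return " ".join(
            col[0] if len(set(col)) == 1 else "*" for col in zip(*token_lists)
        )

    logs = [
        "node c12 temperature 81 exceeds threshold",
        "node c47 temperature 93 exceeds threshold",
        "link L3 reset after timeout",
        "link L9 reset after timeout",
    ]

    # Naive grouping key: mask every token containing a digit, so lines
    # with the same fixed text land in the same group.
    groups = defaultdict(list)
    for line in logs:
        groups[re.sub(r"\S*\d\S*", "*", line)].append(line)

    for group in groups.values():
        print(template_of(group))
    # node * temperature * exceeds threshold
    # link * reset after timeout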


arXiv: Computational Physics | 2008

New science on the Open Science Grid

R. Pordes; Mine Altunay; P. Avery; Alina Bejan; K. Blackburn; Alan Blatecky; Robert Gardner; Bill Kramer; Miron Livny; John McGee; Maxim Potekhin; Rob Quick; Doug Olson; Alain Roy; Chander Sehgal; Torre Wenaus; Michael Wilde; F. Würthwein

The Open Science Grid (OSG) includes work to enable new science, new scientists, and new modalities in support of computationally based research. Transforming from the existing to the new frequently requires significant sociological and organizational changes. OSG leverages its deliverables to its large-scale physics experiment member communities to benefit new communities at all scales through activities in education, engagement, and the distributed facility. This paper gives both a brief general description and specific examples of new science enabled on the OSG. More information is available at the OSG web site: www.opensciencegrid.org.


Presented at International Conference on Computing in High Energy and Nuclear Physics (CHEP 07), Victoria, BC, Canada, 2-7 Sep 2007 | 2008

The Open Science Grid status and architecture

R. Pordes; D. Petravick; Bill Kramer; Doug Olson; Miron Livny; Alain Roy; P. Avery; K. Blackburn; Torre Wenaus; F. Würthwein; Ian T. Foster; Robert Gardner; Michael Wilde; Alan Blatecky; John McGee; Rob Quick

The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. The OSG project[1] is funded by the National Science Foundation and the Department of Energy Scientific Discovery through Advanced Computing program. The OSG project provides specific activities for the operation and evolution of the common infrastructure. The US ATLAS and US CMS collaborations contribute to and depend on OSG as the US infrastructure contributing to the World Wide LHC Computing Grid, on which the LHC experiments distribute and analyze their data. Other stakeholders include the STAR RHIC experiment, the Laser Interferometer Gravitational-Wave Observatory (LIGO), the Dark Energy Survey (DES), and several Fermilab Tevatron experiments (CDF, D0, MiniBooNE, etc.). The OSG implementation architecture brings a pragmatic approach to enabling vertically integrated, community-specific distributed systems over a common horizontal set of shared resources and services. More information can be found at the OSG web site: www.opensciencegrid.org.


Lawrence Berkeley National Laboratory | 2006

Software Roadmap to Plug and Play Petaflop/s

Bill Kramer; Jonathan Carter; David Skinner; Lenny Oliker; Parry Husbands; Paul Hargrove; John Shalf; Osni Marques; Esmond G. Ng; Tony Drummond; Katherine A. Yelick

In the next five years, the DOE expects to build systems that approach a petaflop in scale. In the near term (two years), DOE will have several near-petaflop systems that are 10 percent to 25 percent of a petaflop-scale system. A common feature of these precursors to petaflop systems (such as the Cray XT3 or the IBM BlueGene/L) is that they rely on an unprecedented degree of concurrency, which puts stress on every aspect of HPC system design. Such complex systems will likely break current best practices for fault resilience, I/O scaling, and debugging, and even raise fundamental questions about languages and application programming models. It is important that potential problems are anticipated far enough in advance that they can be addressed in time to prepare the way for petaflop-scale systems. This report considers the following four questions: (1) What software is on a critical path to make the systems work? (2) What are the strengths/weaknesses of the vendors and of existing vendor solutions? (3) What are the local strengths at the labs? (4) Who are other key players who will play a role and can help?


International Journal of High Performance Computing Applications | 2018

Big data and extreme-scale computing: Pathways to Convergence-Toward a shaping strategy for a future software and data ecosystem for scientific inquiry

M. Asch; Terry Moore; Rosa M. Badia; Micah Beck; Peter H. Beckman; T. Bidot; François Bodin; Franck Cappello; Alok N. Choudhary; Bronis R. de Supinski; Ewa Deelman; Jack J. Dongarra; Anshu Dubey; Geoffrey C. Fox; H. Fu; Sergi Girona; William D. Gropp; Michael A. Heroux; Yutaka Ishikawa; Katarzyna Keahey; David E. Keyes; Bill Kramer; J.-F. Lavignon; Y. Lu; Satoshi Matsuoka; Bernd Mohr; Daniel A. Reed; S. Requena; Joel H. Saltz; Thomas C. Schulthess

Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments, and other devices at the network's edge and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm combining both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of the massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.


Supercomputing Frontiers and Innovations: An International Journal | 2014

Toward Exascale Resilience: 2014 Update

Franck Cappello; Al Geist; William Gropp; Sanjay Kale; Bill Kramer; Marc Snir

Collaboration


Dive into Bill Kramer's collaborations.

Top Co-Authors

Franck Cappello, Argonne National Laboratory

Alain Roy, University of Wisconsin-Madison

Doug Olson, Lawrence Berkeley National Laboratory

F. Würthwein, University of California

John McGee, Renaissance Computing Institute

K. Blackburn, California Institute of Technology

Michael Wilde, Argonne National Laboratory

Miron Livny, University of Wisconsin-Madison

P. Avery, University of Florida