Scott Pakin
Los Alamos National Laboratory
Publications
Featured research published by Scott Pakin.
IEEE International Conference on High Performance Computing, Data and Analytics | 2008
Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho
Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multi-core processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
Conference on High Performance Computing (Supercomputing) | 2006
Adolfy Hoisie; Greg Johnson; Darren J. Kerbyson; Michael Lang; Scott Pakin
This work provides a performance analysis of three leading supercomputers that have recently been deployed: Purple, Red Storm, and Blue Gene/L. These machines are architecturally diverse and have very different performance characteristics. Each contains over 10,000 processors and has a system peak of over 40 Teraflops. We analyze each system using a range of micro-benchmarks that include communication performance and quantify the impact of the operating system. The achievable application performance is compared across the systems and confirmed via detailed application models that use the underlying performance characteristics measured by the micro-benchmarks. We also compare the machines in a realistic production scenario in which each machine is used so as to maximize its memory usage, with the applications executed in a weak-scaling mode. The results also help illustrate that achievable performance is not directly related to peak performance.
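The weak-scaling mode used in the production scenario above keeps each node's local problem size fixed as nodes are added. A minimal sketch of the arithmetic (illustrative only, not the paper's analysis code):

```python
def weak_scaled_problem_size(per_node_elems, nodes):
    # Weak scaling: each node keeps the same local problem size,
    # so the global problem grows linearly with node count.
    return per_node_elems * nodes

def weak_scaling_efficiency(t_1, t_p):
    # Ideal weak scaling keeps runtime flat as nodes are added,
    # so efficiency is the one-node time over the P-node time.
    return t_1 / t_p
```

With each node's memory filled, the global problem size is simply the per-node size times the node count, and any growth in runtime shows up directly as lost efficiency.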
IEEE Computer | 2009
Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho
A methodology for accurately modeling large applications explores the performance of ultrascale systems at different stages in their life cycle, from early design through production use.
Conference on High Performance Computing (Supercomputing) | 2002
Eitan Frachtenberg; Fabrizio Petrini; Juan C. Fernandez; Scott Pakin; Salvador Coll
Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem — essentially, all of the code that runs on a cluster other than the applications — increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.
Conference on High Performance Computing (Supercomputing) | 2004
Kei Davis; Adolfy Hoisie; Greg Johnson; Darren J. Kerbyson; Michael Lang; Scott Pakin; Fabrizio Petrini
Based on a set of measurements made on the 512-node 500 MHz prototype and early results on a 2,048-node 700 MHz BlueGene/L machine at IBM Watson, we present a performance and scalability analysis of the architecture, from low-level characteristics to large-scale applications. In addition, we present predictions from our models for the performance of two representative applications from the ASC workload on the full BlueGene/L configuration of 64K nodes. We compared the measured values for several of the benchmarks in our suite against the numbers predicted by our performance models; in general, the error bars were relatively low. A comparison between the performance of BlueGene/L and ASCI Q, the largest supercomputer in the US, is also presented, again based on our predictive performance models.
International Parallel and Distributed Processing Symposium | 2008
Scott Pakin
Providing point-to-point message-passing semantics atop Put/Get hardware traditionally involves implementing a protocol comprising three network latencies. In this paper, we analyze the performance of an alternative implementation approach, receiver-initiated message passing, which eliminates one of the three network latencies. Performance measurements taken on the Cell Broadband Engine indicate that receiver-initiated message passing exhibits substantially lower latency than standard, sender-initiated message passing.
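A toy model of the two protocols (a sketch of the idea, not the paper's Cell implementation): sender-initiated rendezvous needs a request, a reply, and the data transfer, whereas receiver-initiated messaging lets the payload move after a single control message.

```python
def sender_initiated_steps():
    # Classic rendezvous over Put/Get hardware: the sender sends a
    # request-to-send, the receiver replies with its buffer address,
    # and the sender Puts the payload -- three network traversals.
    return ["RTS ->", "<- CTS", "Put(data) ->"]

def receiver_initiated_steps():
    # Receiver-initiated: the receiver advertises its posted buffer,
    # and the sender Puts the payload directly -- two traversals.
    return ["<- recv posted", "Put(data) ->"]

def critical_path_us(steps, link_latency_us):
    # Each traversal contributes one link latency to the critical path.
    return len(steps) * link_latency_us
```

In this simplified accounting, eliminating one of three latencies cuts the small-message critical path by a third, consistent with the latency advantage the paper measures.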
IEEE Transactions on Parallel and Distributed Systems | 2007
Scott Pakin
CONCEPTUAL is a toolset designed specifically to help measure the performance of high-speed interconnection networks such as those used in workstation clusters and parallel computers. It centers around a high-level domain-specific language, which makes it easy for a programmer to express, measure, and report the performance of complex communication patterns. The primary challenge in implementing a compiler for such a language is that the generated code must be extremely efficient so as not to misattribute overhead costs to the messaging library. At the same time, the language itself must not sacrifice expressiveness for compiler efficiency, or there would be little point in using a high-level language for performance testing. This paper describes the CONCEPTUAL language and the CONCEPTUAL compiler's novel code-generation framework. The language provides primitives for a wide variety of idioms needed for performance testing and emphasizes a readable syntax. The core code-generation technique, based on unrolling CONCEPTUAL programs into sequences of communication events, is simple yet enables the efficient implementation of a variety of high-level constructs. The paper further explains how CONCEPTUAL implements time-bounded loops, even those that comprise blocking communication, in the absence of a time-out mechanism, as this is a somewhat unique language/implementation feature.
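One way to bound a loop's duration with no time-out mechanism at all is to time a few sample iterations, estimate how many more fit in the budget, and then run exactly that many. The sketch below illustrates this general idea only; it is not CONCEPTUAL's generated code, and it assumes iterations take roughly uniform time.

```python
import time

def time_bounded_loop(body, budget_s, sample=3):
    """Run `body` for roughly `budget_s` seconds without timer interrupts:
    time a few sample iterations, estimate how many more fit in the
    remaining budget, then execute exactly that many.  Returns the total
    iteration count."""
    start = time.perf_counter()
    for _ in range(sample):
        body()
    elapsed = time.perf_counter() - start
    per_iter = elapsed / sample
    extra = int(max(0.0, budget_s - elapsed) / per_iter) if per_iter > 0 else 0
    for _ in range(extra):
        body()
    return sample + extra
```

Because the iteration count is fixed before the final run, even blocking communication inside the body can never leave the loop waiting on a time-out check, at the cost of some inaccuracy when iteration times vary.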
IEEE International Conference on High Performance Computing, Data and Analytics | 2013
Marc Gamell; Ivan Rodero; Manish Parashar; Janine C. Bennett; Hemanth Kolla; Jacqueline H. Chen; Peer-Timo Bremer; Aaditya G. Landge; Attila Gyulassy; Patrick S. McCormick; Scott Pakin; Valerio Pascucci; Scott Klasky
As scientific applications target exascale, challenges related to data and energy are becoming dominant concerns. For example, coupled simulation workflows are increasingly adopting in-situ data processing and analysis techniques to address the costs and overheads of data movement and I/O. However, it is also critical to understand these overheads and the associated trade-offs from an energy perspective. The goal of this paper is to explore data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, this paper presents: (1) an analysis of the data-related behaviors of a combustion simulation workflow with an in-situ data analytics pipeline, running on the Titan system at ORNL; (2) an empirically validated power model based on system power and data-exchange patterns; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance trade-offs on current as well as emerging systems.
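A model of this kind might, in its simplest assumed form (illustrative only, not the paper's validated equations), charge a base system power over the runtime plus a per-byte cost for each data exchange:

```python
def workflow_energy_joules(base_power_w, runtime_s, transfers):
    """Toy energy model: E = P_base * T + sum(bytes_i * J_per_byte_i).
    `transfers` is a list of (bytes_moved, joules_per_byte) pairs; all
    coefficients here are hypothetical, not measured Titan values."""
    static_energy = base_power_w * runtime_s
    data_energy = sum(nbytes * j_per_byte for nbytes, j_per_byte in transfers)
    return static_energy + data_energy
```

Moving the analytics pipeline in situ changes only the data-exchange term, which is how such a model exposes the energy/performance trade-offs of different workflow placements.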
International Conference on Supercomputing | 2011
Xing Wu; Frank Mueller; Scott Pakin
Portable parallel benchmarks are widely used and highly effective for (a) the evaluation, analysis, and procurement of high-performance computing (HPC) systems and (b) quantifying the potential benefits of porting applications to new hardware platforms. Yet past techniques for synthetically parameterizing hand-coded HPC benchmarks prove insufficient for today's rapidly evolving scientific codes, particularly those subject to multi-scale science modeling or utilizing domain-specific libraries. To address these problems, this work contributes novel methods to automatically generate highly portable and customizable communication benchmarks from HPC applications. We utilize ScalaTrace, a lossless yet scalable parallel application tracing framework, to collect selected aspects of the run-time behavior of HPC applications. We subsequently generate benchmarks with identical run-time behavior from the collected traces in the CONCEPTUAL language, a domain-specific language that enables the expression of sophisticated communication patterns using a rich and easily understandable grammar yet compiles to ordinary C+MPI. Experimental results demonstrate that the generated benchmarks preserve the run-time behavior of the original applications. To our knowledge, this ability to automatically generate performance-accurate benchmarks from parallel applications is without precedent.
International Conference on Cluster Computing | 2007
Scott Pakin; Greg Johnson
Large-scale parallel applications often produce immense quantities of data that need to be analyzed. To avoid performing repeated, costly disk accesses, analysis of large data sets generally requires a commensurately large amount of memory. While some data-analysis tools can easily be parallelized to distribute memory across a cluster, other tools are either difficult to parallelize or, in the case of simple data-analysis scripts with short lifespans, not worth the effort to parallelize. In this work, we present and analyze the performance of JumboMem, a simple, entirely user-level parallel program that enables unmodified sequential applications to access all of the memory in a cluster. Although there are many implementations of memory servers, all require either administrative privileges or program modifications. More importantly, no existing memory server has been evaluated on modern workstation clusters with high-speed networks, many nodes, and significant quantities of memory. This paper represents the first study of memory-server performance at supercomputing scales.
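The core mechanism of a memory server can be sketched as a local page cache backed by remote memory. The toy below is conceptual only, not JumboMem's user-level implementation (which transparently intercepts the memory accesses of unmodified programs):

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class ToyMemoryServer:
    """Local page cache backed by 'cluster' memory: hits are served
    locally; on a miss, the least-recently-used local page is evicted
    to the remote store and the requested page is fetched."""
    def __init__(self, local_pages):
        self.local = OrderedDict()   # page id -> data, in LRU order
        self.remote = {}             # stand-in for other nodes' memory
        self.capacity = local_pages
        self.misses = 0

    def access(self, page_id):
        if page_id in self.local:            # hit: local-memory speed
            self.local.move_to_end(page_id)
            return self.local[page_id]
        self.misses += 1                     # miss: one network round trip
        data = self.remote.pop(page_id, bytearray(PAGE_SIZE))
        if len(self.local) >= self.capacity: # evict LRU page to remote
            victim, victim_data = self.local.popitem(last=False)
            self.remote[victim] = victim_data
        self.local[page_id] = data
        return data
```

The appeal of this design on a modern cluster is that a "miss" costs a network round trip rather than a disk access, so a sequential application whose working set exceeds local memory can still run at reasonable speed.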