Thomas Sterling
Indiana University
Publications
Featured research published by Thomas Sterling.
International Conference on Exascale Applications and Software | 2014
Thomas Sterling; Matthew Anderson; P. Kevin Bohan; Maciej Brodowicz; Abhishek Kulkarni; Bo Zhang
Achieving the performance potential of an Exascale machine depends on realizing both operational efficiency and scalability in high performance computing applications. This requirement has motivated the emergence of several new programming models which emphasize fine- and medium-grain task parallelism in order to address the aggravating effects of asynchrony at scale. The performance modeling of Exascale systems for these programming models requires the development of fundamentally new approaches due to the demands of both scale and complexity. This work presents a performance modeling case study of the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) proxy application where the performance modeling approach has been incorporated directly into a runtime system with two modalities of operation: computation and performance modeling simulation. The runtime system exposes performance sensitivities and projects operation to larger scales while also realizing the benefits of removing global barriers and extracting more parallelism from LULESH. Comparisons between the computation and performance modeling simulation results are presented.
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013
Matthew Anderson; Maciej Brodowicz; Abhishek Kulkarni; Thomas Sterling
Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible solution for improving efficiency and scalability in applications on this class of machines is the use of a many-tasking runtime system employing many lightweight, concurrent threads. Yet a priori estimation of the potential performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is not well understood. In this work, we present a case study of a BSP particle-in-cell benchmark code which has been ported to a many-tasking runtime system. The 3-D Gyrokinetic Toroidal code (GTC) is examined in its original MPI form and compared with a port to the High Performance ParalleX 3 (HPX-3) runtime system. Phase overlap, oversubscription behavior, and work rebalancing in the implementation are explored. Results for GTC using the SST/macro simulator complement the implementation results. Finally, an analytic performance model for GTC is presented in order to guide future implementation efforts.
Computing in Science and Engineering | 2013
Steven Gottlieb; Thomas Sterling
The guest editors discuss some recent advances in exascale computing, as well as remaining issues.
Proceedings of The International Symposium on Grids and Clouds (ISGC) 2012 — PoS(ISGC 2012) | 2012
Thomas Sterling; Matthew Anderson
The application of emergent Clouds to the domain of high performance computing is considered by examining the various operational modalities comprising the field of supercomputing and by analyzing their suitability to Clouds based on underlying factors of performance degradation. It is found that while throughput computing may be readily supported for such HPC workflows as parameter sweeps, capability computing and even weak-scaled “cooperative” computing may not be well served using conventional practices. But the possible advance of revolutionary methods to manage asynchrony, exploit message-driven computing techniques, and employ declarative synchronization semantic constructs such as those found in the experimental ParalleX execution model may provide an alternative paradigm for bringing Clouds more closely aligned to Science, Technology, Engineering, and Mathematics (STEM) applications. Experimental results capturing an Adaptive Mesh Refinement (AMR) application in numerical relativity using the ParalleX-based HPX-3 runtime system demonstrate many of the required properties for HPC Clouds.
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
This textbook presents concepts, knowledge, and skills to provide an introductory foundation for high performance computing technology, systems, and applications. It also includes vignettes on contributors and milestone accomplishments that constitute the mainstream culture of the supercomputing field and its evolution. But due to space constraints and the limitations in time of a one-semester course, not everything that might be discussed has been incorporated. This final chapter is intended to provide a sense of the expanse of the field beyond the elements already discussed. It also takes the liberty of giving an intuition of where this most rapidly changing of fields may go in the next few years.
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
Symmetric multiprocessing is the most widespread class of shared-memory compute nodes. While nodes of this type can be used as self-contained computers, they serve more frequently as a building block of larger systems such as clusters. This chapter discusses typical components of a symmetric multiprocessing node and their functions, parameters, and associated interfaces. An updated version of Amdahl's law that takes overhead into account is introduced, along with other formulae determining peak computational performance of a node and the impact of memory hierarchy on a metric known as “cycles per instruction”. Finally, a number of commonly used industry-standard interfaces are discussed that permit the attachment of additional peripheral devices and expansion boards, the performing of input/output functions, and the inspection of a node's internal state.
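The overhead-aware speedup formula referenced in this chapter can be illustrated with a short sketch. The C fragment below assumes a simple model (not necessarily the chapter's exact formulation) in which total time is the serial portion plus the parallelized portion plus a per-processor overhead; the values of T0, f, and v are placeholders chosen for illustration only.

#include <stdio.h>

/* Sketch of an overhead-aware Amdahl speedup model (assumed form):
 *   T(g) = T0*(1 - f) + T0*f/g + g*v
 * where T0 is single-processor time, f the parallelizable fraction,
 * g the processor count, and v an assumed per-processor overhead. */
static double speedup(double t0, double f, double v, int g)
{
    double t = t0 * (1.0 - f) + t0 * f / (double)g + (double)g * v;
    return t0 / t;
}

int main(void)
{
    const double t0 = 100.0; /* illustrative baseline time in seconds */
    const double f  = 0.95;  /* illustrative parallelizable fraction  */
    const double v  = 0.01;  /* illustrative per-processor overhead   */
    for (int g = 1; g <= 1024; g *= 4)
        printf("g = %4d  speedup = %6.2f\n", g, speedup(t0, f, v, g));
    return 0;
}

Unlike classic Amdahl scaling, the overhead term in this sketch eventually causes speedup to fall as the processor count grows, which is the qualitative behavior the updated law is meant to capture.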
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
Efficient resource management is critical to achieving high computational throughput on supercomputers of all sizes, ranging from small one-rack clusters to the largest installations in the world in dedicated data centers. This needs to be accomplished while accommodating the frequently conflicting requirements of all users of the system by providing a common, flexible, and easy-to-use interface. This chapter discusses various aspects of job scheduling software, describing its place in the system, fundamental components, capabilities, and associated nomenclature. Practical aspects of interaction with resource management systems are addressed by introducing a prospective user to the essential features, properties, commands, and environments of two broadly employed open-source job management suites, SLURM and Portable Batch System, illustrated with many usage examples.
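For readers unfamiliar with such job management suites, a minimal SLURM batch script of the kind introduced in this chapter might look as follows; the partition name and executable are placeholders, and site-specific options will differ.

#!/bin/bash
#SBATCH --job-name=hello_mpi       # name reported by squeue
#SBATCH --nodes=2                  # number of compute nodes requested
#SBATCH --ntasks-per-node=16       # MPI ranks to launch per node
#SBATCH --time=00:10:00            # wall-clock limit (HH:MM:SS)
#SBATCH --partition=general        # placeholder partition/queue name

# Launch the application on all allocated tasks.
srun ./hello_mpi

Such a script would typically be submitted with sbatch, monitored with squeue, and cancelled with scancel; Portable Batch System offers the analogous qsub, qstat, and qdel commands.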
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
Computer architecture is the organization of the components making up a computer system and the semantics or meaning of the operations that guide its function. As such, the computer architecture governs the design of a family of computers and defines the logical interface that is targeted by programming languages and their compilers. The organization determines the mix of functional units of which the system is composed and the structure of their interconnectivity. The architecture semantics is the meaning of what the systems do under user direction and how their functional units are controlled to work together. An important embodiment of semantics is the instruction set architecture (ISA) of the system. The ISA is a logical (usually binary) representative encoding of the basic set of distinct operations that a computer architecture may perform, and by which application programs specify the useful work to be done. At the machine level the hardware (sometimes controlled by firmware) directly interprets and executes a sequence or partially ordered set of these basic operations. This is true for all computer cores, from the few in the smallest mobile phones to the potentially millions making up the world's largest supercomputers. High performance computer architecture extends structure to a hierarchy of functional elements, whether small and limited in capability or possibly entire processor cores themselves. In this chapter many different classes of structure are presented, each exploiting concurrency in its own particular way. But in all cases this broader definition of general architecture for high performance computing emphasizes aspects of the system that contribute to achieving performance. A high performance computer is designed to go fast, and its organization and semantics are specially devised to deliver computational speed. This chapter introduces the basic foundations of computer architecture in general and for high performance computer systems in particular. It is here, at the structural and logical levels, that parallelism of operation in its many forms and sizes is first presented. This chapter provides a first examination of the principal forms of supercomputer architecture and the underlying concepts that govern their performance.
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
Accelerators, most notably graphics processing units (GPUs), are an increasingly common component of large computing installations due to their high peak performance and competitive power efficiency figures. However, because of the substantially different architecture, their programming requires an intimate knowledge of the internal organization of specific devices, complicating their integration as a system component of equal utility to that of conventional CPUs and reducing code portability. The evolution of GPU programming toolkits and libraries such as Nvidia's Compute Unified Device Architecture has made this task easier, but has not fully eliminated a significant learning curve and barrier to application. In contrast, OpenACC is considered to be one of the easiest accelerator programming environments to master, albeit at the cost of lower computational efficiency for certain problem types. It leverages the same directive-based approach as OpenMP, with which it also shares many keywords and concepts.
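To give a flavor of the directive-based approach described above, the following C sketch offloads a simple vector operation with OpenACC; the compiler invocation and problem size are illustrative, and the chapter's own examples may differ.

/* Minimal OpenACC sketch: offload a vector update to an accelerator.
 * Compile with an OpenACC-capable compiler, e.g. "nvc -acc triad.c". */
#include <stdio.h>
#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = 1.0f; c[i] = 2.0f; }

    /* The data region stages arrays on the device; the parallel loop
       directive asks the compiler to generate an accelerator kernel. */
    #pragma acc data copyin(b, c) copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 2.0f * c[i];
    }

    printf("a[0] = %f\n", a[0]);
    return 0;
}

As with OpenMP, removing the directives leaves a valid serial C program, which is a large part of OpenACC's appeal.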
Archive | 2018
Thomas Sterling; Matthew Anderson; Maciej Brodowicz
The message-passing interface (MPI) is by far the most popular library for use in applications on distributed-memory architectures. It is a standard interface for message-passing calls, and is powerful, flexible, and usable. At the risk of hyperbole, there was probably no greater achievement of practical utility for the advancement of high performance computing (HPC) than the development of MPI. While the application programming interface itself contains hundreds of commands, the HPC practitioner generally only needs a small subset of these to create a wide array of parallel applications. This chapter introduces the key MPI calls most commonly found in HPC applications. It covers point-to-point communication, MPI types, MPI collective operations, and nonblocking point-to-point communication.
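As a brief illustration of the small subset of calls the chapter focuses on, the C program below combines rank and size queries, a point-to-point exchange, and a collective reduction; the file name and launch command are illustrative only.

/* Minimal MPI sketch: each rank exchanges its id with its ring
 * neighbors and the ids are then summed onto rank 0.
 * Build with "mpicc ring.c -o ring"; run with "mpirun -np 4 ./ring". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: send to the next rank, receive from the previous. */
    int next = (rank + 1) % size, prev = (rank + size - 1) % size, recvd;
    MPI_Sendrecv(&rank, 1, MPI_INT, next, 0,
                 &recvd, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Collective: sum all rank ids onto rank 0. */
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}

The same exchange could instead use nonblocking MPI_Isend and MPI_Irecv followed by MPI_Wait, the pattern the chapter covers for overlapping communication with computation.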