
Publication


Featured research published by Thomas L. Sterling.


ieee aerospace conference | 1997

Beowulf: harnessing the power of parallelism in a pile-of-PCs

D. Ridge; Donald J. Becker; Phillip Merkey; Thomas L. Sterling

The rapid increase in performance of mass-market commodity microprocessors and the significant disparity in pricing between PCs and scientific workstations have provided an opportunity for substantial gains in performance-to-cost by harnessing PC technology in parallel ensembles to provide high-end capability for scientific and engineering applications. The Beowulf project is a NASA initiative sponsored by the HPCC program to explore the potential of the Pile-of-PCs approach and to develop the methodologies necessary to apply these low-cost system configurations to NASA computational requirements in the Earth and space sciences. This paper describes the technologies and methodologies employed to achieve the increased performance of PCs. Both the opportunities afforded by this approach and the challenges confronting its application to real-world problems are discussed in the framework of hardware and software systems, as well as the results from benchmarking experiments. Finally, near-term technology trends and future directions of the Pile-of-PCs concept are considered.


high performance distributed computing | 1995

Communication overhead for space science applications on the Beowulf parallel workstation

Thomas L. Sterling; Daniel Savarese; Donald J. Becker; Bruce Fryxell; K. Olson

The Beowulf parallel workstation combines 16 PC-compatible processing subsystems and disk drives using dual Ethernet networks to provide a single-user environment with 1 Gops peak performance, half a Gbyte of disk storage, and up to 8 times the disk I/O bandwidth of conventional workstations. The Beowulf architecture establishes a new operating point in price-performance for single-user environments requiring high disk capacity and bandwidth. The Beowulf research project is investigating the feasibility of exploiting mass-market commodity computing elements in support of Earth and space science requirements for large data-set browsing and visualization, simulation of natural physical processes, and assimilation of remote sensing data. This paper reports the findings from a series of experiments characterizing the Beowulf dual-channel communication overhead. It is shown that dual networks can sustain 70% greater throughput than a single network alone, but that the achieved bandwidth is more sensitive to message size than to the number of messages at peak demand. While overhead is shown to be high for global synchronization, its overall impact on the scalability of real-world applications for computational fluid dynamics and N-body gravitational simulation is shown to be modest.
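
The sensitivity to message size noted above is what a first-order latency-bandwidth model would predict. The sketch below is offered only as an illustration in that spirit, not as the paper's own analysis; t_0 (per-message overhead) and B (asymptotic channel bandwidth) are assumed symbols.

    T(m)     = t_0 + m / B                        time to move an m-byte message
    B_eff(m) = m / T(m) = B / (1 + t_0 * B / m)   effective bandwidth delivered

For messages much smaller than t_0 * B, the effective bandwidth degenerates toward m / t_0 no matter how many messages are outstanding, which is consistent with the observation that achieved bandwidth depends more on message size than on message count.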


high performance distributed computing | 1996

A design study of alternative network topologies for the Beowulf parallel workstation

Chance Reschke; Thomas L. Sterling; Daniel Ridge; Daniel Savarese; Donald J. Becker; Phillip Merkey

Coupling PC-based commodity technology with distributed computing methodologies provides an important advance in the development of single-user dedicated systems. Beowulf is a class of experimental parallel workstations developed to evaluate and characterize the design space of this new operating point in price-performance. A key factor determining the realizable performance under real-world workloads is the means devised for interprocessor communication. A study has been performed to characterize a family of interconnect topologies feasible with low-cost, mass-market network technologies. Behavior sensitivities to packet size and traffic density are determined. Findings are presented which compare more complex segmented topologies with the earlier parallel channel-bonded scheme. It is shown that in many circumstances the more complex topologies perform better, and that in some circumstances software routing techniques compare favorably with more expensive hardware switch mechanisms.


international conference on parallel processing | 1996

Achieving a balanced low-cost architecture for mass storage management through multiple fast Ethernet channels on the Beowulf parallel workstation

Thomas L. Sterling; Donald J. Becker; M.R. Berry; Daniel Savarese; C. Reschke

A network of workstations (NOW) seeks to leverage commercial workstation technology to produce high-performance computing systems at costs appreciably lower than parallel computers specifically designed for that purpose. The capabilities of technologies emerging from the PC commodity mass market are rapidly evolving to converge with those of workstations while at significantly lower cost. A new operating point in the price-performance design space of parallel system architecture may be derived through parallelism of PC subsystems. The Pile-of-PCs (PopC) approach is being explored through the Beowulf parallel workstation, developed to provide order-of-magnitude increases in disk capacity and bandwidth for a single-user environment at costs commensurate with conventional high-end workstations. This paper explores a critical aspect of the architecture trade-off space for Beowulf associated with the balance of parallel disk throughput and internal network bandwidth. The findings presented demonstrate that the parallel channels of a commodity 100 Mbps Ethernet are both necessary and sufficient to support the data rates of multiple concurrent file transfers on a 16-processor Beowulf parallel workstation.
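
A rough capacity argument shows why parallel channels matter; the per-disk transfer rate below is a purely hypothetical figure chosen for illustration, not a number taken from the paper.

    16 disks x 5 MB/s (assumed)   =  80 MB/s  =  640 Mbit/s aggregate disk bandwidth
    one Fast Ethernet channel     = 100 Mbit/s
    640 / 100                     =  6.4 channels of worst-case traffic

Since only a fraction of the disk traffic crosses the network at any instant, a small number of parallel 100 Mbps channels can be sufficient in practice, which is the balance point the experiments above examine.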


international conference on parallel architectures and compilation techniques | 1995

An empirical evaluation of the Convex SPP-1000 hierarchical shared memory system

Thomas L. Sterling; Daniel Savarese; Phillip Merkey; Kevin Olson

Cache coherency in a scalable parallel computer architecture requires mechanisms beyond the conventional common-bus snooping approaches, which are limited to about 16 processors. The new Convex SPP-1000 achieves cache coherency across 128 processors through a two-level shared memory NUMA structure employing directory-based and SCI protocol mechanisms. While hardware support for managing a common global name space minimizes overhead costs and simplifies programming, latency considerations for remote accesses may still dominate and can, under unfavorable conditions, constrain scalability. This paper provides the first published evaluation of the SPP-1000 hierarchical cache coherency mechanisms from the perspective of measured latency and its impact on basic global flow control mechanisms, scaling of a parallel science code, and sensitivity of cache miss rates to system scale. It is shown that global remote access latency is only a factor of seven greater than the local cache miss penalty and that scaling of a challenging scientific application is not severely degraded by the hierarchical structure for achieving consistency across the system's processor caches.
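
To see why a sevenfold remote penalty need not cripple scalability, a simple average-miss-cost expression is helpful; this is a generic illustration rather than the paper's model, with f the fraction of cache misses served remotely, t_local the local miss penalty, and t_remote the remote access latency, all assumed symbols.

    t_avg = (1 - f) * t_local + f * t_remote
          = t_local * (1 + 6f)                  when t_remote = 7 * t_local

If, say, f = 0.05, the average miss cost rises by only about 30%, which is consistent with the modest degradation reported for the parallel science code.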


high performance computer architecture | 1995

An initial evaluation of the Convex SPP-1000 for earth and space science applications

Thomas L. Sterling; Daniel Savarese; Phillip Merkey; Jeffrey P. Gardner

The Convex SPP-1000, the most recent SPC, is distinguished by a true global shared memory capability based on the first commercial version of directory-based cache coherence mechanisms and the SCI protocol. The system was evaluated at NASA/GSFC in the beta-test environment using three classes of operational experiments targeting Earth and space science applications. A multiple-program workload tested job-stream-level parallelism. Synthetic programs measured the overhead costs of barrier, fork-join, and message passing synchronization primitives. An efficient tree-code version of an N-body simulation revealed scaling properties and measured overall efficiency. This paper presents the results of this study and provides the earliest published evaluation of this new scalable architecture.
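
The synthetic programs themselves are not reproduced in the abstract; the sketch below shows one common way such a barrier-overhead measurement is written today using MPI, as an illustration of the kind of experiment described rather than the code used in the study.

    /* barrier_overhead.c: estimate the average cost of a global barrier.
     * Illustrative sketch only; not the synthetic programs from the study.
     * Build: mpicc barrier_overhead.c -o barrier_overhead
     * Run:   mpirun -np <N> ./barrier_overhead
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;              /* repetitions to average out timer noise */
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);          /* align all ranks before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("average barrier cost: %.2f microseconds\n",
                   1e6 * (t1 - t0) / iters);

        MPI_Finalize();
        return 0;
    }

Analogous timing loops around a fork-join construct or a round-trip message exchange yield the other two classes of overhead figures mentioned above.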


parallel computing | 1996

A Quantitative Approach for Architecture-Invariant Parallel Workload Characterization

Abdullah I. Meajil; Tarek El-Ghazawi; Thomas L. Sterling

Experimental design of parallel computers calls for quantifiable methods to compare and evaluate the requirements of different workloads within an application domain. Such metrics can help establish the basis for scientific design of parallel computers driven by application needs, to optimize performance-to-cost. In this work, a parallelism-based framework is presented for representing and comparing workloads based on the way they would exercise parallel machines. This method is architecture-invariant and can be used effectively for the comparison of workloads and assessing resource requirements. Our workload characterization is derived from the parallel instruction centroid and parallel workload similarity. The centroid is a workload approximation which captures the type and amount of parallel work generated by the workload on average. The centroid is an efficient measure which aggregates average parallelism, instruction mix, and critical path length. When combined with abstracted information about communication requirements, the result is a powerful tool for understanding the requirements of workloads and their potential performance on target parallel machines. The parallel workload similarity is based on measuring the normalized Euclidean distance (ned) between workload centroids, which provides an efficient means of comparing workloads. This provides the basis for quantifiable analysis of workloads to make informed decisions on the composition of parallel benchmark suites. It is shown that this workload characterization method outperforms comparable ones in accuracy, as well as in time and space requirements. Analysis of the NAS Parallel Benchmark workloads and their performance is presented to demonstrate some of the applications and insight provided by this framework. The parallel-instruction workload model is used to study the similarities among the NAS Parallel Benchmark workloads in a quantitative manner. The results confirm that the workloads in NPB represent a wide range of non-redundant benchmarks with different characteristics.
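
The abstract names the centroid and the normalized Euclidean distance but does not reproduce their definitions; the expression below is one common formulation of a normalized Euclidean distance between two centroid vectors, given purely as an illustration (the paper's exact normalization may differ), with c_a and c_b the centroids of workloads W_a and W_b and s_i a per-dimension normalization factor.

    ned(W_a, W_b) = sqrt( sum_i ( (c_{a,i} - c_{b,i}) / s_i )^2 )

Each centroid dimension i would hold a quantity such as average parallelism, the fraction of a given instruction type, or critical path length; normalizing per dimension keeps quantities with different units comparable before the distance is taken.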


conference on high performance computing (supercomputing) | 1995

A Performance Evaluation of the Convex SPP-1000 Scalable Shared Memory Parallel Computer

Thomas L. Sterling; Daniel Savarese; Peter MacNeice; Kevin Olson; Clark Mobarry; Bruce Fryxell; Phillip Merkey

The Convex SPP-1000 is the first commercial implementation of a new generation of scalable shared memory parallel computers with full cache coherence. It employs a hierarchical structure of processing, communication, and memory name-space management resources to provide a scalable NUMA environment. Ensembles of 8 HP PA-RISC 7100 microprocessors employ an internal crossbar switch and a directory-based cache coherence scheme to provide a tightly coupled SMP. Up to 16 processing ensembles are interconnected by a 4-ring network incorporating a full hardware implementation of the SCI protocol for a full system configuration of 128 processors. This paper presents the findings of a set of empirical studies using both synthetic test codes and full applications for the Earth and space sciences to characterize the performance properties of this new architecture. It is shown that the overhead and latencies of global primitive mechanisms, while low in absolute time, are significantly more costly than similar functions local to an individual processor ensemble.


Digest of Papers. Compcon Spring | 1993

Findings from the Pasadena Workshop on HPC software technology

Thomas L. Sterling

The Pasadena Workshop on System Software and Tools for High Performance Computing (HPC) Environments was held at the Jet Propulsion Laboratory in April 1992. Over a hundred experts in related fields from industry, academia, and government were invited to participate in this three-day forum to assess the current status of software technology in support of HPC systems. The goal was to provide a basis from which new directions in research and development for software technology could be established to enable and accelerate the effective application of MPPs to Grand Challenge problems. Attention was given both to immediate practical considerations concerning current tools and to longer-term issues related to future teraFLOPS computing. The author summarizes the findings and their implications for future directions in HPC software technology developments.


international conference on parallel processing | 1995

BEOWULF: A Parallel Workstation for Scientific Computation.

Thomas L. Sterling; Daniel Savarese; Donald J. Becker; John E. Dorband; Udaya A. Ranawake; Charles V. Packer

Collaboration


Dive into Thomas L. Sterling's collaborations.

Top Co-Authors

Daniel Savarese (California Institute of Technology)
Donald J. Becker (Goddard Space Flight Center)
Phillip Merkey (Goddard Space Flight Center)
John E. Dorband (Goddard Space Flight Center)
Kevin Olson (George Mason University)
Abdullah I. Meajil (George Washington University)
Chance Reschke (Goddard Space Flight Center)
Clark Mobarry (Goddard Space Flight Center)
D. Ridge (Goddard Space Flight Center)