Publication


Featured research published by Thomas M. Stricker.


acm symposium on parallel algorithms and architectures | 1994

An architecture for optimal all-to-all personalized communication

Susan Hinrichs; Corey Kosak; David R. O'Hallaron; Thomas M. Stricker; Riichiro Take

In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general-purpose, asynchronous message passing routers.

We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures.

The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for the evaluation of the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec, or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.
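The round structure that makes AAPC congestion-free at the endpoints can be sketched abstractly. This is a generic round-robin schedule for illustration only (the function name is hypothetical), not the paper's torus-specific partitioning or its switching hardware:

```python
def aapc_schedule(n):
    """Round-robin schedule for all-to-all personalized communication:
    in round k, node i sends its unique packet to node (i + k) % n.
    Each of the n-1 rounds is a perfect matching, so no two nodes
    target the same destination within a round."""
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]

# Every round pairs each sender with a distinct receiver.
for rnd in aapc_schedule(4):
    assert len({dst for _, dst in rnd}) == 4
```

The hard part the paper addresses is layered on top of such a schedule: mapping each round onto shortest paths of a 2-D torus so that all links stay busy.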


international symposium on computer architecture | 1995

Optimizing memory system performance for communication in parallel computers

Thomas M. Stricker; Thomas R. Gross

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e. blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth. Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.
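A cost model in the spirit of the copy-transfer model can be sketched as follows; the function name and the sample bandwidth figures are illustrative assumptions, not numbers from the paper:

```python
def effective_bandwidth(stage_bandwidths):
    """Effective bandwidth of a transfer performed as a sequence of
    copy steps (e.g. local memory -> network -> remote memory), each
    running at its own rate. If the copies are not overlapped, the
    per-byte costs add up:  1 / B_eff = sum(1 / B_i)."""
    return 1.0 / sum(1.0 / b for b in stage_bandwidths)

# Hypothetical example: 400 MB/s local copy, 200 MB/s network link,
# 400 MB/s remote copy. The memory system halves the achievable rate.
print(round(effective_bandwidth([400, 200, 400])))  # -> 100
```

The point the model captures is visible even in this toy form: the network link alone would allow 200 MB/s, but the two memory-system copies pull the effective rate down to 100 MB/s.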


international conference on supercomputing | 1995

Decoupling synchronization and data transfer in message passing systems of parallel computers

Thomas M. Stricker; James M. Stichnoth; David R. O'Hallaron; Susan Hinrichs; Thomas R. Gross

Synchronization is an important issue for the design of a scalable parallel computer, and some systems include special hardware support for control messages or barriers. The cost of synchronization has a high impact on the design of the message passing (communication) services. In this paper, we investigate three different communication libraries that are tailored toward the synchronization services available: (1) a version of generic send-receive message passing (PVM), which relies on traditional flow control and buffering to synchronize the data transfers; (2) message passing with pulling, i.e. a message is transferred only when the recipient is ready and requests it (as, e.g., used in NX for large messages); and (3) decoupled direct deposit message passing, which uses separate, global synchronization to ensure that nodes send messages only when the message data can be deposited directly into the final destination in the memory of the remote recipient. Measurements of these three styles on a Cray T3D demonstrate the benefits of decoupled message passing with direct deposit. The performance advantage of this style is made possible by (1) preemptive synchronization to avoid unnecessary copies of the data, (2) high-speed barrier synchronization, and (3) improved congestion control in the network. The designers of the communication systems of future parallel computers are therefore strongly encouraged to provide good synchronization facilities in addition to high throughput data transfers to support high performance message passing.
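The decoupled style, global synchronization first and then direct deposit into final destinations, can be sketched with threads standing in for nodes. This is an illustrative simulation with hypothetical names, not the Cray T3D library:

```python
from threading import Barrier, Thread

def direct_deposit(buffers, messages, barrier):
    """Decoupled message passing sketch: one global barrier guarantees
    that every receive buffer is ready, after which each sender writes
    its payload straight into the destination slot -- no intermediate
    buffering and no per-message flow control."""
    def node(rank):
        barrier.wait()                        # synchronize once, globally
        for dst, offset, data in messages[rank]:
            buffers[dst][offset] = data       # deposit into final location
    threads = [Thread(target=node, args=(r,)) for r in range(len(buffers))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A two-node exchange would pass `buffers = [[None], [None]]`, one message per node, and `Barrier(2)`; after the call each payload sits at its final offset without ever having been copied into a staging buffer.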


international conference on cluster computing | 2000

Partition repositories for partition cloning: OS-independent software maintenance in large clusters of PCs

Felix Rauch; Christian Kurmann; Thomas M. Stricker

As a novel approach to software maintenance in large clusters of PCs requiring multiple OS installations, we implemented partition cloning and partition repositories, together with a set of OS-independent tools for software maintenance that operate on entire partitions, thus providing a clean abstraction of all operating system configuration states. We identify the evolution of software installations (different releases) and the customization of installed systems (different machines) as two orthogonal axes. Using this analysis, we devise partition repositories as an efficient, incremental storage scheme to maintain all necessary partition images for versatile, large clusters of PCs. We evaluate our approach with a release history of sample images used in the Patagonia multi-purpose clusters at ETH Zurich, including several Linux, Windows NT and Oberon images. The study includes quantitative data that shows the viability of the OS-independent approach of working with entire partitions and investigates some relevant tradeoffs, e.g., between difference granularity and compression block size.
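The incremental storage idea can be sketched as content-addressed block deduplication; this is a minimal illustration (function and variable names are hypothetical), not the actual repository format:

```python
import hashlib

def store_image(image, block_size, repo):
    """Partition repository sketch: split a partition image into
    fixed-size blocks, key each block by its hash, and store only
    blocks not already present. A new image version then costs only
    the storage of the blocks that actually changed."""
    recipe = []
    for i in range(0, len(image), block_size):
        block = image[i:i + block_size]
        key = hashlib.sha256(block).hexdigest()
        repo.setdefault(key, block)      # deduplicate across versions
        recipe.append(key)
    return recipe                        # ordered keys rebuild the image
```

Storing two image versions that differ in a single block adds exactly one new block to the repository, which is the behavior that makes keeping a whole release history affordable. The block size chosen here corresponds to the difference-granularity tradeoff the paper investigates.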


high performance distributed computing | 2000

Speculative defragmentation - a technique to improve the communication software efficiency for Gigabit Ethernet

Christian Kurmann; Michel G. Muller; Felix Rauch; Thomas M. Stricker

Cluster platforms offer good computational performance, but they still cannot utilize the potential of Gbit/s communication technology. While the speed of the Ethernet has grown to 1 Gbit/s, the functionality and the architectural support in the network interfaces has remained the same for more than a decade, so that the memory system becomes a limiting factor. To sustain the raw network speed in applications, a zero-copy network interface architecture would be required, but, for all widely used stacks, a last copy is required for the (de)fragmentation of the transferred network packets, since Ethernet packets are smaller than a page size. Correctly defragmenting packets of various communication protocols in hardware is an extremely complex task. We therefore consider a speculative defragmentation technique that can eliminate the last defragmenting copy operation in zero-copy TCP/IP stacks on existing hardware. The payload of fragmented packets is separated from the headers and stored in a memory page that can be mapped directly to its final destination in user memory. To evaluate our ideas, we integrated a network interface driver with speculative defragmentation into an existing protocol stack and added well-known page remapping and fast buffer strategies. Measurements indicate that we can improve the performance for a Gigabit Ethernet over a standard Linux 2.2 TCP/IP stack by a factor of 1.5-2 for uninterrupted burst transfers. Furthermore, our study demonstrates good speculation success rates for a database and a scientific application code on a cluster of PCs.
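The speculative idea, assume the common case and verify afterwards, can be sketched in a few lines. This is an abstract illustration with hypothetical names, not the actual driver, which operates on pages and NIC descriptors:

```python
def speculative_defragment(fragments, expected_flow):
    """Speculative defragmentation sketch: payloads are laid out
    contiguously as if every fragment belongs to `expected_flow`
    (the common case on a burst transfer); the headers are checked
    only afterwards. On a mispredict the optimistic result is
    discarded and the caller falls back to the classic copying path."""
    page = bytearray()
    for header, payload in fragments:
        page += payload                  # optimistic placement
    if all(h == expected_flow for h, _ in fragments):
        return bytes(page), True         # speculation succeeded
    return None, False                   # mispredict: use fallback path
```

In the real stack the optimistic placement is done by the NIC driver into a page that can be remapped into user memory, so a successful speculation eliminates the last copy entirely; the fallback preserves correctness for interleaved flows.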


distributed memory computing conference | 1991

Message Routing on Irregular 2D-Meshes and Tori

Thomas M. Stricker

Wormhole message routing is supported by the communication hardware of several distributed memory machines. This particular method of message routing has numerous advantages but creates the problem of a routing deadlock. When long messages compete for the same channels in the network, some messages will be blocked until the first message is fully consumed by the processor at its destination. A deadlock occurs if a set of messages mutually blocks and no message can progress towards its destination. Most previously known deadlock-free routing schemes are designed to work on regular binary hypercubes, yet regular hypercubes and meshes are just special cases of networks. These routing schemes do not provide enough flexibility to deal with irregular 2-D tori and with attached auxiliary cells, which can be found on many newer parallel systems. To handle irregular topologies elegantly, a simple proof is necessary to verify the router code. The new proof given in this report is carried out directly on the network graph. It is constructive in the sense that it reveals the design options for dealing with irregularities and shows how the additional flexibility can be used to achieve better load balancing. Based on the modified routing model, a set of deadlock-free router functions relevant to the iWarp system configurations are described and proven to be correct.
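For the regular case that the paper generalizes, the textbook deadlock-free router is dimension-order (XY) routing; a minimal sketch, not the paper's irregular-torus router functions:

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a regular 2-D mesh: correct the
    X coordinate first, then Y. Because no packet ever turns from a Y
    channel back onto an X channel, the channel-dependency graph is
    acyclic, which is the classic argument ruling out routing deadlock."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Irregular tori with auxiliary cells break the clean dimension structure this argument relies on, which is why the paper carries its deadlock-freedom proof out directly on the network graph instead.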


high-performance computer architecture | 1997

Global address space, non-uniform bandwidth: a memory system performance characterization of parallel systems

Thomas M. Stricker; Thomas R. Gross

Many parallel systems offer a simple view of memory: all storage cells are addressed uniformly. Despite a uniform view of the memory, the machines differ significantly in their memory system performance (and may offer slightly different consistency models). Cached and local memory accesses are much faster than remote read accesses to data generated by another processor or remote writes to data intentionally pushed to memories close to another processor. The bandwidth from/to cache and local memory can be an order of magnitude (or more) higher than the bandwidth to/from remote memory. The situation is further complicated by the heavy influence of the access pattern (i.e. the spatial locality of reference) on both the local and the remote memory system bandwidth. In these modern machines, a compiler for a parallel system is faced with a number of options to accomplish a data transfer most efficiently. The decision for the best option requires a cost benefit model, obtained in an empirical evaluation of the memory system performance. We evaluate three DEC Alpha based parallel systems to demonstrate the practicality of this approach. The common DEC Alpha processor architecture facilitates a direct comparison of memory system performance. These systems are the DEC 8400, the Cray T3D, and the Cray T3E. The three systems differ in their clock speed, their scalability and in the amount of coherency they provide.
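The empirical method is simple in principle: time copies under varying access patterns and derive bandwidth figures. A toy microbenchmark in that spirit (the absolute numbers reflect the Python interpreter, not any of the machines in the paper; the point is the method):

```python
import time

def copy_bandwidth(n_bytes, stride=1):
    """Time a (possibly strided) copy of a buffer and report the rate
    in MB/s. stride=1 is a contiguous copy; larger strides mimic the
    poorer spatial locality whose cost the paper characterizes."""
    src = bytearray(n_bytes)
    t0 = time.perf_counter()
    dst = bytes(src[::stride])           # the measured copy operation
    elapsed = time.perf_counter() - t0
    return (len(dst) / 1e6) / elapsed

# Contiguous vs. strided rates sampled the same way a cost-benefit
# model for transfer options would be populated.
rates = {s: copy_bandwidth(1_000_000, s) for s in (1, 8)}
```

A real characterization would repeat each measurement, separate local from remote memory, and vary the pattern (blocked, strided, indexed) to fill in the full cost model.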


acm symposium on parallel algorithms and architectures | 1992

Supporting the hypercube programming model on mesh architectures (a fast sorter for iWarp tori)

Thomas M. Stricker

Many combinatorial problems have simple solutions for parallel processing on highly connected networks such as the butterfly or the hypercube, whereas the fastest processor-to-processor interconnections are realized in parallel machines with low dimensional mesh or torus topology. This paper presents a method for mapping binary hypercube algorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evaluate graph embeddings for mapping supercomputer communication networks. Our work was motivated by the need for fast library routines to do parallel sorting, fast Fourier transformation and processor synchronization. During the design effort of these building blocks, we developed and analyzed a new technique to support a hypercube network embedded onto a two dimensional torus. A direct implementation of the embedding is made possible by logical channels and pathways. A fast merge sorter based on the bitonic network serves as an example to show how a simple hypercube algorithm can outperform most of the asymptotically optimal mesh algorithms for practical machine sizes. In the conventional mesh computation model, processors are allowed to exchange one unit of data with a neighbor in each step. This model needs to be refined since modern mesh computers, such as the iWarp system, have hardware support for fast non-neighbor communication. The bitonic merge sort, a simple hypercube algorithm, contains a fair amount of fine grain parallelism not found in standard mesh algorithms. This form of parallelism includes pipelined communication, computation overlapped with communication, use of wide instruction words and operands directly read from the communication system through systolic gates. The measured sorting rate of more than 2 10 keys/sec on an iWarp torus with just 64 processors shows the excellent absolute performance of our approach. The performance results compare well with much larger parallel computers. In our analysis of the relative performance we compare our approach to different sorting methods on meshes. The mapped hypercube algorithm is shown to be best for a wide range of machine and problem sizes. For readers mainly interested in complexity results, our approach may seem somewhat surprising, but the analysis of the algorithm in an accurate model of the iWarp machine shows how good speed and good parallel efficiency are obtained from both forms of parallelism, large and fine grain.

(The research was supported in part by the Defense Advanced Research Projects Agency, US Department of Defense, monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251, and in part by the Office of Naval Research under Contracts N00014-87-K-0385 and N00014-87-K-0533.)
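The bitonic merge sort at the heart of the paper is a pure hypercube algorithm: at distance d, element i is compare-exchanged with its partner i XOR d, and it is exactly this pairing pattern that gets mapped onto the torus via logical channels. A sequential sketch of the algorithm (input length must be a power of two):

```python
def bitonic_sort(keys, ascending=True):
    """Bitonic merge sort: recursively build a bitonic sequence
    (first half ascending, second half descending), then merge it."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    half = n // 2
    first = bitonic_sort(keys[:half], True)
    second = bitonic_sort(keys[half:], False)
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(seq, ascending):
    n = len(seq)
    if n == 1:
        return seq
    seq = list(seq)
    half = n // 2
    for i in range(half):        # compare-exchange with partner i ^ half
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return (_bitonic_merge(seq[:half], ascending)
            + _bitonic_merge(seq[half:], ascending))
```

On a parallel machine each compare-exchange of the inner loop becomes one communication step between hypercube neighbors, which is what the embedding must realize efficiently on the torus.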


Cluster Computing | 2001

Speculative Defragmentation – Leading Gigabit Ethernet to True Zero-Copy Communication

Christian Kurmann; Felix Rauch; Thomas M. Stricker

Clusters of Personal Computers (CoPs) offer excellent compute performance at a low price. Workstations with “Gigabit to the Desktop” can give workers access to a new range of multimedia applications. Networking PCs with their modest memory subsystem performance requires either extensive hardware acceleration for protocol processing or, alternatively, a highly optimized software system to reach full Gigabit/sec speeds in applications. So far this could not be achieved, since correctly defragmenting packets of the various communication protocols in hardware remains an extremely complex task and has prevented a clean “zero-copy” solution in software. We propose and implement a defragmenting driver based on the same speculation techniques that are commonly used to improve processor performance with instruction-level parallelism. With a speculative implementation we are able to eliminate the last copy of a TCP/IP stack even on simple, existing Ethernet NIC hardware. We integrated our network interface driver into the Linux TCP/IP protocol stack and added the well-known page remapping and fast buffer strategies to reach an overall zero-copy solution. An evaluation with measurement data indicates three trends: (1) for Gigabit Ethernet the CPU load of communication processing can be reduced significantly, (2) speculation succeeds in most cases, and (3) the performance for burst transfers can be improved by a factor of 1.5–2 over the standard communication software in Linux 2.2. Finally, based on our implementation, we suggest simple hardware improvements to increase the speculation success rate.


international parallel and distributed processing symposium | 2002

Performance characterization of a molecular dynamics code on PC clusters: is there any easy parallelism in CHARMM?

Egon Perathoner; Andrea Cavalli; Amedeo Caflisch; Thomas M. Stricker

The molecular dynamics code CHARMM is a popular research tool for computational biology. An increasing number of researchers are currently looking for affordable and adequate platforms to execute CHARMM or similar codes. To address this need, we analyze the resource requirements of a CHARMM molecular dynamics simulation on PC clusters with a particle mesh Ewald (PME) treatment of long range electrostatics, and investigate the scalability of the short-range interactions and PME separately. We look at the workload characterization and the performance gain of CHARMM with different network technologies and different software infrastructures and show that the performance depends more on the software infrastructures than on the hardware components. In the present study, powerful communication systems like Myrinet deliver performance that comes close to the MPP supercomputers of the past decade, but improved scalability can also be achieved with better communication system software like SCore without the additional hardware cost. The experimental method of workload characterization presented can be easily applied to other codes. The detailed performance figures of the breakdown of the calculation into computation, communication and synchronization allow one to derive good estimates about the benefits of moving applications to novel computing platforms such as widely distributed computers (grid).
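The paper's breakdown into computation, communication and synchronization lends itself to a back-of-envelope scalability estimate; a minimal sketch, assuming for illustration that only the computation term scales with the node count:

```python
def estimated_speedup(t_comp, t_comm, t_sync, p):
    """Toy model in the spirit of the paper's breakdown: per-step
    computation time t_comp divides across p nodes, while the
    communication and synchronization terms (assumed fixed here)
    do not, and therefore bound the achievable speedup."""
    return t_comp / (t_comp / p + t_comm + t_sync)
```

With, say, 100 time units of computation and 10 units of communication per step, 10 nodes yield only a 5x speedup, illustrating why the measured scalability depended so strongly on the communication software infrastructure.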

Collaboration


Dive into Thomas M. Stricker's collaboration.

Top Co-Authors

Christian Kurmann
École Polytechnique Fédérale de Lausanne

Felix Rauch
École Polytechnique Fédérale de Lausanne

Susan Hinrichs
Carnegie Mellon University