Craig B. Stunkel | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Craig B. Stunkel is active.

Explore More

Publication

Featured researches published by Craig B. Stunkel.

Ibm Systems Journal | 1995

The SP2 high-performance switch

Craig B. Stunkel; Dennis G. Shea; B. Aball; M. G. Atkins; Carl A. Bender; D. G. Grice; Peter H. Hochschild; Douglas J. Joseph; Ben J. Nathanson; R. Swetz; R. F. Stucke; M. Tsao; P.R. Varker

The heart of an IBM SP2™ system is the HighPerformance Switch, which is a low-latency, highbandwidth switching network that binds together RISC System/6000® processors. The switch incorporates a unique combination of topology and architectural features to scale aggregate bandwidth, enhance reliability, and simplify cabling. It is a bidirectional multistage interconnect subsystem driven by a common oscillator, and delivers both data and service packets over the same links. Switching elements contain a dynamically allocated shared buffer for storing blocked packet flits. The switch is constructed primarily from switching elements (the Vulcan switch chip) and adapters (the SP2 communication adapter). The SP2 communication adapter uses a variety of techniques to improve bandwidth and offload communication tasks from the node processor. This paper examines the switch architecture and presents an overview of its support software.

conference on high performance computing (supercomputing) | 2005

On the Feasibility of Optical Circuit Switching for High Performance Computing Systems

Kevin J. Barker; Alan F. Benner; Raymond R. Hoare; Adolfy Hoisie; Darren J. Kerbyson; Dan Li; Rami G. Melhem; Ramakrishnan Rajamony; Eugen Schenfeld; Shuyi Shao; Craig B. Stunkel; Peter A. Walker

The interconnect plays a key role in both the cost and performance of large-scale HPC systems. The cost of future high-bandwidth electronic interconnects mushrooms due to expensive optical transceivers needed between electronic switches. We describe a potentially cheaper and more power-efficient approach to building high-performance interconnects. Through empirical analysis of HPC applications, we find that the bulk of inter-processor communication (barring collectives) is bounded in degree and changes very slowly or never. Thus we propose a two-network interconnect: An Optical Circuit Switching (OCS) network handling long-lived bulk data transfers, using optical switches; and a secondary lower-bandwidth Electronic Packet Switching (EPS) network. An OCS could be significantly cheaper, as it uses fewer optical transceivers than an electronic network. Collectives and transient communication packets traverse the electronic network. We present compiler techniques and dynamic run-time policies, using this two-network interconnect. Simulation results show that our approach provides high performance at low cost.

ieee international conference on high performance computing data and analytics | 1994

The SP1 high-performance switch

Craig B. Stunkel; Dennis G. Shea; D.G. Grice; Peter H. Hochschild; M. Tsao

The IBM scalable POWERparallel systems 9076 SP1 connects RISC System/6000 processors via a communication network called the high-performance switch. This switch-based upon the Vulcan parallel processor incorporates a number of unusual features to enhance reliability, diagnose faults, and simplify cabling. This paper examines the SP1 switch architecture and implementation and overviews the switch support software. The switch is a bidirectional MIN, and provides at least 4 usable redundant paths for most pairs of communicating nodes.<<ETX>>

international parallel processing symposium | 1994

Architecture and implementation of Vulcan

Craig B. Stunkel; Monty M. Denneau; Ben J. Nathanson; Dennis G. Shea; Peter H. Hochschild; M. Tsao; Bulent Abali; Douglas J. Joseph; P.R. Varker

IBMs recently announced Scalable POWERparallel family of systems is based upon the Vulcan architecture, and the currently available 9076 SP1 parallel system utilizes fundamental Vulcan technology. The experimental Vulcan parallel processor is designed to scale to many thousands of microprocessor-based nodes. To support a machine of this size, the nodes and network incorporate a number of unusual features to scale aggregate bandwidth, enhance reliability, diagnose faults, and simplify cabling. The multistage Vulcan network is a unified data and service network driven by a single oscillator. An attempt is made to detect all network errors via cyclic redundancy checking (CRC) and component shadowing. Switching elements contain a dynamically allocated shared buffer for storing blocked packet flits from any input port. This paper describes the key elements of Vulcans hardware architecture and implementation details of the Vulcan prototype.<<ETX>>

international symposium on computer architecture | 1997

Implementing multidestination worms in switch-based parallel systems: architectural alternatives and their impact

Craig B. Stunkel; Rajeev Sivaram; Dhabaleswar K. Panda

Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed. Both of these implementations are evaluated against each other as well as against a software-based scheme using the central buffer organization. Simulation experiments under a range of traffic (multiple multicast, bimodal, varying degree of multicast, and message length) and system size are used for evaluation. The study demonstrates the superiority of the central-buffer-based switch architecture. It also indicates that under bimodal traffic the central-buffer-based hardware multicast implementation affects background unicast traffic less adversely compared to a software-based multicast implementation. Thus, multidestination message passing can easily be applied to switch-based parallel systems to deliver good collective communication performance.

IEEE Computer | 1991

Address tracing for parallel machines

Craig B. Stunkel; Bob Janssens; W.K. Fuchs

Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.<<ETX>>

IEEE Transactions on Parallel and Distributed Systems | 1998

Efficient broadcast and multicast on multistage interconnection networks using multiport encoding

Rajeev Sivaram; Dhabaleswar K. Panda; Craig B. Stunkel

This paper proposes anew approach for implementing fast multicast and broadcast in unidirectional and bidirectional multistage interconnection networks (MINs) with multiport encoded multidestination worms. For a MIN with n stages, such worms use n header flits each. One flit is used for each stage of the network and it indicates the output ports to which a multicast message needs to be replicated. A multiport encoded worm with (d/sub 1/, d/sub 2/..., d/sub n/, 1/spl les/d/sub i//spl les/k) degrees of replication for the respective stages is capable of covering (d/sub 1//spl times/d/sub x//spl times/.../spl times/d/sub n/) destinations with a single communication start-up. In this paper, a switch architecture is proposed for implementing multidestination worms without deadlock. Three grouping algorithms of varying complexity are presented to derive the associated multiport encoded worms for a multicast to an arbitrary set of destinations. Using these worms, a multinomial tree-based scheme is proposed to implement the multicast. This scheme significantly reduces broadcast/multicast latency compared to schemes using unicast messages. Simulation studies for both unidirectional and bidirectional MIN systems indicate that improvement in broadcast/multicast latency up to a factor of four is feasible using the new approach. Interestingly, this approach is able to implement multicast with reduced latency as the number of destinations increases beyond a certain number. Compared to implementing unicast messages, this approach requires little additional logic at the switches. Thus, the scheme demonstrates significant potential for implementing efficient collective communication operations on current and future MIN-based systems.

international conference on parallel processing | 1996

Adaptive source routing in multistage interconnection networks

Yucel Aydogan; Craig B. Stunkel; Cevdet Aykanat; Bülent Abali

We describe the adaptive source routing (ASR) method which is a first attempt to combine adaptive routing and source routing methods. In ASR, the adaptivity of each packet is determined at the source processor. Every packet can be routed in a fully adaptive or partially adaptive or non-adaptive manner, all within the same network at the same time. We evaluate and compare performance of the proposed adaptive source routing networks and oblivious routing networks by simulations. We also describe a route generation algorithm that determines maximally adaptive routes in multistage networks.

international parallel processing symposium | 1997

A reliable hardware barrier synchronization scheme

Rajeev Sivaram; Craig B. Stunkel; Dhabaleswar K. Panda

Barrier synchronization is a crucial operation for parallel systems. Many schemes have been proposed in the literature to achieve fast barrier synchronization through software, hardware, or a combination of these mechanisms. However few of these schemes emphasize fault-tolerant barrier operations. In this paper, we describe inexpensive support that can be added to network switches for achieving reliable hardware-based barrier synchronization while recovering from lost or corrupted messages. Necessary modifications to the switch architecture and the associated fault-tolerant message-passing protocols are presented. The protocols are optimized for the no-fault case while providing means to detect the failure of any step of the operation and to recover from it. The proposed scheme shows significant potential for use in parallel systems, especially the emerging systems based on networks of workstations.

merged international parallel processing symposium and symposium on parallel and distributed processing | 1998

HIPIQS: a high-performance switch architecture using input queuing

Rajeev Sivaram; Craig B. Stunkel; Dhabaleswar K. Panda

Switch-based interconnects are used in a number of application domains including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput, or require fairly complex and centralized arbitration that increases latency. We present a new input queue-based switch architecture called HIPIQS (High-Performance Input-Queued Switch). It offers low latency for a range of message sizes, and provides throughput comparable to that of output queuing approaches. Furthermore, it allows simple and distributed arbitration. HIPIQS uses a dynamically allocated multi-queue organization, pipelined access to multi-bank input buffers, and small cross-point buffers, to deliver high performance. Our simulation results show that HIPIQS can deliver performance close to that of output queuing approaches over a range of message sizes, system sizes, and traffic. The switch architecture can therefore be used to build high performance switches that are useful for both parallel system interconnects and for building computer networks.

Explore More