
Publication


Featured research published by Douglas J. Joseph.


IBM Systems Journal | 1995

The SP2 high-performance switch

Craig B. Stunkel; Dennis G. Shea; B. Abali; M. G. Atkins; Carl A. Bender; D. G. Grice; Peter H. Hochschild; Douglas J. Joseph; Ben J. Nathanson; R. Swetz; R. F. Stucke; M. Tsao; P.R. Varker

The heart of an IBM SP2™ system is the High-Performance Switch, which is a low-latency, high-bandwidth switching network that binds together RISC System/6000® processors. The switch incorporates a unique combination of topology and architectural features to scale aggregate bandwidth, enhance reliability, and simplify cabling. It is a bidirectional multistage interconnect subsystem driven by a common oscillator, and delivers both data and service packets over the same links. Switching elements contain a dynamically allocated shared buffer for storing blocked packet flits. The switch is constructed primarily from switching elements (the Vulcan switch chip) and adapters (the SP2 communication adapter). The SP2 communication adapter uses a variety of techniques to improve bandwidth and offload communication tasks from the node processor. This paper examines the switch architecture and presents an overview of its support software.
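The dynamically allocated shared buffer mentioned above lets any input port claim space from one common pool, so buffering follows demand instead of being statically split per port. The toy model below illustrates the idea only; all names and sizes are assumptions, not the Vulcan chip's actual interface.

```python
from collections import deque

class SharedBufferSwitch:
    """Toy model of a switch element whose input ports draw flit slots
    from one dynamically allocated shared pool (illustrative only)."""

    def __init__(self, num_ports, pool_slots):
        self.free_slots = pool_slots                  # slots left in the shared pool
        self.queues = [deque() for _ in range(num_ports)]

    def enqueue_flit(self, port, flit):
        """Buffer a blocked flit if any shared slot remains."""
        if self.free_slots == 0:
            return False                              # pool exhausted: flit stalls upstream
        self.free_slots -= 1
        self.queues[port].append(flit)
        return True

    def dequeue_flit(self, port):
        """Forward a flit, returning its slot to the shared pool."""
        if not self.queues[port]:
            return None
        self.free_slots += 1
        return self.queues[port].popleft()

sw = SharedBufferSwitch(num_ports=4, pool_slots=3)
assert sw.enqueue_flit(0, "a") and sw.enqueue_flit(0, "b") and sw.enqueue_flit(2, "c")
assert not sw.enqueue_flit(1, "d")   # pool exhausted, even though port 1's queue is empty
sw.dequeue_flit(0)
assert sw.enqueue_flit(1, "d")       # freed slot is available to any port
```

The contrast with fixed per-port buffers is visible in the third assertion: a port with an empty queue can still be refused when other ports have consumed the whole pool.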


International Parallel Processing Symposium | 1994

Architecture and implementation of Vulcan

Craig B. Stunkel; Monty M. Denneau; Ben J. Nathanson; Dennis G. Shea; Peter H. Hochschild; M. Tsao; Bulent Abali; Douglas J. Joseph; P.R. Varker

IBM's recently announced Scalable POWERparallel family of systems is based upon the Vulcan architecture, and the currently available 9076 SP1 parallel system utilizes fundamental Vulcan technology. The experimental Vulcan parallel processor is designed to scale to many thousands of microprocessor-based nodes. To support a machine of this size, the nodes and network incorporate a number of unusual features to scale aggregate bandwidth, enhance reliability, diagnose faults, and simplify cabling. The multistage Vulcan network is a unified data and service network driven by a single oscillator. An attempt is made to detect all network errors via cyclic redundancy checking (CRC) and component shadowing. Switching elements contain a dynamically allocated shared buffer for storing blocked packet flits from any input port. This paper describes the key elements of Vulcan's hardware architecture and implementation details of the Vulcan prototype.
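Cyclic redundancy checking, as used above for network error detection, appends a checksum to each packet so the receiver can detect corruption in flight. A minimal software sketch using Python's standard `zlib.crc32` (the Vulcan hardware implements its own link-level CRC in logic, not this software form):

```python
import zlib

def add_crc(payload: bytes) -> bytes:
    """Append a 32-bit CRC trailer computed over the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_crc(packet: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, trailer = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

pkt = add_crc(b"service packet")
assert check_crc(pkt)                            # intact packet passes
corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:]     # flip one bit "in transit"
assert not check_crc(corrupted)                  # single-bit error is detected
```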


IBM Journal of Research and Development | 2001

High-throughput coherence control and hardware messaging in Everest

Ashwini K. Nanda; Anthony-Trung Nguyen; Maged M. Michael; Douglas J. Joseph

Everest is an architecture for high-performance cache coherence and message passing in partitionable distributed shared-memory systems that use commodity symmetric multiprocessors (SMPs) as building blocks. The Everest architecture is intended for use in designing future IBM servers using either PowerPC® or Intel® processors. Everest provides high-throughput protocol handling in three dimensions: multiple protocol engines, split request-response handling, and pipelined design. It employs an efficient directory subsystem design that matches the directory access throughput requirement of high-performance protocol engines. A new directory design called the complete and concise remote (CCR) directory, which contains roughly the same amount of memory as a sparse directory but retains the benefits of a full-map directory, is used. Everest also supports system partitioning and provides a tightly integrated facility for secure, high-performance communication between partitions. Simulation results for both technical and commercial applications exploring some of the Everest design space are presented. The results show that the features of the Everest architecture can have significant impact on the performance of distributed shared-memory servers.
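The CCR directory's key idea, as the abstract describes it, is to be complete (a full sharer bit-vector, so invalidations never over-approximate) yet concise (state is kept only for lines cached remotely). The sketch below illustrates that idea; all names and fields are illustrative assumptions, not the Everest hardware layout.

```python
class CCRDirectory:
    """Toy sketch of a complete-and-concise-remote directory: track only
    lines cached by *remote* nodes, but track those with a full sharer
    bit-vector (illustrative only)."""

    def __init__(self, home_node, num_nodes):
        self.home = home_node
        self.num_nodes = num_nodes
        self.entries = {}          # line address -> bit-vector of remote sharers

    def record_share(self, addr, node):
        """Note that `node` has cached line `addr`."""
        if node == self.home:
            return                 # purely local caching consumes no directory state
        self.entries[addr] = self.entries.get(addr, 0) | (1 << node)

    def remote_sharers(self, addr):
        """Exact list of remote nodes to invalidate on a write to `addr`."""
        bits = self.entries.get(addr, 0)
        return [n for n in range(self.num_nodes) if bits >> n & 1]

d = CCRDirectory(home_node=0, num_nodes=4)
d.record_share(0x100, 0)           # local access: no entry allocated
d.record_share(0x100, 2)
d.record_share(0x100, 3)
assert d.remote_sharers(0x100) == [2, 3]   # complete: exactly the remote sharers
assert d.remote_sharers(0x200) == []       # concise: untracked lines cost nothing
```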


International Conference on Supercomputing | 2018

On Optimizing Distributed Tucker Decomposition for Sparse Tensors

Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Prakash Murali; Shivmaran S. Pandian; Yogish Sabharwal; Dheeraj Sreedhar

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher-dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph-based scheme typically results in faster HOOI execution, its complexity means that the time taken for determining the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor, and as a result the scheme achieves better performance on the overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on the HOOI execution time by a factor of up to 3x. On the other hand, its distribution time is comparable to the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.
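To give intuition for why a lightweight distribution scheme can be computed in well under one HOOI iteration, the sketch below greedily assigns tensor slices to ranks by nonzero count, a common proxy for per-rank FLOPs. This is a generic balancing heuristic for illustration only, not the paper's actual near-optimal scheme.

```python
def distribute_slices(nonzeros_per_slice, num_ranks):
    """Greedy lightweight distribution (illustrative heuristic): assign
    each slice of one tensor mode to the currently lightest rank,
    balancing nonzero counts as a proxy for per-rank FLOPs."""
    loads = [0] * num_ranks
    owner = {}
    # Placing heavy slices first tightens the greedy balance.
    for s in sorted(range(len(nonzeros_per_slice)),
                    key=lambda i: -nonzeros_per_slice[i]):
        r = loads.index(min(loads))       # lightest rank so far
        owner[s] = r
        loads[r] += nonzeros_per_slice[s]
    return owner, loads

owner, loads = distribute_slices([9, 1, 4, 4, 2], num_ranks=2)
assert sum(loads) == 20                   # every nonzero is assigned once
assert max(loads) - min(loads) <= 2       # near-balanced computational load
```

The cost is a single sort plus one pass over the slices, which is why such schemes run in a small fraction of a HOOI iteration, whereas hypergraph partitioning must optimize communication structure globally.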


International Parallel and Distributed Processing Symposium | 2017

On Optimizing Distributed Tucker Decomposition for Dense Tensors

Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Xing Liu; Prakash Murali; Yogish Sabharwal; Dheeraj Sreedhar

The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher-Order Orthogonal Iteration) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under the two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reduction in load and volume compared to prior heuristics, and provide up to 7x speed-up in the overall running time.
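The tensor-times-matrix (TTM) product that forms HOOI's core routine contracts one tensor mode with a factor matrix. A minimal single-node sketch in NumPy (the paper's contribution is the distributed strategy for this routine, which the sketch omits):

```python
import numpy as np

def ttm(tensor, matrix, mode):
    """Mode-`mode` tensor-times-matrix product: contracts the tensor's
    `mode` axis with the matrix's columns, so the result carries the
    matrix's row count along that axis."""
    t = np.tensordot(tensor, matrix, axes=(mode, 1))
    # tensordot appends the new axis last; move it back to position `mode`
    return np.moveaxis(t, -1, mode)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))   # a small dense 3-way tensor
U = rng.standard_normal((2, 5))      # factor matrix for mode 1
Y = ttm(X, U, mode=1)
assert Y.shape == (4, 2, 6)

# Sanity check against the matricized definition: Y_(1) = U @ X_(1)
X1 = np.moveaxis(X, 1, 0).reshape(5, -1)
Y1 = np.moveaxis(Y, 1, 0).reshape(2, -1)
assert np.allclose(Y1, U @ X1)
```

In HOOI, one such TTM is performed per mode per iteration, which is why the distribution of the tensor across ranks dominates both the computational load and the communication volume studied in the paper.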


Archive | 2002

Memory management offload for RDMA enabled network adapters

William Todd Boyd; Douglas J. Joseph; Michael Anthony Ko; Renato J. Recio


Archive | 2006

Receive queue device with efficient queue flow control, segment placement and virtualization mechanisms

William Todd Boyd; Jean Calvignac; Chih-Jen Chang; Douglas J. Joseph; Renato J. Recio


Archive | 2002

Remote direct memory access enabled network interface controller switchover and switchback support

William Todd Boyd; Douglas J. Joseph; Michael Anthony Ko; Renato J. Recio


Archive | 1999

Apparatus and method for partitioned memory protection in cache coherent symmetric multiprocessor systems

Hubertus Franke; Douglas J. Joseph


Archive | 2002

Split socket send queue apparatus and method with efficient queue flow control, retransmission and SACK support mechanisms

William Todd Boyd; Jean Calvignac; Chih-Jen Chang; Douglas J. Joseph; Renato J. Recio
