Publication


Featured research published by Henry S. Warren.


Communications of the ACM | 1975

A modification of Warshall's algorithm for the transitive closure of binary relations

Henry S. Warren

An algorithm is given for computing the transitive closure of a binary relation that is represented by a Boolean matrix. The algorithm is similar to Warshall's, although it executes faster for sparse matrices on most computers, particularly in a paging environment.
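The sketch below illustrates, in C, the row-OR formulation this abstract describes: the relation is kept as a bit matrix, and whenever bit (i, j) is set, row j is OR-ed into row i, one machine word at a time. Warren's variant makes two passes, first over the below-diagonal entries and then over those above. The matrix size and helper names are illustrative, not taken from the paper.

```c
#include <stdint.h>

#define N     256                /* number of elements; illustrative        */
#define WORDS ((N + 63) / 64)    /* 64-bit words per row of the bit matrix  */

/* Test bit j of a row. */
static int test_bit(const uint64_t *row, int j)
{
    return (row[j / 64] >> (j % 64)) & 1;
}

/* OR the source row into the destination row, one word at a time. */
static void or_rows(uint64_t *dst, const uint64_t *src)
{
    for (int w = 0; w < WORDS; w++)
        dst[w] |= src[w];
}

/* Two-pass row-oriented transitive closure: below-diagonal entries first,
 * then above-diagonal.  For sparse relations most bit tests fail, so the
 * word-parallel row union is rarely executed -- the source of the speedup
 * the abstract claims for sparse matrices. */
void transitive_closure(uint64_t a[N][WORDS])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)
            if (test_bit(a[i], j))
                or_rows(a[i], a[j]);

    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (test_bit(a[i], j))
                or_rows(a[i], a[j]);
}
```

Because each row is touched in sequence and read at most twice, the access pattern is also friendlier to paged memory than Warshall's column-order sweeps, which matches the paper's remark about paging environments.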


High-Performance Computer Architecture | 2002

Evaluation of a multithreaded architecture for cellular computing

Călin Caşcaval; José G. Castaños; Luis Ceze; Monty M. Denneau; Manish Gupta; Derek Lieber; José E. Moreira; Karin Strauss; Henry S. Warren

Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.
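For context, sustainable-bandwidth figures like the 40 GB/s quoted above are conventionally measured with STREAM kernels such as the "triad" loop sketched below. This is a generic sketch of that kernel, not code from the paper; the names and the byte accounting follow the usual STREAM conventions.

```c
#include <stddef.h>

/* STREAM "triad": a[i] = b[i] + q * c[i].  Bandwidth is reported as
 * 24 bytes per iteration (two 8-byte reads, one 8-byte write) divided
 * by elapsed time; the timing harness is omitted here. */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```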


ACM SIGARCH Computer Architecture News | 2003

Dissecting Cyclops: a detailed analysis of a multithreaded architecture

George S. Almasi; Cǎlin Caşcaval; José G. Castaños; Monty M. Denneau; Derek Lieber; José E. Moreira; Henry S. Warren

Multiprocessor systems-on-a-chip offer a structured approach to managing complexity in chip design. Cyclops is a new family of multithreaded architectures which integrates processing logic, main memory and communications hardware on a single chip. Its simple, hierarchical design allows the hardware architect to manage a large number of components to meet the design constraints in terms of performance, power or application domain. This paper evaluates several alternative Cyclops designs with different relative costs and trade-offs. We compare the performance of several scientific kernels running on different configurations of this architecture. We show that by increasing the number of threads sharing a floating point unit we can hide fairly high cache and memory latencies. We prove that we can reach the theoretical peak performance of the chip and we identify the optimal balance of components for each application. We demonstrate that the design is well adapted to solve problems that are difficult to optimize. For example, we show that sparse matrix vector multiplication obtains 16 GFlops out of 32 GFlops of peak performance.
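The sparse matrix-vector kernel cited in the last sentence is typically written as below in compressed sparse row (CSR) form; the storage format and names are assumptions, since the paper does not reproduce the kernel here.

```c
#include <stddef.h>

/* y = A*x with A in compressed sparse row form: rowptr[i]..rowptr[i+1]
 * delimit row i's nonzeros, and colidx[k] is the column of val[k].
 * Each nonzero costs one multiply-add but also one indirect load from x,
 * which is what makes the kernel memory-bound on most machines. */
void spmv_csr(size_t nrows, const size_t *rowptr, const size_t *colidx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colidx[k]];
        y[i] = sum;
    }
}
```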


Communications of the ACM | 1977

Functions realizable with word-parallel logical and two's-complement addition instructions

Henry S. Warren

Clearly, the main conclusion of the exercise is that there is more than one way to implement recursion using stacks. The timing estimates can, of course, be criticized on the grounds that the actual computation time of N(x) and M(x) may completely dominate the amount of time spent in manipulating stacks. On the other hand, it is easy to calculate the contribution to the timing estimates of the procedure calls N(x) and M(x), and the instructions involving p(x), f(x), and g(x). Picture the computation of S(x) in terms of a binary tree with root x. There are n internal nodes and (n+1) external nodes in this tree, so there are n calculations each of N and M, (2n+1) tests involving p, and 2n assignments involving f or g. This gives a total of (6n+1) units. It is worth mentioning that the charging scheme we have adopted, whereby each basic test or assignment counts 1 unit of time irrespective of its type, can be modified without affecting the relative merits of the solutions. For instance, it may be more realistic to charge the stack operations x → A and A → x as 2 units each, 1 unit to store or retrieve x and 1 unit to change the stack pointer. Although the running times change, the improvements remain improvements. Although we have only considered stacks, it would be interesting to know whether faster solutions can be obtained using other data structures, such as queues or general arrays. Equally interesting is the possibility of obtaining some theoretical lower bound on the running time of a solution to (1).
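The defining equation (1) is not reproduced in this excerpt, so the sketch below assumes a representative scheme, S(x) = p(x) ? f(x) : g(S(N(x)), S(M(x))), and shows one explicit-stack rendering of it in C. All of p, f, g, N, and M are toy placeholders chosen only so the program runs; the frame layout and stack bound are likewise illustrative.

```c
#include <stdio.h>

#define MAXSTACK 1024

typedef int val;

/* Toy instantiation so the sketch runs: a Fibonacci-like recursion. */
static int p(val x)        { return x <= 1; }   /* leaf test          */
static val f(val x)        { return x; }        /* leaf value         */
static val g(val a, val b) { return a + b; }    /* combine children   */
static val N(val x)        { return x - 1; }    /* left successor     */
static val M(val x)        { return x - 2; }    /* right successor    */

typedef struct { val x; int stage; val left; } frame;

/* Explicit-stack evaluation of S: the two recursive calls are replaced
 * by pushed frames, and 'stage' records how far a frame has progressed. */
static val S(val x0)
{
    frame stk[MAXSTACK];
    int top = 0;
    val result = 0;

    stk[top++] = (frame){ x0, 0, 0 };
    while (top > 0) {
        frame *fr = &stk[top - 1];
        if (fr->stage == 0) {
            if (p(fr->x)) { result = f(fr->x); top--; continue; }
            fr->stage = 1;
            stk[top++] = (frame){ N(fr->x), 0, 0 };  /* first call S(N(x))  */
        } else if (fr->stage == 1) {
            fr->left = result;                       /* S(N(x)) completed   */
            fr->stage = 2;
            stk[top++] = (frame){ M(fr->x), 0, 0 };  /* second call S(M(x)) */
        } else {
            result = g(fr->left, result);            /* g(S(N(x)), S(M(x))) */
            top--;
        }
    }
    return result;
}

int main(void)
{
    printf("S(10) = %d\n", S(10));   /* prints 55 for the toy instance */
    return 0;
}
```

Counting operations for this scheme matches the text: n internal nodes each perform one call of N and one of M (2n units), every node is tested with p (2n+1 units), and f or g is applied once per step, giving the (6n+1)-unit total.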


Acta Informatica | 1978

Static main storage packing problems

Henry S. Warren

The instruction set of many computers permits referencing certain areas of main storage more efficiently than others. For example, "base-offset" addressing favors small offsets. This report discusses the problem of how to optimally assign data to storage on such a machine, subject to the restriction that the locations chosen are not to change with time. The emphasis is on truly optimal solutions, although many simplifying assumptions are made. Some of the results apply to the problem of optimally placing "read-only" files on auxiliary storage. Areas for further work are suggested.
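As a concrete illustration of the cost model, the C sketch below places items by a greedy references-per-byte rule under an assumed "short offset" limit. This is only a heuristic, not the optimal procedure the paper studies, and every name and number in it is invented; alignment constraints are ignored.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed cost model: references to offsets below CHEAP_LIMIT can use a
 * short-form instruction; anything beyond needs a longer, slower form. */
#define CHEAP_LIMIT 4096

typedef struct {
    const char *name;
    unsigned size;   /* bytes                 */
    unsigned refs;   /* static reference count */
} item;

/* Sort denser items (more references per byte) first, without division. */
static int by_density(const void *a, const void *b)
{
    const item *x = a, *y = b;
    long lhs = (long)x->refs * y->size;
    long rhs = (long)y->refs * x->size;
    return (lhs > rhs) ? -1 : (lhs < rhs);
}

int main(void)
{
    item items[] = {
        { "counter", 4,    900 }, { "buffer", 8192, 40 },
        { "table",   512,  300 }, { "flags",  4,    700 },
    };
    size_t n = sizeof items / sizeof items[0];

    qsort(items, n, sizeof items[0], by_density);

    unsigned offset = 0;
    for (size_t i = 0; i < n; i++) {
        printf("%-8s at offset %6u (%s form)\n", items[i].name, offset,
               offset + items[i].size <= CHEAP_LIMIT ? "short" : "long");
        offset += items[i].size;
    }
    return 0;
}
```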


IBM Systems Journal | 2001

Blue Gene: a vision for protein science using a petaflop supercomputer

Frances E. Allen; George S. Almasi; Wanda Andreoni; D. Beece; B. J. Berne; Arthur A. Bright; José R. Brunheroto; Călin Caşcaval; José G. Castaños; Paul W. Coteus; Paul G. Crumley; Alessandro Curioni; Monty M. Denneau; Wilm E. Donath; Maria Eleftheriou; Blake G. Fitch; B. Fleischer; C. J. Georgiou; Robert S. Germain; Mark E. Giampapa; Donna L. Gresh; Manish Gupta; Ruud A. Haring; H. Ho; Peter H. Hochschild; Susan Flynn Hummel; T. Jonas; Derek Lieber; G. Martyna; K. Maturu


Archive | 1985

Generating storage reference instructions in an optimizing compiler

Gregory J. Chaitin; Martin Edward Hopkins; Peter Willy Markstein; Henry S. Warren


Archive | 1987

Method for generating short form instructions in an optimizing compiler

Martin Edward Hopkins; Henry S. Warren


Archive | 1989

Generating efficient code for a computer with dissimilar register spaces

Martin Edward Hopkins; Henry S. Warren


Archive | 1996

Multiple execution unit dispatch with instruction shifting between first and second instruction buffers based upon data dependency

David Scott Ray; Larry Edward Thatcher; Henry S. Warren
