Richard Kenner
New York University
Publications
Featured research published by Richard Kenner.
symposium on frontiers of massively parallel computation | 1992
Susan R. Dickey; Richard Kenner
A pairwise combining switch has been implemented for use in the 16×16 processor/memory interconnection network of the NYU Ultracomputer prototype. The switch design may be extended for use in very large systems by providing greater combining capability. Methods for doing so are discussed.
Circuits Systems and Signal Processing | 1987
Susan R. Dickey; Allan Gottlieb; Richard Kenner; Yue Sheng Liu
Serialization of memory access can be a critical bottleneck in shared memory parallel computers. The NYU Ultracomputer, a large-scale MIMD (multiple instruction stream, multiple data stream) shared memory architecture, may be viewed as a column of processors and a column of memory modules connected by a rectangular network of enhanced 2×2 buffered crossbars. These VLSI nodes enable the network to combine multiple requests directed at the same memory location. Such requests include a new coordination primitive, fetch-and-add, which permits task coordination to be achieved in a highly parallel manner. Processing within the network is used to reduce serialization at the memory modules. To avoid large network latency, the VLSI network nodes must be high-performance components. Design tradeoffs between architectural features, asymptotic performance requirements, cycle time, and packaging limitations are complex. This report sketches the Ultracomputer architecture and discusses the issues involved in the design of the VLSI enhanced buffered crossbars which are the key element in reducing serialization.
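The fetch-and-add primitive and the combining of colliding requests described in this abstract can be made concrete with a small sketch. The following C program is an editorial illustration, not the switch hardware; the names (faa_request, combine, decombine, memory_faa) are ours. It demonstrates the invariant the design relies on: two combined fetch-and-adds return the same values, and leave the same memory contents, as if they had been serviced one after the other.

/* Illustrative sketch of fetch-and-add combining; all names are
 * ours, not from the Ultracomputer papers.                          */
#include <stdio.h>

typedef struct {
    int addend;   /* increment carried by the request             */
    int reply;    /* value returned to the issuing processor      */
} faa_request;

/* Switch node, forward path: merge two fetch-and-adds aimed at the
 * same address into one request carrying the sum of the addends;
 * remember the first addend so the replies can be split on return. */
static int combine(const faa_request *a, const faa_request *b,
                   int *saved_addend)
{
    *saved_addend = a->addend;     /* held in the "wait buffer"    */
    return a->addend + b->addend;  /* forwarded toward memory      */
}

/* Memory module: serve fetch-and-add atomically, return old value. */
static int memory_faa(int *cell, int combined_addend)
{
    int old = *cell;
    *cell += combined_addend;
    return old;
}

/* Switch node, return path: split one memory reply into two, as if
 * the requests had been serviced serially.                         */
static void decombine(int memory_reply, int saved_addend,
                      faa_request *a, faa_request *b)
{
    a->reply = memory_reply;                 /* a appears first     */
    b->reply = memory_reply + saved_addend;
}

int main(void)
{
    int cell = 100;                      /* shared memory location  */
    faa_request a = { .addend = 3 }, b = { .addend = 5 };
    int saved;
    int fwd = combine(&a, &b, &saved);   /* one request leaves      */
    int got = memory_faa(&cell, fwd);    /* one memory access       */
    decombine(got, saved, &a, &b);
    printf("a got %d, b got %d, cell is now %d\n", a.reply, b.reply, cell);
    /* a got 100, b got 103, cell is now 108: same as serial service */
    return 0;
}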
acm symposium on parallel algorithms and architectures | 1992
Susan R. Dickey; Richard Kenner
Scalability refers to the extent to which a system approaches perfect (linear) speedup with large numbers of processors. Fundamental physical laws limit scalability, but in practical terms, bottlenecks due to serialization points at critical sections are often the most immediate limitation on the scalability of an architecture. The NYU Ultracomputer project has developed algorithms without software critical sections that depend on an interconnection network with hardware combining. Such a network has the ability to merge concurrent hot-spot requests into a single memory reference without introducing significant additional delay. Hardware combining is not without cost, but our experience in implementing a combining switch indicates that the cost is much less than is widely believed. We describe the design choices made in implementing the switch to keep network bandwidth high and latency low. We compare the cost of a combining switch to that of a non-combining switch and discuss the scalability of the design we have implemented to large numbers of processors.

1 Limits on scalability

The term scalability is conventionally used to refer to the extent to which a computing system maintains a speedup linear with the system size on a range of applications. Scalability is a metric used to compare two architectures, not a binary attribute of an architecture. As an asymptotic measure, it is not relevant in comparing two systems of a given size. The least scalable component of a system limits its scalability. For example, a single system bus, with its fixed bandwidth, becomes saturated as the number of processors grows. In order to scale, such a system would need bus bandwidth increasing linearly with the number of processors. Often the least scalable component is the software. It is commonly believed ("Amdahl's Law") that there is a fixed portion of many applications that is not parallelizable; such applications impose their own scalability limit. On the other hand, applications which do only a very small amount of communication (such as Monte Carlo systems with no shared data) are perfectly scalable on most architectures and do not permit scalability comparisons. We are interested in those applications that perform O(1) communication requests per processor per unit time. To avoid limiting scalability, the servers in the system (e.g., memory modules) must also each be able to handle O(1) requests, as must the communications network.

Physical laws affect the ultimate scalability of any architecture. The total volume required for a processing system increases at least linearly with N, the number of processors. The restriction that information cannot be transferred faster than the speed of light and must travel across an increasing volume provides a lower bound on communication costs. Other physical limits, such as information density [9], may also be relevant. In machines with a globally-shared, uniformly-accessible memory, this means that memory latency must necessarily grow with machine size. For message-passing systems, it means that the average transit time of the message must similarly grow. For sufficiently large systems containing N processors, this limitation implies a communication time of O(N^(1/3)).

These latencies need not affect scalability. For example, an architecture might migrate a computation to the processor from which it next requires data. Such a system, though quite scalable, makes very inefficient use of the communications bandwidth. Alternatively, enough computations may be available at the processor (either through pipelining or multiple contexts) to permit work to be performed while waiting for a reply. In this case, the size of the processor is no longer constant and, for sufficiently large system sizes, would result in O(N^(1/3)) contexts per processor with an equal latency. In any such case, the number of concurrent communications must be increased from O(N) to O(N × latency) to prevent processor idle time.
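The bounds invoked in this passage can be written out explicitly. What follows is a sketch of the standard arguments in our own notation, not reproduced from the paper (the paper states these bounds with O-notation, which we follow):

% Amdahl's Law: a fixed non-parallelizable fraction s caps speedup
% regardless of the number of processors N.
S(N) = \frac{1}{s + (1 - s)/N} \;\le\; \frac{1}{s}

% Speed-of-light bound: total volume grows at least linearly in N, so
% the linear dimension, and hence the worst-case signal distance, grows
% as the cube root of N.
V = O(N) \;\Rightarrow\; d = O(V^{1/3}) = O(N^{1/3})
\;\Rightarrow\; t_{\text{comm}} = O(N^{1/3})

% Latency hiding: keeping N processors busy requires in-flight traffic
% proportional to N times the latency.
C(N) = O(N \cdot t_{\text{comm}}) = O(N \cdot N^{1/3}) = O(N^{4/3})

The last line is the source of the paper's remark that the number of concurrent communications must grow from O(N) to O(N × latency) to prevent processor idle time.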
hawaii international conference on system sciences | 1989
Richard Kenner; Susan R. Dickey; Patricia J. Teller
…a disproportionate share of memory requests. If these requests are serviced serially, a bottleneck arises. To further understand proposed solutions to the hot-spot problem, consider a system with a log N stage interconnection network.
[Figure: forward-path components FPC(i,j) and return-path components RPC(i,j) in a log N stage combining network; only this residue of the diagram survives.]
tri-ada | 1994
Richard Kenner
The GNAT compiler [GNAT94] couples an Ada 9X front end with the GNU C compiler (GCC). Here we describe the manner in which the GNAT front end is integrated into the GCC compilation system. We present a summary of GCC and the GNAT front end, describe the equivalences between GNAT and GCC objects, discuss the operation of Gigi, the GNAT-to-GCC "converter", list the changes required to GCC, and finally present the issues of exceptions and checked arithmetic.
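The "checked arithmetic" issue named in this abstract amounts to expanding each Ada arithmetic operation into an overflow test plus a call to the runtime's exception-raising machinery. The C fragment below is our illustrative sketch of such a lowering; the names (check_add, raise_constraint_error) are hypothetical stand-ins, not GNAT's internal API.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for Ada's Constraint_Error machinery (hypothetical name). */
static void raise_constraint_error(const char *where)
{
    fprintf(stderr, "Constraint_Error raised at %s\n", where);
    exit(1);
}

/* What an overflow-checked Ada "A + B" might lower to: test first,
 * raise if the mathematical result will not fit, otherwise add.     */
static int check_add(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        raise_constraint_error("check_add");
    return a + b;
}

int main(void)
{
    printf("%d\n", check_add(2, 3));        /* prints 5               */
    printf("%d\n", check_add(INT_MAX, 1));  /* raises and exits       */
    return 0;
}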
Vlsi Design | 1995
Susan R. Dickey; Richard Kenner
A description is given of the ways in which the environment of a highly parallel, high-latency interconnection network is different from that encountered in a uniprocessor system. The impact of these differences on the design of the processing elements is discussed. Methods that can be used to evaluate the impact of architectural choices on the performance of any system that uses a similar network are examined. Two detailed designs of processing elements, one using a CISC (complex-instruction-set computer) processor and the other using a RISC (reduced-instruction-set computer) processor, are given as examples.
ieee computer society international conference | 1989
Richard Kenner; Ronald Bianchini; Susan R. Dickey; Patricia J. Teller
We present the design for the two VLSI components used in a processor-to-memory interconnection network for a shared memory system. These components allow the combining of requests that are destined to the same memory location. The design contains both semi-systolic queues and an associative “wait buffer.” Transition equations and schematics of the critical pieces of the design are included.
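The "wait buffer" named in this abstract is an associative store on the switch's return path: when two requests are combined, the information needed to regenerate the second reply is held keyed on the memory address, and the single reply from memory is matched against it on the way back. Below is a minimal sketch of that lookup structure with invented names (wait_buffer, wb_insert, wb_match); the paper itself gives transition equations and schematics, not code.

#include <stdbool.h>
#include <stdio.h>

#define WB_SLOTS 8  /* real hardware would size this to its queue depth */

typedef struct {
    bool     valid;
    unsigned addr;     /* address the combined request targeted        */
    int      addend;   /* first request's addend, needed to split reply*/
    int      ret_port; /* where the second (decombined) reply must go  */
} wb_entry;

static wb_entry wait_buffer[WB_SLOTS];

/* Forward path: record a combined pair; refuse if the buffer is full
 * (in which case the switch must simply not combine this pair).      */
static bool wb_insert(unsigned addr, int addend, int ret_port)
{
    for (int i = 0; i < WB_SLOTS; i++)
        if (!wait_buffer[i].valid) {
            wait_buffer[i] = (wb_entry){ true, addr, addend, ret_port };
            return true;
        }
    return false;
}

/* Return path: every valid entry is compared against the reply's
 * address at once in hardware; here, a loop stands in for that.      */
static bool wb_match(unsigned addr, int reply,
                     int *second_reply, int *ret_port)
{
    for (int i = 0; i < WB_SLOTS; i++)
        if (wait_buffer[i].valid && wait_buffer[i].addr == addr) {
            *second_reply = reply + wait_buffer[i].addend;
            *ret_port     = wait_buffer[i].ret_port;
            wait_buffer[i].valid = false;
            return true;
        }
    return false;  /* reply was not for a combined request */
}

int main(void)
{
    wb_insert(0x40, 3, 1);  /* combined F&A(0x40,3) + F&A(0x40,5)      */
    int second, port;
    if (wb_match(0x40, 100, &second, &port))
        printf("second reply %d to port %d\n", second, port); /* 103, 1 */
    return 0;
}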
Archive | 1985
Susan R. Dickey; Richard Kenner; Marc Snir; Jon A. Solworth
The ultracomputer architecture connects hundreds or possibly thousands of processing elements (PEs), each containing a cache but no local memory, to an equal number of memory modules (MMs) via an interconnection network constructed of custom VLSI components that combine (merge) requests from different PEs destined for the same memory location. The network provides a high-bandwidth path from the PEs to the MMs, but the memory latency is significantly larger than that encountered in either a uniprocessor or a small bus-based multiprocessor. The PE design must therefore exploit the high available bandwidth while minimizing the effect of network latency. The authors present a design of a 64-PE ultracomputer prototype using AMD Am29000 CPUs, including a description of system packaging using a backplane-free technology. The prototype is expected to be operational in 1990.
hawaii international conference on system sciences | 1988
Patricia J. Teller; Richard Kenner; Marc Snir
Archive | 1986
Susan R. Dickey; Richard Kenner; Marc Snir