Publication


Featured research published by Monty M. Denneau.


high-performance computer architecture | 2002

Evaluation of a multithreaded architecture for cellular computing

Călin Caşcaval; José G. Castaños; Luis Ceze; Monty M. Denneau; Manish Gupta; Derek Lieber; José E. Moreira; Karin Strauss; Henry S. Warren

Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.
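The sustainable-bandwidth figure above comes from the STREAM benchmark. As a rough illustration of how STREAM arrives at such a number, the sketch below times the "triad" kernel (a[i] = b[i] + q*c[i]) and divides the bytes touched by the elapsed time; the array size and this pure-Python form are our own, not the paper's measurement setup.

```python
# Illustrative STREAM-style "triad" bandwidth measurement:
# a(i) = b(i) + q * c(i); bandwidth = bytes moved / elapsed time.
# Array size and scalar q are arbitrary choices for this sketch.
import time

def stream_triad(n, q=3.0):
    b = [1.0] * n
    c = [2.0] * n
    a = [0.0] * n
    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + q * c[i]
    elapsed = time.perf_counter() - start
    # Three arrays of 8-byte doubles are touched per iteration.
    bytes_moved = 3 * 8 * n
    return a, bytes_moved / elapsed  # bytes per second

a, bw = stream_triad(100_000)
```

A real STREAM run uses compiled code and arrays far larger than any cache, so the reported rate reflects memory rather than interpreter speed.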


international parallel processing symposium | 1994

Architecture and implementation of Vulcan

Craig B. Stunkel; Monty M. Denneau; Ben J. Nathanson; Dennis G. Shea; Peter H. Hochschild; M. Tsao; Bulent Abali; Douglas J. Joseph; P.R. Varker

IBM's recently announced Scalable POWERparallel family of systems is based upon the Vulcan architecture, and the currently available 9076 SP1 parallel system utilizes fundamental Vulcan technology. The experimental Vulcan parallel processor is designed to scale to many thousands of microprocessor-based nodes. To support a machine of this size, the nodes and network incorporate a number of unusual features to scale aggregate bandwidth, enhance reliability, diagnose faults, and simplify cabling. The multistage Vulcan network is a unified data and service network driven by a single oscillator. An attempt is made to detect all network errors via cyclic redundancy checking (CRC) and component shadowing. Switching elements contain a dynamically allocated shared buffer for storing blocked packet flits from any input port. This paper describes the key elements of Vulcan's hardware architecture and implementation details of the Vulcan prototype.


ACM Sigarch Computer Architecture News | 2003

Dissecting Cyclops: a detailed analysis of a multithreaded architecture

George S. Almasi; Cǎlin Caşcaval; José G. Castaños; Monty M. Denneau; Derek Lieber; José E. Moreira; Henry S. Warren

Multiprocessor systems-on-a-chip offer a structured approach to managing complexity in chip design. Cyclops is a new family of multithreaded architectures which integrates processing logic, main memory and communications hardware on a single chip. Its simple, hierarchical design allows the hardware architect to manage a large number of components to meet the design constraints in terms of performance, power or application domain. This paper evaluates several alternative Cyclops designs with different relative costs and trade-offs. We compare the performance of several scientific kernels running on different configurations of this architecture. We show that by increasing the number of threads sharing a floating point unit we can hide fairly high cache and memory latencies. We prove that we can reach the theoretical peak performance of the chip and we identify the optimal balance of components for each application. We demonstrate that the design is well adapted to solve problems that are difficult to optimize. For example, we show that sparse matrix vector multiplication obtains 16 GFlops out of 32 GFlops of peak performance.
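Sparse matrix-vector multiplication, the kernel cited above, is commonly expressed in compressed sparse row (CSR) form. The sketch below shows the kernel's structure; the CSR layout is standard, but the variable names and the tiny example matrix are ours, not taken from the paper.

```python
# Illustrative sparse matrix-vector multiply (y = A @ x) in CSR form.
# CSR stores nonzero values, their column indices, and per-row offsets.
def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row live at positions row_ptr[row]..row_ptr[row+1]-1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# 2x2 example: A = [[1, 2], [0, 3]], x = [1, 1]
y = spmv_csr([1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3], [1.0, 1.0])
# y == [3.0, 3.0]
```

The irregular, indirect access to `x` is what makes this kernel hard to optimize on conventional memory hierarchies, and why its 16 GFlops on Cyclops is notable.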


International Journal of Parallel Programming | 2002

Demonstrating the Scalability of a Molecular Dynamics Application on a Petaflops Computer

George S. Almasi; Calin Cascaval; José G. Castaños; Monty M. Denneau; Wilm E. Donath; Maria Eleftheriou; Mark E. Giampapa; C. T. Howard Ho; Derek Lieber; José E. Moreira; Dennis M. Newns; Marc Snir; Henry S. Warren

The IBM Blue Gene/C parallel computer aims to demonstrate the feasibility of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges in this project is showing that applications can successfully scale to this massive amount of parallelism. In this paper we demonstrate that the simulation of protein folding using classical molecular dynamics falls in this category. Starting from the sequential version of a well known molecular dynamics code, we developed a new parallel implementation that exploited the multiple levels of parallelism present in the Blue Gene/C cellular architecture. We performed both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.


midwest symposium on circuits and systems | 2005

Duty cycle measurement and correction using a random sampling technique

Rashed Zafar Bhatti; Monty M. Denneau; Jeffrey Draper

A specific value of the duty cycle of an on-chip clock or signal is often of extreme significance in VLSI circuits such as DRAMs, dynamic/domino pipelined circuits, pipelined analog-to-digital converters (ADCs) and serializer/deserializer (SERDES) circuits, which are sensitive to the duty cycle or where operations are synchronized with both transitions of the clock. This paper introduces a novel idea based on a random sampling technique of inferential statistics for measurement and local correction of the duty cycle of high-speed on-chip signals. The high measurement accuracy achievable through the proposed random sampling technique provides a way to correct the duty cycle with a maximum error of less than half the smallest delay resolution unit available for correction. An input signal with duty cycle from 30% to 70% can be adjusted to a wide range of values within this range using a purely digital, area-efficient standard cell based design. Our experimental results gathered through extensive simulations of the proposed circuit manifest a very close correlation to the expected theoretical results.
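The statistical idea behind the paper is simple to sketch in software: sample a periodic signal at uniformly random instants, and the fraction of samples caught "high" converges to the duty cycle, with accuracy set by the sample size. The signal model and parameters below are our own illustration, not the paper's circuit.

```python
# Sketch of duty-cycle estimation by random sampling: the fraction of
# random samples that land in the "high" portion of the period estimates
# the duty cycle. Signal model, sample count, and seed are illustrative.
import random

def signal(t, period=1.0, duty=0.4):
    # Ideal clock: high for the first `duty` fraction of each period.
    return 1 if (t % period) < duty * period else 0

def estimate_duty_cycle(n_samples, period=1.0, duty=0.4, seed=42):
    rng = random.Random(seed)
    highs = sum(signal(rng.uniform(0, 1000 * period), period, duty)
                for _ in range(n_samples))
    return highs / n_samples

est = estimate_duty_cycle(200_000)  # converges toward 0.4
```

By the binomial error formula, the standard error of the estimate shrinks as 1/sqrt(n), which is how a desired measurement accuracy translates into a required sample size.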


international conference on supercomputing | 2001

Demonstrating the scalability of a molecular dynamics application on a Petaflop computer

George S. Almasi; Calin Cascaval; José G. Castaños; Monty M. Denneau; Wilm E. Donath; Maria Eleftheriou; Mark E. Giampapa; C. T. Howard Ho; Derek Lieber; José E. Moreira; Dennis M. Newns; Marc Snir; Henry S. Warren

The IBM Blue Gene project has undertaken the development of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges of this project is demonstrating that applications can successfully exploit this massive amount of parallelism. Starting from the sequential version of a well known molecular dynamics code, we developed a new application that exploits the multiple levels of parallelism in the Blue Gene cellular architecture. We perform both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.


parallel computing | 1993

The GF11 parallel computer

Manoj Kumar; Yurij Andrij Baransky; Monty M. Denneau

GF11 is a parallel computer operational at IBM's T.J. Watson Research Center. It is based on the SIMD (Single Instruction Multiple Data) model of parallel computing. GF11 attains its peak execution rate of 11.3 GigaFlops by using 566 identical processing elements, each capable of delivering 20 MegaFlops. Each processor has its own 64 Kb static RAM that can access a 32-bit word on each floating point operation, a 2 Mb dynamic RAM that operates at one fourth of the SRAM speed, and a 1 Kb register file that provides four accesses per floating point operation. The processors communicate through a 576×576 Benes network, organized as three stages of 24×24 crossbar switches. The network provides 11.3 Gb/sec of communication bandwidth to the processors and allows the processors to dynamically reconfigure themselves into arrays of various dimensions and sizes or other interesting interconnection patterns such as a tree, hypercube, etc. This reconfiguration can take place on every word transfer without sacrificing the bandwidth. GF11 has several architectural enhancements to circumvent the limitations of the standard SIMD model, such as the ability to perform multiple operations in every instruction and the ability to modify the operations occurring within individual processors based on processor-specific data. Preliminary benchmarking efforts on some applications indicate that near peak performance can be sustained on most applications, including some that were previously believed to be ill suited to SIMD machines. Minimal restructuring of programs and algorithms is required for achieving this performance. The architecture of GF11 is summarized in this paper, and the implementations of Finite Element analysis, LU decomposition, Gaussian Elimination, and Fast Fourier Transform are discussed to illustrate GF11's ability to deliver good performance with minor program restructuring.


international symposium on circuits and systems | 2006

Phase measurement and adjustment of digital signals using random sampling technique

Rashed Zafar Bhatti; Monty M. Denneau; Jeffrey Draper

This paper introduces a technique to measure and adjust the relative phase of on-chip high-speed digital signals using a random sampling technique of inferential statistics. The proposed technique as applied to timing uncertainty mitigation in the signaling of a digital system is presented as an example; the relative phase information is used to minimize the timing skew. The proposed circuit captures the state of the signals under measurement simultaneously at random instants of time and gathers a large data sample to estimate the relative phase between the signals. By carefully choosing the sample size, the accuracy and confidence of the result can be set to a level as high as desired. The accurately sensed value of relative phase enables the correction circuit to reduce the maximum correction error to less than half the maximum delay resolution unit available for adjustment. A purely standard cell based circuit design approach is used that reduces the overall design time and circuit complexity. The test results of the proposed circuit manifest a very close correlation to the simulated and theoretically expected results. The random sampling unit (RSU) circuit proposed for phase measurement in this paper occupies 3350 μm² in 130 nm technology, which is an order of magnitude smaller than what is required for its analog equivalent in the same technology.
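The same random-sampling statistics can estimate relative phase: for two same-frequency 50%-duty square waves offset by less than half a period, the fraction of random sampling instants at which the signals disagree is proportional to the offset. The signal model and numbers below are our illustration of that principle, not the paper's circuit.

```python
# Sketch of relative-phase estimation by random sampling: sample two
# square waves at random instants; for an offset phi < T/2 the
# disagreement probability is 2*phi/T. Parameters are illustrative.
import random

def square(t, period=1.0, phase=0.0):
    return 1 if ((t - phase) % period) < period / 2 else 0

def estimate_phase(n_samples, period=1.0, phase=0.2, seed=7):
    rng = random.Random(seed)
    disagree = 0
    for _ in range(n_samples):
        t = rng.uniform(0, 1000 * period)
        if square(t, period) != square(t, period, phase):
            disagree += 1
    # Invert P(disagree) = 2 * phase / period to recover the offset.
    return (disagree / n_samples) * period / 2

est = estimate_phase(200_000)  # converges toward 0.2
```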


Network Processor Design: Issues and Practices, Volume 2 | 2004

Chapter 2 – A Programmable, Scalable Platform for Next-Generation Networking

Christos John Georgiou; Valentina Salapura; Monty M. Denneau

This chapter describes a scalable parallel network processor architecture for handling next-generation storage networking at line speeds of 10 Gb/s or higher. By using many simple but general-purpose processors and embedded memory, high levels of processing power per mm² of silicon area can be achieved, making this architecture ideally suited to the computationally intensive conversion of protocols required by current and emerging storage networks. The coarse-grain parallelization of protocol tasks and synchronization via queues and message passing are a good match for the parallel processor core environment. Simulations show that a chip with fewer than 16 processor cores can easily handle the protocol conversion between 10 Gb/s Fibre Channel and InfiniBand networks. Larger configurations are used for the implementation of Small Computer System Interface (SCSI) protocols.
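The coarse-grain pattern the chapter describes — protocol tasks as pipeline stages synchronized by queues and message passing — can be sketched in a few lines of threaded code. The stage functions and "protocol" fields below are invented purely for illustration; they stand in for the parse/convert tasks a real network processor would run on separate cores.

```python
# Toy sketch of coarse-grain protocol-task parallelization: each stage
# is a thread that consumes from an input queue and produces to an
# output queue; a None sentinel shuts the pipeline down in order.
import queue
import threading

def make_stage(fn, inq, outq):
    def run():
        while True:
            pkt = inq.get()
            if pkt is None:          # sentinel: forward it and stop
                outq.put(None)
                return
            outq.put(fn(pkt))
    return threading.Thread(target=run)

parse_q, convert_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    make_stage(lambda p: {"payload": p}, parse_q, convert_q),      # parse task
    make_stage(lambda p: {**p, "proto": "IB"}, convert_q, out_q),  # convert task
]
for s in stages:
    s.start()
for pkt in ["p0", "p1"]:
    parse_q.put(pkt)
parse_q.put(None)

results = []
while (item := out_q.get()) is not None:
    results.append(item)
```

Because each stage owns its queue, stages need no shared locks beyond the queues themselves — the property that makes this style map cleanly onto many simple cores.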


Computer-aided Design | 1983

Design and implementation of a software simulation engine

Monty M. Denneau; Eric Paul Kronstadt; Gregory F. Pfister

Gate-level logic simulation takes up more CPU time as system complexity increases. A special-purpose system which can cut verification time by several orders of magnitude is described. The Yorktown Simulation Engine (YSE) is a highly parallel programmable machine which can simulate up to 1M gates at a speed of over 2000M gate simulations per second. It is estimated that the IBM 3081 processor could have been simulated at over 1000 instructions per second on the YSE. Gate-level logic simulation is reviewed, and the architecture and hardware implementation of the YSE are described. The software architecture, including the compiler, linker, and the register-level language translator Ysetran, is also detailed.
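The computation the YSE parallelizes in hardware is, at its core, evaluating a netlist of gates in dependency order. The minimal sketch below shows that sequential core; the netlist format and operator set are our own invention for illustration, not YSE's instruction format.

```python
# Minimal sketch of gate-level logic simulation: evaluate two-input
# gates in topological order, one pass per simulated cycle.
def simulate(netlist, inputs):
    # netlist: list of (output, op, in_a, in_b), topologically sorted
    ops = {
        "AND":  lambda a, b: a & b,
        "OR":   lambda a, b: a | b,
        "NAND": lambda a, b: 1 - (a & b),
        "XOR":  lambda a, b: a ^ b,
    }
    values = dict(inputs)
    for out, op, a, b in netlist:
        values[out] = ops[op](values[a], values[b])
    return values

# Half adder: sum = a XOR b, carry = a AND b
net = [("sum", "XOR", "a", "b"), ("carry", "AND", "a", "b")]
v = simulate(net, {"a": 1, "b": 1})
# v["sum"] == 0, v["carry"] == 1
```

The YSE's speed comes from evaluating many such gates concurrently on dedicated hardware rather than one at a time in a loop as here.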
