Robert K. Montoye | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Robert K. Montoye is active.

Explore More

Publication

Featured researches published by Robert K. Montoye.

IEEE Journal of Solid-state Circuits | 2008

An 8T-SRAM for Variability Tolerance and Low-Voltage Operation in High-Performance Caches

Leland Chang; Robert K. Montoye; Yutaka Nakamura; Kevin A. Batson; Richard J. Eickemeyer; Robert H. Dennard; Wilfried Haensch; Damir A. Jamsek

An eight-transistor (8T) cell is proposed to improve variability tolerance and low-voltage operation in high-speed SRAM caches. While the cell itself can be designed for exceptional stability and write margins, array-level implications must also be considered to achieve a viable memory solution. These constraints can be addressed by modifying traditional 6T-SRAM techniques and conceding some design complexity and area penalties. Altogether, 8T-SRAM can be designed without significant area penalty over 6T-SRAM while providing substantially improved variability tolerance and low-voltage operation with no need for secondary or dynamic power supplies. The proposed 8T solution is demonstrated in a high-performance 32 kb subarray designed in 65 nm PD-SOI CMOS that operates at 5.3 GHz at 1.2 V and 295 MHz at 0.41 V.

Ibm Journal of Research and Development | 1990

Design of the IBM RISC System/6000 floating-point execution unit

Robert K. Montoye; Erdem Hokenek; Stephen Larry Runyon

The IBM RISC System/6000 (RS/6000) floating-point unit (FPU) exemplifies a second-generation RISC CPU architecture and an implementation which greatly increases floating-point performance and accuracy. The key feature of the FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation ({ital A} {times} {ital B}) + {ital C} as an indivisible operation. This single functional unit reduces the latency for chained floating-point operations, as well as rounding errors and chip busing. It also reduces the number of adders/normalizers by combining the addition required for fast multiplication with accumulation. The MAF unit is made practical by a unique fast-shifter, which eases the overlap of multiplication and addition, and a leading-zero/one anticipator, which eases overlap of normalization and addition. The accumulate instruction required by this architecture reduces the instruction path length by combining two instructions into one. Additionally, the RS/6000 FPU is tightly coupled to the rest of the CPU, unlike typical floating-point coprocessor chips.

custom integrated circuits conference | 2011

A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons

Jae-sun Seo; Bernard Brezzo; Yong Liu; Benjamin D. Parker; Steven K. Esser; Robert K. Montoye; Bipin Rajendran; Jose A. Tierno; Leland Chang; Dharmendra S. Modha; Daniel J. Friedman

Efforts to achieve the long-standing dream of realizing scalable learning algorithms for networks of spiking neurons in silicon have been hampered by (a) the limited scalability of analog neuron circuits; (b) the enormous area overhead of learning circuits, which grows with the number of synapses; and (c) the need to implement all inter-neuron communication via off-chip address-events. In this work, a new architecture is proposed to overcome these challenges by combining innovations in computation, memory, and communication, respectively, to leverage (a) robust digital neuron circuits; (b) novel transposable SRAM arrays that share learning circuits, which grow only with the number of neurons; and (c) crossbar fan-out for efficient on-chip inter-neuron communication. Through tight integration of memory (synapses) and computation (neurons), a highly configurable chip comprising 256 neurons and 64K binary synapses with on-chip learning based on spike-timing dependent plasticity is demonstrated in 45nm SOI-CMOS. Near-threshold, event-driven operation at 0.53V is demonstrated to maximize power efficiency for real-time pattern classification, recognition, and associative memory tasks. Future scalable systems built from the foundation provided by this work will open up possibilities for ubiquitous ultra-dense, ultra-low power brain-like cognitive computers.

Proceedings of the IEEE | 2010

Practical Strategies for Power-Efficient Computing Technologies

Leland Chang; David J. Frank; Robert K. Montoye; Steven J. Koester; Brian L. Ji; Paul W. Coteus; Robert H. Dennard; Wilfried Haensch

After decades of continuous scaling, further advancement of silicon microelectronics across the entire spectrum of computing applications is today limited by power dissipation. While the trade-off between power and performance is well-recognized, most recent studies focus on the extreme ends of this balance. By concentrating instead on an intermediate range, an ~ 8× improvement in power efficiency can be attained without system performance loss in parallelizable applications-those in which such efficiency is most critical. It is argued that power-efficient hardware is fundamentally limited by voltage scaling, which can be achieved only by blurring the boundaries between devices, circuits, and systems and cannot be realized by addressing any one area alone. By simultaneously considering all three perspectives, the major issues involved in improving power efficiency in light of performance and area constraints are identified. Solutions for the critical elements of a practical computing system are discussed, including the underlying logic device, associated cache memory, off-chip interconnect, and power delivery system. The IBM Blue Gene system is then presented as a case study to exemplify several proposed directions. Going forward, further power reduction may demand radical changes in device technologies and computer architecture; hence, a few such promising methods are briefly considered.

IEEE Journal of Solid-state Circuits | 1990

Second-generation RISC floating point with multiply-add fused

Erdem Hokenek; Robert K. Montoye; Peter W. Cook

A 440000-transistor second-generation RISC (reduced instruction set computer) floating-point chip is described. The pipeline latency is only two cycles, and a double-precision result is produced every cycle. System throughput and accuracy are increased by using a floating-point multiply-add-fused unit, which carries out a double-precision accumulate as a two-cycle pipelined execution with only one rounding error. While the cycle time (40 ns) is competitive with other CMOS RISC systems, the floating-point performance stretches to the range of bipolar RISC systems (7.4-13 MFLOPS LINPACK). Leading zero anticipation makes the two-cycle pipeline possible by nearly eliminating the additional postnormalization time, and it allows for reduced overall system latency. Partial decode shifters allow complete time sharing for the multiply and data alignment. Improved design techniques for logarithmic addition and higher order counters for multiplication complete this second-generation RISC floating-point unit design. >

symposium on vlsi circuits | 2010

A fully-integrated switched-capacitor 2∶1 voltage converter with regulation capability and 90% efficiency at 2.3A/mm 2

Leland Chang; Robert K. Montoye; Brian L. Ji; Alan J. Weger; Kevin Stawiasz; Robert H. Dennard

A switched-capacitor DC-DC voltage converter in 45nm SOI CMOS leverages on-chip trench capacitors to achieve 90% efficiency at an output of 2.3A/mm2 for 2V-to-0.95V conversion at 100MHz. Operation in step-up and step-down modes is demonstrated. Combined with stacked voltage domains, self-regulation capability enables further efficiency improvement.

Ibm Journal of Research and Development | 1990

Leading-zero anticipator (LZA) in the IBM RISC System/6000 floating-point execution unit

Erdem Hokenek; Robert K. Montoye

This paper presents a technique used in the multiply-add-fused (MAF) unit of the IBM RISC System/6000 (RS/6000) processor for normalizing the floating-point results. Unlike the conventional procedures applied thus far, the so-called leading-zero anticipator (LZA) of the RS/6000 carries out processing of the leading zeros and ones in parallel with floating-point addition. Therefore, the new circuitry reduces the total latency of the MAF unit by enabling the normalization and addition to take place in a single cycle.

symposium on vlsi circuits | 2007

A 5.3GHz 8T-SRAM with Operation Down to 0.41V in 65nm CMOS

Leland Chang; Yutaka Nakamura; Robert K. Montoye; Jun Sawada; Andrew K. Martin; Kiyofumi Kinoshita; Fadi H. Gebara; Kanak B. Agarwal; Dhruva Acharyya; Wilfried Haensch; Kohji Hosokawa; Damir A. Jamsek

A 32 kb subarray demonstrates practical implementation of a 65 nm node 8T-SRAM cell for variability tolerance in highspeed caches. Ideal cell stability allows single-supply operation down to 0.41 V at 295 MHz without dynamic voltage techniques. Despite a larger cell, array area is competitive with 6T-SRAM due to higher array efficiency. With an LSDL decoder, a gated diode sense amplifier, and design tradeoffs enabled by the 8 T cell, 5.3 GHz operation at 1.2 V is achieved.

Ibm Journal of Research and Development | 1998

A decompression core for powerPC

Timothy Michael Kemp; Robert K. Montoye; Jeffrey D. Harper; John Davis Palmer; Daniel J. Auerbach

Code size efficiency is a critical parameter in the design of computer systems for embedded applications. This paper describes a method for improving code size efficiency involving the use of compression techniques to reduce the size of the stored code, and on-the-fly hardware decompression at full processor speed for execution. A simple frequency-based encoding scheme for PowerPC® code achieves a typical code size reduction to 60% of the original size. A corresponding decompression core has been implemented for an embedded microprocessor, such as the PowerPC 401TM. The compression/decompression scheme operates in a manner transparent to the processor and requires no changes to such tools as compilers, linkers, and loaders.

Ibm Journal of Research and Development | 1990

The IBM RISC System/6000 processor: hardware overview

H. B. Bakoglu; Gregory F. Grohoski; Robert K. Montoye

A highly concurrent superscalar second-generation family of RISC workstations and servers is described. The RISC System/6000 family is based on the new IBM POWER (performance optimization with enhanced RISC) architecture; the hardware implementation takes advantage of this powerful RISC architecture and employs sophisticated design techniques to achieve a short cycle time and a low cycles-per-instruction (CPI) ratio. The RS/6000 CPU features multiple-instruction dispatch, multiple functional units that operate concurrently, separate instruction and data caches, and zero-cycle branches. In this superscalar implementation, at a given cycle the equivalent of five operations can be executed simultaneously ( a branch, a condition-register operation, and a floating-point multiply-add).

Explore More