Publication


Featured research published by Bruce M. Fleischer.


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.


International Conference on Computer Aided Design | 1996

Static timing analysis for self resetting circuits

Vinod Narayanan; Barbara Chappell; Bruce M. Fleischer

Static timing analysis techniques are widely used to verify the timing behavior of large digital designs implemented predominantly in conventional static CMOS. These techniques, however, are not sufficient to completely verify the dynamic circuit families now finding favor in high-performance designs. In this paper, we describe an approach that extends static timing analysis to a high-performance dynamic CMOS logic family called self-resetting CMOS (SRCMOS). Due to the circuit structure employed in SRCMOS, designs naturally decompose into a hierarchy of gates and macros; timing analysis must address and preferably exploit this hierarchy. At the gate level, three categories of constraints on pulse timing arise from considering the effects of pulse width, overlap, and collisions. Timing analysis is performed at the macro level, by a) performing timing tests at macro boundaries and b) using macro-level delay models. We define various macro-level timing tests which ensure that fundamental gate-level timing constraints are satisfied. We extend the standard delay model to handle leading and trailing edges of signal pulses, across-chip variations, trading of signals, and slow and fast operating conditions. We have developed an SRCMOS timing analyzer based on this approach; the analyzer is implemented as extensions to a standard static timing analysis program, thus facilitating its integration into an existing design system and methodology.


IBM Journal of Research and Development | 2004

The IBM eServer z990 floating-point unit

Guenter Gerwig; Holger Wetter; Eric M. Schwarz; Juergen Haess; Christopher A. Krygowski; Bruce M. Fleischer; Michael Kroener

The floating-point unit (FPU) of the IBM z990 eServer™ is the first one in an IBM mainframe with a fused multiply-add dataflow. It also represents the first time that an SRT divide algorithm (named after Sweeney, Robertson, and Tocher, who independently proposed the algorithm) was used in an IBM mainframe. The FPU supports dual architectures: the zSeries® hexadecimal floating-point architecture and the IEEE 754 binary floating-point architecture. Six floating-point formats, including short, long, and extended operands, are supported in hardware. The throughput of this FPU is one multiply-add operation per cycle. The instructions are executed in five pipeline steps, and there are multiple provisions to avoid stalls in case of data dependencies. It is able to handle denormalized input operands and denormalized results without a stall (except for architectural program exceptions). It has a new extended-precision divide and square-root dataflow. This dataflow uses a radix-4 SRT algorithm (radix-2 for square root) and is able to handle divides and square-root operations in multiple floating-point and fixed-point formats. For fixed-point divisions, a new mechanism improves the performance by using an algorithm with which the number of divide iterations depends on the effective number of quotient bits.


International Solid-State Circuits Conference | 2006

4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER6 processor

Brian W. Curran; B. McCredie; Leon J. Sigal; Eric M. Schwarz; Bruce M. Fleischer; Yuen H. Chan; D. Webber; M. Vaden; A. Goyal

A 1-pipe stage, low-latency, 13 FO4, 64b fixed-point execution unit, implemented in a 65nm SOI CMOS process, allows back-to-back execution of data dependent adds, subtracts, compares, shifts, rotates, and logical operations. A 7-pipe stage, 91 FO4, double-precision floating-point unit allows forwarding of dependent results after 6 cycles in most cases


International Solid-State Circuits Conference | 2011

A 4R2W register file for a 2.3GHz wire-speed POWER™ processor with double-pumped write operation

Gary S. Ditlow; Robert K. Montoye; Salvatore N. Storino; Sherman M. Dance; Sebastian Ehrenreich; Bruce M. Fleischer; Thomas W. Fox; Kyle M. Holmes; Junichi Mihara; Yutaka Nakamura; Shohji Onishi; Robert Shearer; Dieter Wendel; Leland Chang

In multi-ported register files, memory cell size grows quadratically with the total number of ports due to wordline and bitline wiring. Reducing the number of physical access ports in a memory cell can thus lead to significant area and power savings as well as latency improvement. Double-pumped register files operate access ports twice in a single clock period to reduce area by halving the number of physical ports in the memory cell, a technique often confined to low-frequency applications. Replication of a memory cell in separate arrays halves the number of physical read ports in each copy. In this work, double-pumped write ports and replicated read ports are applied to a 4R2W register file in a high-performance microprocessor product [1]. This paper describes detailed implementation and measured hardware characteristics of this array and demonstrates a fast error correction scheme. The techniques used balance high efficiency and low latency and thus differ from previous work, in which double-pumped ports perform a write followed by a read in a very large register file [2] or where write ports are double-pumped without cell-level read port reduction [3].


European Solid-State Circuits Conference | 2006

A 5GHz+ 128-bit Binary Floating-Point Adder for the POWER6 Processor

Xiao Yan Yu; Yiu-Hing Chan; Brian W. Curran; Eric M. Schwarz; Michael R. Kelly; Bruce M. Fleischer

A fast 128-bit end-around carry adder is designed and fabricated as part of the POWER6 floating-point unit in a 65nm SOI process technology. Efficient use of static circuits and careful balance of the look-ahead tree enable our floating point design to operate beyond 5GHz with 1.1 V supply


Custom Integrated Circuits Conference | 2009

64-bit prefix adders: Power-efficient topologies and design solutions

Ching Zhou; Bruce M. Fleischer; Michael Karl Gschwind; Ruchir Puri

64-bit adders of various prefix algorithms are designed using a novel dataflow synthesis methodology. Our synthesis methodology offers robust adder solutions typically used for high-performance microprocessor needs. We have analyzed the power-performance tradeoffs for a portfolio of popular adder topologies and design styles. In particular, the intrinsically sparser designs in the hierarchical prefix scheme are demonstrated to be preferable choices for both high-performance and low-power adder applications.


IEEE Transactions on Computers | 2013

Low-Cost Concurrent Error Detection for Floating-Point Unit (FPU) Controllers

Michail Maniatakos; Prabhakar Kudva; Bruce M. Fleischer; Yiorgos Makris

We present a nonintrusive concurrent error detection (CED) method for protecting the control logic of a contemporary floating-point unit (FPU). The proposed method is based on the observation that control logic errors lead to extensive data path corruption and affect, with high probability, the exponent part of the IEEE-754 floating-point representation. Thus, exponent monitoring can be utilized to detect errors in the control logic of the FPU. Predicting the exponent involves relatively simple operations; therefore, our method incurs significantly lower overhead than the classical approach of duplicating the control logic of the FPU. Indeed, experimental results on the OpenSPARC T1 processor using SPEC2006FP benchmarks show that as compared to control logic duplication, which incurs an area overhead of 17.9 percent of the FPU size, our method incurs an area overhead of only 5.8 percent yet still achieves detection of over 93 percent of transient errors in the FPU control logic. Moreover, the proposed method offers the ancillary benefit of also detecting 98.1 percent of the data path errors that affect the exponent, which cannot be detected via duplication of control logic. Finally, when combined with a classical residue code-based method for the fraction, our method leads to a complete CED solution for the entire FPU which provides a coverage of 94.1 percent of all errors at an area cost of 16.32 percent of the FPU size.


VLSI Test Symposium | 2011

Exponent monitoring for low-cost concurrent error detection in FPU control logic

Michail Maniatakos; Yiorgos Makris; Prabhakar Kudva; Bruce M. Fleischer


International Conference on Computer Design | 2016

A statistical critical path monitor in 14nm CMOS

Bruce M. Fleischer; Christos Vezyrtzis; Karthik Balakrishnan; Keith A. Jenkins
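
As a rough illustration of the exponent-monitoring idea in the concurrent error detection abstract above (Maniatakos et al.), the sketch below predicts the exponent of a floating-point product from the operand exponents alone and flags a result whose exponent falls outside the predicted range. It is a software analogy, not the authors' hardware method: the function names `predicted_exponents` and `exponent_check` are hypothetical, only multiplication is covered, and the operands and result are assumed to be nonzero, finite, and free of overflow or underflow.

```python
import math


def predicted_exponents(a: float, b: float) -> set:
    """Exponents (math.frexp convention) that a correct product a*b can have.

    Assumes a and b are nonzero, finite, and that a*b neither overflows
    nor underflows.
    """
    _, ea = math.frexp(a)
    _, eb = math.frexp(b)
    # frexp mantissas have magnitude in [0.5, 1), so the product of two
    # mantissas has magnitude in [0.25, 1). After normalization the result
    # exponent is therefore either ea + eb or ea + eb - 1.
    return {ea + eb - 1, ea + eb}


def exponent_check(a: float, b: float, result: float) -> bool:
    """Return True if the exponent of `result` is consistent with a*b."""
    _, er = math.frexp(result)
    return er in predicted_exponents(a, b)


if __name__ == "__main__":
    a, b = 3.0, 5.0
    good = a * b                # correct product
    bad_exp = good * 2.0 ** 10  # simulated error that corrupts the exponent
    bad_frac = 14.0             # simulated error confined to the fraction
    print(exponent_check(a, b, good))
    print(exponent_check(a, b, bad_exp))
    print(exponent_check(a, b, bad_frac))
```

The last case passes the check even though the value is wrong, because the error leaves the exponent untouched. This is consistent with the abstract's point that exponent monitoring is paired with a residue code for the fraction to cover the entire FPU.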
