Publication


Featured research published by Martin S. Schmookler.


Symposium on Computer Arithmetic | 2001

Leading zero anticipation and detection-a comparison of methods

Martin S. Schmookler; Kevin J. Nowka

Design of the leading zero anticipator (LZA) or detector (LZD) is pivotal to the normalization of results for addition and fused multiply-add in high-performance floating-point processors. This paper formalizes the analysis and describes some alternative organizations and implementations from the known art. It shows how design choices often depend on the overall design of the addition unit, on how subtraction is handled when the exponents are equal, and on how the possible one-bit error of the LZA is detected and corrected.
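
As an illustration of the detection side only, here is a minimal behavioral model in Python of a divide-and-conquer leading-zero detector, checked against an exact reference. The operand width and test values are assumptions of this sketch, and the anticipator circuits themselves are not modeled.

```python
# A behavioral model of a leading-zero detector (LZD) for a fixed word
# width, using the divide-and-conquer scheme a hardware LZD tree mirrors:
# test whether the upper half is all zeros, then recurse into the half
# that holds the leading one.  This only illustrates the "detection" side
# discussed in the paper; the anticipator (LZA) is not modeled here.

WIDTH = 64  # assumed operand width for this sketch

def lzd(x: int, width: int = WIDTH) -> int:
    """Return the number of leading zeros of x within `width` bits."""
    assert 0 <= x < (1 << width)
    if width == 1:
        return 0 if x else 1
    half = width // 2
    hi = x >> half
    if hi:                       # leading one is in the upper half
        return lzd(hi, half)
    return half + lzd(x & ((1 << half) - 1), half)

# Check against an exact reference based on Python's bit_length().
for value in (0, 1, 0x8000_0000_0000_0000, 0x0000_1234_5678_9ABC):
    assert lzd(value) == WIDTH - value.bit_length()
```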


IEEE Transactions on Computers | 1971

High Speed Decimal Addition

Martin S. Schmookler; Arnold Weinberger

Parallel decimal arithmetic capability is becoming increasingly attractive with new applications of computers in a multi-programming environment. The direct production of decimal sums offers a significant improvement in addition over methods requiring decimal correction. These techniques are illustrated in the eight-digit adder which appears in the System/360 Model 195.
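
For context, here is a behavioral sketch in Python of packed-BCD addition using the separate decimal-correction step (add 6 whenever a digit sum exceeds 9) that direct-sum adders such as the one in the Model 195 avoid. The eight-digit width follows the abstract, but everything else is an illustrative assumption rather than the paper's design.

```python
# A behavioral sketch of 8-digit BCD addition, assuming packed BCD operands
# (one decimal digit per 4-bit nibble).  This models the straightforward
# "add, then apply a decimal correction of +6 when a digit sum exceeds 9"
# scheme; the paper's contribution is producing the decimal sum directly,
# without a separate correction pass.

DIGITS = 8

def bcd_add(a: int, b: int) -> tuple[int, int]:
    """Add two packed-BCD numbers digit by digit; return (sum, carry_out)."""
    result = 0
    carry = 0
    for i in range(DIGITS):
        da = (a >> (4 * i)) & 0xF
        db = (b >> (4 * i)) & 0xF
        s = da + db + carry
        if s > 9:                # decimal correction step
            s += 6               # skip the six unused binary codes
            carry = 1
        else:
            carry = 0
        result |= (s & 0xF) << (4 * i)
    return result, carry

# Example: 12345678 + 87654329 = 100000007 -> BCD sum 00000007, carry out 1
total, cout = bcd_add(0x12345678, 0x87654329)
assert (total, cout) == (0x00000007, 1)
```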


Symposium on Computer Arithmetic | 2007

P6 Binary Floating-Point Unit

Son Dao Trong; Martin S. Schmookler; Eric M. Schwarz; Michael Kroener

The floating-point unit of the next-generation PowerPC is detailed. It has been tested at over 5 GHz. The design supports an extremely aggressive cycle time of 13 FO4, a technology-independent measure. For most dependent instructions, its fused multiply-add dataflow has only 6 effective pipeline stages, nearly equivalent to its predecessor, the Power5, even though its technology-independent frequency has increased by over 70%. Overall, the frequency has improved by over 100%. It achieves this high performance through aggressive feedback paths, circuit design, and layout. The pipeline has 7 stages, but data may be fed back to dependent operations prior to rounding and complete normalization. Division and square root algorithms are also described which take advantage of high-precision linear approximation hardware for obtaining a reciprocal or reciprocal square root approximation.
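
As a rough illustration of the linear-approximation idea (not the P6 hardware), the following Python sketch builds a small table of slope and intercept values over the mantissa range [1, 2) and produces a reciprocal estimate with a single multiply-add. The table size and accuracy check are assumptions of this sketch; a real unit would pick table and coefficient widths for its target accuracy and refine the estimate in its multiply-add dataflow.

```python
# An illustrative table-driven piecewise-linear reciprocal approximation.
# The mantissa range [1, 2) is split into SEGMENTS equal intervals; each
# interval stores the midpoint, the value of 1/x there, and the slope, and
# the estimate is one multiply-add.

SEGMENTS = 64   # assumed table size for this sketch

# Precompute (midpoint, intercept, slope): 1/x ~= 1/m - (x - m)/m**2
TABLE = []
for k in range(SEGMENTS):
    m = 1.0 + (k + 0.5) / SEGMENTS
    TABLE.append((m, 1.0 / m, -1.0 / (m * m)))

def recip_estimate(x: float) -> float:
    """Linear estimate of 1/x for x in [1, 2)."""
    assert 1.0 <= x < 2.0
    m, c0, c1 = TABLE[int((x - 1.0) * SEGMENTS)]
    return c0 + c1 * (x - m)

# The estimate here is accurate to better than 2**-12 and would be refined
# further (e.g., by Newton-Raphson steps) in the multiply-add dataflow.
worst = max(abs(recip_estimate(1.0 + i / 4096.0) * (1.0 + i / 4096.0) - 1.0)
            for i in range(4096))
assert worst < 2 ** -12
```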


IEEE Transactions on Computers | 2005

FPU implementations with denormalized numbers

Eric M. Schwarz; Martin S. Schmookler; Son Dao Trong

Denormalized numbers are the most difficult type of numbers to implement in floating-point units. They are so complex that certain designs have elected to handle them in software rather than in hardware. Traps to software can result in long execution times, rendering denormalized numbers useless to programmers. This does not have to happen. With a small amount of additional hardware, denormalized numbers and underflows can be handled close to the speed of normalized numbers. This paper summarizes the little-known techniques for handling denormalized numbers. Most of the techniques described here appear only in filed or pending patent applications.
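
To make the underlying difficulty concrete, here is a short Python decoder for IEEE 754 single precision showing how denormalized (subnormal) operands differ from normalized ones: an all-zero exponent field, no implicit leading 1, and a fixed effective exponent of 1 minus the bias. The decoder only illustrates the format, not any hardware scheme from the paper.

```python
# Decoder for IEEE 754 single precision, highlighting denormalized numbers.
# It is the missing implicit leading 1 that forces extra normalization work
# inside an FPU dataflow.
import struct

def decode_float32(bits: int) -> float:
    sign = -1.0 if (bits >> 31) & 1 else 1.0
    exp_field = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    bias = 127
    if exp_field == 0:                       # denormalized (or zero)
        return sign * (frac / 2**23) * 2.0**(1 - bias)
    if exp_field == 0xFF:                    # infinity / NaN, not handled here
        raise ValueError("inf/NaN")
    return sign * (1 + frac / 2**23) * 2.0**(exp_field - bias)

for pattern in (0x00000001,   # smallest positive subnormal, 2**-149
                0x007FFFFF,   # largest subnormal
                0x00800000,   # smallest positive normalized number, 2**-126
                0x3F800000):  # 1.0
    reference = struct.unpack(">f", pattern.to_bytes(4, "big"))[0]
    assert decode_float32(pattern) == reference
```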


Symposium on Computer Arithmetic | 1999

A low-power, high-speed implementation of a PowerPC™ microprocessor vector extension

Martin S. Schmookler; M. Putrino; C. Roth; M. Sharma; A. Mather; J. Tyler; H. Van Nguyen; M.N. Pham; J. Lent

The AltiVec™ technology is an extension to the PowerPC™ architecture which provides new computational and storage operations for handling vectors of various data lengths and data types. The first implementation using this technology is a low-cost, low-power processor based on the acclaimed PowerPC 750™ microprocessor. This paper describes the microarchitecture and design of the vector arithmetic unit of this implementation.
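
As a behavioral illustration (not the hardware design) of the kind of element-wise operation such a vector unit executes, the Python sketch below models a saturating add across the sixteen unsigned bytes of a 128-bit vector register, in the spirit of AltiVec's saturating integer add instructions.

```python
# A behavioral model of a lane-wise saturating add on one 128-bit vector
# register holding sixteen unsigned bytes; results clamp at 255 rather
# than wrapping around.

def vadd_u8_saturate(a: bytes, b: bytes) -> bytes:
    """Lane-wise unsigned byte add with saturation."""
    assert len(a) == len(b) == 16          # one 128-bit vector register
    return bytes(min(x + y, 255) for x, y in zip(a, b))

va = bytes(range(16))                      # 0, 1, ..., 15
vb = bytes([250] * 16)
assert vadd_u8_saturate(va, vb) == bytes([min(i + 250, 255) for i in range(16)])
```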


IBM Journal of Research and Development | 1980

Design of large ALUs using multiple PLA macros

Martin S. Schmookler

This paper describes methods of designing large Arithmetic and Logical Units (ALUs) using multiple Programmable Logic Array (PLA) macros in which the outputs are obtained in one cycle corresponding to one pass through any PLA. The design is based on the well-known technique of providing conditional sums and group carries in parallel and selecting the proper sum using gating circuits. The PLA for each group of bits uses an adder design published by Weinberger in which each bit of the sum is formed from the EXCLUSIVE-OR of two outputs of the OR array. By placing the gating circuits in front of the EXCLUSIVE-OR circuits, the sums can be obtained using two OR-array outputs for each bit and one additional OR-array output for each internal string of bits. The paper also discusses how ALUs containing more than two groups can obtain the group carries using a separate carry-look-ahead PLA macro and how this macro can be compressed by using special decoders and special physical design layout techniques. Additionally, the paper demonstrates how the PLAs can be used to provide detection of overflow and of zero results, and to provide Boolean operations.
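
The conditional-sum idea is easy to model in software. The Python sketch below computes, for each group, both possible sums (assuming a carry-in of 0 and of 1) and then lets the resolved group carry select the correct one. The 32-bit width and four-bit groups are assumptions of this sketch, not parameters from the paper, and the carry resolution here is sequential rather than the paper's carry-look-ahead PLA.

```python
# A behavioral model of the conditional-sum technique: each group forms
# two candidate sums in parallel, and the real group carry gates the
# proper candidate into the result.

GROUP_BITS = 4
GROUPS = 8           # a 32-bit ALU built from 8 four-bit groups (assumed)

def conditional_sum_add(a: int, b: int) -> tuple[int, int]:
    mask = (1 << GROUP_BITS) - 1
    # Step 1: per group, both conditional sums and both conditional carries.
    cond = []
    for g in range(GROUPS):
        ga = (a >> (g * GROUP_BITS)) & mask
        gb = (b >> (g * GROUP_BITS)) & mask
        s0, s1 = ga + gb, ga + gb + 1                    # carry-in 0 / 1
        cond.append(((s0 & mask, s0 >> GROUP_BITS),
                     (s1 & mask, s1 >> GROUP_BITS)))
    # Step 2: resolve group carries (a carry-look-ahead PLA in the paper),
    # then gate the proper conditional sum into the result.
    result, carry = 0, 0
    for g in range(GROUPS):
        s, cout = cond[g][carry]
        result |= s << (g * GROUP_BITS)
        carry = cout
    return result, carry

assert conditional_sum_add(0xFFFFFFFF, 1) == (0, 1)
assert conditional_sum_add(0x12345678, 0x0FEDCBA8) == (0x22222220, 0)
```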


Symposium on Computer Arithmetic | 1999

Series approximation methods for divide and square root in the Power3™ processor

Ramesh C. Agarwal; Fred G. Gustavson; Martin S. Schmookler

The Power3 processor is a 64-bit implementation of the PowerPC™ architecture and is the successor to the Power2™ processor for workstations and servers which require high performance floating point capability. The previous processors used Newton-Raphson algorithms for their implementations of divide and square root. The Power3 processor has a longer pipeline latency, which would substantially increase the latency for these instructions. Instead, new algorithms based on power series approximations were developed which provide significantly better performance than the Newton-Raphson algorithm for this processor. This paper describes the algorithms, and then shows how both the series based algorithms and the Newton-Raphson algorithms are affected by pipeline length. For the Power3, the power series algorithms reduce the divide latency by over 20% and the square root latency by 35%.
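
A numerical sketch in Python (ordinary floats, not the Power3 dataflow) contrasts the two refinement styles: with e = 1 - b*y0 for an initial estimate y0 of 1/b, the series form evaluates y0*(1 + e + e^2 + ...), whose terms offer more parallelism than the strictly sequential Newton-Raphson recurrence y <- y*(2 - b*y). The seed value and term counts below are illustrative assumptions.

```python
# Contrasting the two refinement styles for a reciprocal, starting from a
# rough estimate y0 ~ 1/b.  The Newton-Raphson recurrence is strictly
# sequential (each step depends on the previous one), while the truncated
# series 1/b = y0*(1 + e + e**2 + ...) exposes independent terms, which
# matters when pipeline latency is long.

def recip_newton(b: float, y0: float, iters: int = 2) -> float:
    y = y0
    for _ in range(iters):          # each step depends on the previous one
        y = y * (2.0 - b * y)       # error is roughly squared per step
    return y

def recip_series(b: float, y0: float, terms: int = 4) -> float:
    e = 1.0 - b * y0                # residual of the initial estimate
    poly, power = 1.0, 1.0
    for _ in range(terms):          # 1 + e + e**2 + ... (truncated)
        power *= e
        poly += power
    return y0 * poly

b = 1.7
y0 = 0.6                            # crude seed; |e| = |1 - b*y0| = 0.02
for approx in (recip_newton(b, y0), recip_series(b, y0)):
    assert abs(approx * b - 1.0) < 1e-6
```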


Symposium on Computer Arithmetic | 2003

Hardware implementations of denormalized numbers

Eric M. Schwarz; Martin S. Schmookler; Son Dao Trong

Denormalized numbers are the most difficult type of numbers to implement in floating-point units. They are so complex that some designs have elected to handle them in software rather than in hardware. This has resulted in execution times in the tens of thousands of cycles, which has made denormalized numbers useless to programmers. This does not have to happen. With a small amount of additional hardware, denormalized numbers and underflows can be handled close to the speed of normalized numbers. We summarize the little-known techniques for handling denormalized numbers, most of which have appeared only in filed or pending patent applications.
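
One approach commonly described for hardware handling of denormalized operands, sketched below in Python for single precision, is to normalize the operand into an internal representation with a wider exponent range so the rest of the dataflow sees only normalized significands. The internal format here is an assumption for illustration, not necessarily the scheme of this paper.

```python
# Prenormalizing a denormalized float32 operand into an assumed internal
# (24-bit significand, unbiased exponent) pair whose exponent may go below
# the architected minimum of -126.

FRAC_BITS = 23
BIAS = 127

def to_internal(bits: int) -> tuple[int, int]:
    """Map a finite float32 bit pattern to (24-bit significand, unbiased exp)."""
    exp_field = (bits >> FRAC_BITS) & 0xFF
    frac = bits & ((1 << FRAC_BITS) - 1)
    assert exp_field != 0xFF, "inf/NaN not handled in this sketch"
    if exp_field != 0:                              # normalized input
        return (1 << FRAC_BITS) | frac, exp_field - BIAS
    if frac == 0:                                   # zero
        return 0, 0
    # Denormalized input: shift the leading 1 up to the hidden-bit position
    # and let the internal exponent drop below the architected minimum.
    shift = FRAC_BITS + 1 - frac.bit_length()
    return frac << shift, (1 - BIAS) - shift

# Smallest positive subnormal (2**-149) becomes 1.0... * 2**-149 internally.
sig, exp = to_internal(0x00000001)
assert sig == 1 << FRAC_BITS and exp == -149
```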


IEEE Transactions on Computers | 1968

High-Speed Binary-to-Decimal Conversion

Martin S. Schmookler

This note describes several methods of performing fast, efficient binary-to-decimal conversion. With a modest amount of circuitry, an order of magnitude speed improvement is obtained. This achievement offers a unique advantage to general-purpose computers requiring special hardware to translate between binary and decimal numbering systems.
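
For background, the classic shift-and-add-3 (double-dabble) binary-to-BCD conversion is sketched below in Python. It is the textbook one-bit-per-step method and is shown only to frame the problem; it is not presented as the note's method.

```python
# The classic shift-and-add-3 (double-dabble) binary-to-BCD conversion,
# modeled in Python.

def binary_to_bcd(value: int, digits: int) -> int:
    """Convert an unsigned integer to packed BCD with `digits` nibbles."""
    bcd = 0
    for i in range(value.bit_length() - 1, -1, -1):
        # Add 3 to every BCD digit that is 5 or more, so that the following
        # left shift (doubling) carries correctly into the next digit.
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> i) & 1)   # shift in the next binary bit
    return bcd & ((1 << (4 * digits)) - 1)

assert binary_to_bcd(0, 4) == 0x0000
assert binary_to_bcd(1968, 4) == 0x1968
assert binary_to_bcd(65535, 5) == 0x65535
```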


IEEE Transactions on Computers | 1969

On Mod-2 Sums of Products

Martin S. Schmookler

An algorithm is provided for obtaining any Boolean function as a modulo-2 sum of products containing only uncomplemented variables. The proofs, which verify the algorithm and show the uniqueness of the results, are simple and free of special mathematical notation.
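
The expansion described, a modulo-2 sum of products containing only uncomplemented variables, is what is now usually called the positive-polarity Reed-Muller (algebraic normal form) expansion. Below is a small Python sketch of one standard way to obtain its coefficients from a truth table, an in-place XOR butterfly; it is shown for illustration and is not necessarily the note's algorithm.

```python
# Compute positive-polarity Reed-Muller coefficients from a truth table via
# the GF(2) subset-sum (Moebius) transform.

def reed_muller_coefficients(truth_table: list[int]) -> list[int]:
    """truth_table[m] = f(x) where bit i of m gives variable x_i.

    Returns c such that f(x) = XOR of c[m] over all m with (m & x) == m,
    i.e. c[m] is the coefficient of the product of the variables set in m.
    """
    c = truth_table[:]                      # work on a copy
    n = len(c).bit_length() - 1             # number of variables
    assert len(c) == 1 << n
    for i in range(n):
        for m in range(len(c)):
            if m & (1 << i):                # fold each variable in turn
                c[m] ^= c[m ^ (1 << i)]
    return c

# Example: f(x1, x0) = x1 OR x0, truth table indexed by (x1 x0) = 00,01,10,11.
# OR = x0 XOR x1 XOR x0*x1, so the coefficients for {}, {x0}, {x1}, {x0,x1}
# are 0, 1, 1, 1.
assert reed_muller_coefficients([0, 1, 1, 1]) == [0, 1, 1, 1]
```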
