Silvia Melitta Mueller

Publication

Featured research published by Silvia Melitta Mueller.

international solid-state circuits conference | 2005

A streaming processing unit for a CELL processor

Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; P. Hofstee; Gilles Gervais; Roy Moonseuk Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; H. Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano

The design of a 4-way SIMD streaming data processor emphasizes achievable performance per area and power. Software controls data movement and instruction flow, and improves data bandwidth and pipeline utilization. The micro-architecture minimizes instruction latency and provides fine-grain clock control to reduce power.

IEEE Journal of Solid-state Circuits | 2006

The microarchitecture of the synergistic processor for a cell processor

Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; Hwa-Joon Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; Vandung To; E. Iwata

This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI-low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing fine-grain clock control to reduce power.

IEEE Journal of Solid-state Circuits | 2006

A fully pipelined single-precision floating-point unit in the synergistic processor element of a CELL processor

Hwa Joon Oh; Silvia Melitta Mueller; Christian Jacobi; Kevin D. Tran; Scott R. Cottier; Brad W. Michael; Hiroo Nishikawa; Yonetaro Totsuka; Tatsuya Namatame; Naoka Yano; Takashi Machida; Sang Hoo Dhong

The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two different latencies, six-cycle and seven-cycle, with 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply-add operations. Since exact rounding, exceptions, and denormal number handling are not important to multimedia applications, IEEE correctness on the single-precision floating-point numbers is sacrificed for performance and simple design. It employs fine-grained clock gating for power saving. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operations have been observed up to 5.6 GHz with 1.4 V and 56 °C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are codesigned to meet the performance, power, and area goals.
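The 44.8 GFlops figure is consistent with a simple peak-rate calculation, sketched below. The abstract does not spell out the derivation, so the assumption here is that each of the four SIMD lanes completes one multiply-add per cycle and that a multiply-add counts as two floating-point operations. The function name `peak_gflops` is illustrative only.

```python
def peak_gflops(clock_ghz, simd_lanes, flops_per_lane_per_cycle):
    """Peak throughput in GFlops for a fully pipelined SIMD unit.

    Assumes one instruction is issued every cycle with no stalls.
    """
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle


# Assumed accounting: 4 lanes, multiply-add counted as 2 flops, 5.6 GHz clock.
spe_fpu_peak = peak_gflops(clock_ghz=5.6, simd_lanes=4, flops_per_lane_per_cycle=2)
print(f"{spe_fpu_peak:.1f} GFlops")
```

Under these assumptions the product 5.6 × 4 × 2 matches the reported peak.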

symposium on computer arithmetic | 2005

The vector floating-point unit in a synergistic processor element of a CELL processor

Silvia Melitta Mueller; Christian Jacobi; Hwa-Joon Oh; Kevin D. Tran; Scott R. Cottier; Brad W. Michael; Hiroo Nishikawa; Yonetaro Totsuka; Tatsuya Namatame; Naoka Yano; Takashi Machida; Sang Hoo Dhong

The floating-point unit in the synergistic processor element of the first-generation multi-core CELL processor is described. The FPU supports 4-way SIMD single-precision and integer operations and 2-way SIMD double-precision operations. The design required high frequency, low latency, and power and area efficiency, with primary application to multimedia streaming workloads such as 3D graphics. The FPU has 3 different latencies, optimizing the performance-critical single-precision FMA operations, which are executed with a 6-cycle latency at an 11 FO4 cycle time. The latency includes the global forwarding of the result. These challenging performance, power, and area goals were achieved through the co-design of architecture and implementation with optimizations at all levels of the design. This paper focuses on the logical and algorithmic aspects of the FPU that we developed to achieve these goals.

Integration | 2000

A dual precision IEEE floating-point multiplier

Guy Even; Silvia Melitta Mueller; Peter-Michael Seidel

A new algorithm for computing IEEE-compliant rounding is presented, called injection-based rounding. Injection-based rounding is simple and facilitates using the same rounding circuitry for different precisions. We demonstrate the usefulness of injection-based rounding in a design of an IEEE floating-point multiplier capable of performing either a double-precision multiplication or a single-precision multiplication. The multiplier is designed to minimize hardware cost by using only a half-sized multiplication array and by sharing the rounding circuitry for both precisions. The latency of the multiplier is two clock cycles in single precision and three clock cycles in double precision, where each pipeline stage contains roughly 15 logic levels.
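The general idea behind rounding by injection can be illustrated on unsigned integers: a mode-dependent constant is added to the exact value, the result is truncated, and a small fix-up handles ties for round-to-nearest-even. The sketch below is an assumption-laden illustration of that idea, not the circuit from the paper. The mode names (`RZ`, `RNE`, `RI`) and both function names are hypothetical, and `k` (the number of discarded low-order bits, at least 1) stands in for the precision.

```python
def injection(mode, k):
    """Constant added before truncating away the k low-order bits."""
    if mode == "RZ":            # round toward zero: plain truncation
        return 0
    if mode == "RNE":           # round to nearest: inject half a unit
        return 1 << (k - 1)
    if mode == "RI":            # round away from zero: inject just under one unit
        return (1 << k) - 1
    raise ValueError(f"unknown rounding mode: {mode}")


def round_by_injection(x, k, mode):
    """Round the non-negative integer x to a multiple of 2**k, returned as x / 2**k."""
    q = (x + injection(mode, k)) >> k
    discarded = x & ((1 << k) - 1)
    if mode == "RNE" and discarded == (1 << (k - 1)):
        q &= ~1                 # exact tie: force the result to be even
    return q


# 10 / 4 = 2.5 is a tie; 9 / 4 = 2.25 is not.
print(round_by_injection(10, 2, "RNE"), round_by_injection(9, 2, "RI"))
```

Because only the injection constant and the shift amount depend on `k`, the same add-then-truncate path serves every precision, which is the property the abstract highlights.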

international conference on information systems security | 1997

A dual mode IEEE multiplier

Guy Even; Silvia Melitta Mueller; Peter-Michael Seidel

We present an IEEE floating-point multiplier capable of performing either a double-precision multiplication or a single-precision multiplication. In single precision the latency is two clock cycles, and in double precision the latency is three clock cycles, where each pipeline stage contains roughly fifteen logic levels. A single-precision multiplication can be followed immediately by another multiplication of either single or double precision. A double-precision multiplication requires one stall cycle; that is, two cycles after issuing a double-precision multiplication, a new multiplication of either precision can be issued. Therefore, the throughput in single precision is one multiplication per clock cycle, and the throughput in double precision is one multiplication per two clock cycles. Hardware cost is reduced by using only a half-sized multiplication array and by sharing the rounding circuitry for both precisions.
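The issue rules in this abstract can be turned into a small timing model. The sketch below assumes that a result becomes available exactly `latency` cycles after issue and that operations are issued back to back as early as the rules allow. The table names and the `schedule` function are illustrative and do not come from the paper.

```python
# Figures taken from the abstract: latency in cycles, and the number of
# cycles before the next multiplication may be issued.
LATENCY = {"single": 2, "double": 3}
ISSUE_INTERVAL = {"single": 1, "double": 2}


def schedule(ops):
    """Return (precision, issue_cycle, result_cycle) for each multiplication."""
    cycle = 0
    timeline = []
    for op in ops:
        timeline.append((op, cycle, cycle + LATENCY[op]))
        cycle += ISSUE_INTERVAL[op]   # a double-precision op costs one stall cycle
    return timeline


for entry in schedule(["single", "double", "single"]):
    print(entry)
```

In this model a stream of single-precision multiplications issues every cycle, while a stream of double-precision multiplications issues every second cycle, matching the stated throughputs.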

symposium on computer arithmetic | 2011

The IBM zEnterprise-196 Decimal Floating-Point Accelerator

Steven R. Carlough; Adam B. Collura; Silvia Melitta Mueller; Michael Kroener

Decimal floating-point arithmetic is widely used in commercial computing applications, such as financial transactions, where rounding errors prevent the use of binary floating-point operations. The revised IEEE Standard for Floating-Point Arithmetic (IEEE-754-2008) defined standardized decimal floating-point (DFP) formats. As more software applications adopt the IEEE decimal floating-point standard, hardware accelerators that support it are becoming more prevalent. This paper describes the second-generation decimal floating-point accelerator implemented on the IBM zEnterprise-196 processor. The 4-cycle deep pipeline was designed to optimize the latency of fixed-point decimal operations while significantly improving the bandwidth of DFP operations. A detailed description of the unit and a comparison to previous implementations found in the literature are provided.
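The rounding problem that motivates decimal arithmetic is easy to reproduce in software. The sketch below uses Python's standard `decimal` module purely as an illustration. It is a software library and says nothing about how the zEnterprise-196 accelerator works internally. The variable names are arbitrary.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the sum picks up a small representation error.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# Decimal arithmetic represents these amounts exactly.
decimal_sum = Decimal("0.10") + Decimal("0.20")
decimal_exact = (decimal_sum == Decimal("0.30"))

print(binary_exact, decimal_exact)
```

The binary comparison fails while the decimal one succeeds, which is the kind of discrepancy that financial applications cannot tolerate.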

symposium on vlsi circuits | 2005

A fully-pipelined single-precision floating point unit in the synergistic processor element of a CELL processor

The floating point unit in the synergistic processor element of a CELL processor is a fully-pipelined 4-way SIMD unit designed to accelerate media and data streaming. It supports 32-bit single-precision floating point and 16-bit integer operands with two different latencies, optimizing the performance of critical single-precision multiply-add operations. It employs fine-grained clock gating for power saving. Architecture, logic, circuits and integration are co-designed to meet the performance, power, and area goals.

symposium on computer arithmetic | 2009

Advanced Clockgating Schemes for Fused-Multiply-Add-Type Floating-Point Units

Jochen Preiss; Maarten J. Boersma; Silvia Melitta Mueller

The paper introduces fine-grain clockgating schemes for fused multiply-add-type floating-point units (FPU). The clockgating is based on instruction type, precision, and operand values. The presented schemes focus on reducing the power at peak performance, where each FPU stage is used in nearly every cycle and conventional schemes have little impact on the power consumption. Depending on the instruction mix, the schemes allow turning off 18% to 74% of the register bits. Even for the worst-case instruction, 18% to 37% of the FPU is shut down, depending on the data patterns.

Archive | 2000

Proving the Correctness of Processors with Delayed Branch using Delayed PC

Silvia Melitta Mueller; Wolfgang J. Paul; Daniel Kroening

We show that the programming model of delayed branch is equivalent to what we call delayed PC: all instruction fetches are delayed by one instruction, not just taken branches. This leads to a very simple new implementation of the delayed branch mechanism. We then prove the correctness of a pipelined machine with delayed PC.
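The claimed equivalence can be checked on a toy machine. The sketch below is only an illustration under stated assumptions: the three-opcode instruction set (`nop`, `jmp`, `halt`), word addressing, unconditional jumps, and the names `dpc` and `pc_next` are all hypothetical and not taken from the paper. The first function models a delayed branch with a single delay slot; the second fetches from a delayed program counter that always lags one instruction behind.

```python
def run_delayed_branch(prog, max_steps=50):
    """Delayed-branch model: the instruction after a jump always executes."""
    pc, pending, trace = 0, None, []
    for _ in range(max_steps):
        op = prog[pc]
        trace.append(pc)
        if op[0] == "halt":
            break
        nxt = pending if pending is not None else pc + 1
        pending = op[1] if op[0] == "jmp" else None   # target takes effect one step later
        pc = nxt
    return trace


def run_delayed_pc(prog, max_steps=50):
    """Delayed-PC model: every fetch uses dpc, which lags pc_next by one step."""
    dpc, pc_next, trace = 0, 1, []
    for _ in range(max_steps):
        op = prog[dpc]
        trace.append(dpc)
        if op[0] == "halt":
            break
        target = op[1] if op[0] == "jmp" else pc_next + 1
        dpc, pc_next = pc_next, target                # uniform update, no special case
    return trace


# Address 1 is the delay slot of the jump at 0; address 2 is skipped.
program = [("jmp", 3), ("nop",), ("nop",), ("halt",)]
print(run_delayed_branch(program), run_delayed_pc(program))
```

Both models execute the same sequence of addresses on this program. The delayed-PC version needs no pending-target bookkeeping, which hints at why the abstract calls the resulting implementation very simple.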
