Naoka Yano
Toshiba
Publication
Featured research published by Naoka Yano.
International Solid-State Circuits Conference | 2005
Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; P. Hofstee; Gilles Gervais; Roy Moonseuk Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; H. Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano
The design of a 4-way SIMD streaming data processor emphasizes achievable performance in area and power. Software controls data movement and instruction flow, and improves data bandwidth and pipeline utilization. The micro-architecture minimizes instruction latency and provides fine-grain clock control to reduce power.
IEEE Journal of Solid-state Circuits | 2006
Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; Hwa-Joon Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; Vandung To; E. Iwata
This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing fine-grain clock control to reduce power.
IEEE Journal of Solid-state Circuits | 2006
Hwa Joon Oh; Silvia Melitta Mueller; Christian Jacobi; Kevin D. Tran; Scott R. Cottier; Brad W. Michael; Hiroo Nishikawa; Yonetaro Totsuka; Tatsuya Namatame; Naoka Yano; Takashi Machida; Sang Hoo Dhong
The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two different latencies, six-cycle and seven-cycle, with 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply-add operations. Since exact rounding, exceptions, and de-norm number handling are not important to multimedia applications, IEEE correctness on the single-precision floating-point numbers is sacrificed for performance and simple design. It employs fine-grained clock gating for power saving. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operation has been observed up to 5.6 GHz at 1.4 V and 56°C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are codesigned to meet the performance, power, and area goals.
Symposium on Computer Arithmetic | 2005
Silvia Melitta Mueller; Christian Jacobi; Hwa-Joon Oh; Kevin D. Tran; Scott R. Cottier; Brad W. Michael; Hiroo Nishikawa; Yonetaro Totsuka; Tatsuya Namatame; Naoka Yano; Takashi Machida; Sang Hoo Dhong
The floating-point unit in the synergistic processor element of the 1st generation multi-core CELL processor is described. The FPU supports 4-way SIMD single precision and integer operations and 2-way SIMD double precision operations. The design required high frequency, low latency, and power and area efficiency, with primary application to multimedia streaming workloads such as 3D graphics. The FPU has 3 different latencies, optimizing the performance-critical single precision FMA operations, which are executed with a 6-cycle latency at an 11 FO4 cycle time. The latency includes the global forwarding of the result. These challenging performance, power, and area goals were achieved through the co-design of architecture and implementation with optimizations at all levels of the design. This paper focuses on the logical and algorithmic aspects of the FPU we developed to achieve these goals.
Symposium on VLSI Circuits | 2005
Hwa Joon Oh; Silvia Melitta Mueller; Christian Jacobi; Kevin D. Tran; Scott R. Cottier; Brad W. Michael; Hiroo Nishikawa; Yonetaro Totsuka; Tatsuya Namatame; Naoka Yano; Takashi Machida; Sang Hoo Dhong
The floating point unit in the synergistic processor element of a CELL processor is a fully-pipelined 4-way SIMD unit designed to accelerate media and data streaming. It supports 32-bit single-precision floating point and 16-bit integer operands with two different latencies, optimizing the performance of critical single-precision multiply-add operations. It employs fine-grained clock gating for power saving. Architecture, logic, circuits and integration are co-designed to meet the performance, power, and area goals.
IEEE Journal of Solid-state Circuits | 1996
Hiroaki Murakami; Naoka Yano; Yukio Ootaguro; Yukio Sugeno; M. Ueno; Y. Muroya; T. Aramaki
This paper describes a high-speed and area-effective multiplier-accumulator for an embedded RISC processor. The point is to utilize a multiplier array and the Booth's encoder twice in a cycle. This multiplier-accumulator can execute one multiply-add operation (32-bit multiplication followed by 64-bit addition) per cycle at 50 MHz. The area is 2.35 mm² with 0.4 µm CMOS technology.
IBM Journal of Research and Development | 2007
Brian Flachs; S. Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Moonseuk Kim; T. N. Le; P. Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; H.-J. Oh; Stefan Mueller; Osamu Takahashi; K. Hirairi; A. Kawasumi; H. Murakami; H. Noro; S. Onishi; J. Pille; J. Silberman; S. Yong; A. Hatakeyama; Y. Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; V. To; Eiji Iwata
This paper describes the architecture and implementation of the original gaming-oriented synergistic processor element (SPE) in both 90-nm and 65-nm silicon-on-insulator (SOI) technology and introduces a new SPE implementation targeted for the high-performance computing community. The Cell Broadband Engine™ processor contains eight SPEs. The dual-issue, four-way single-instruction multiple-data processor is designed to achieve high performance per area and power and is optimized to process streaming data, simulate physical phenomena, and render objects digitally. Most aspects of data movement and instruction flow are controlled by software to improve the performance of the memory system and the core performance density. The SPE was designed as an 11-FO4 (fan-out-of-4-inverter-delay) processor using 20.9 million transistors within 14.8 mm² using the IBM 90-nm SOI low-k process. CMOS (complementary metal-oxide semiconductor) static gates implement the majority of the logic. Dynamic circuits are used in critical areas and occupy 19% of the non-static random access memory (SRAM) area. Instruction set architecture, microarchitecture, and physical implementation are tightly coupled to achieve a compact and power-efficient design. Correct operation has been observed at up to 5.6 GHz and 7.3 GHz, respectively, in 90-nm and 65-nm SOI technology.
Symposium on VLSI Circuits | 2007
Kiyoji Ueno; Hiroaki Murakami; Naoka Yano; Ryubi Okuda; Toshihiko Himeno; Takayuki Kamei; Yukihiro Urakawa
A 7.07 mm² synthesizable streaming processing unit (SPU) is fabricated in a 65 nm CMOS technology with 8 copper layers. It is migrated from its original custom design to a synthesizable design to gain design portability. New features are a new floor plan, a height-optimized standard cell library, local clock generator cloning, and adaptive wire width control. Its logic area is 30% smaller than that of the full-custom SPU in the same process generation. Correct functional operation is achieved at 4 GHz at 1.4 V.
Archive | 1998
Naoka Yano; Naoyuki Tamura
Archive | 2003
Naoka Yano