Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Akira Katsuno is active.

Publication


Featured research published by Akira Katsuno.


Expert Systems With Applications | 1995

Parallel processor which processes instructions, after a branch instruction, in parallel prior to executing the branch instruction and its parallel processing method

Akira Katsuno

In a typical operating system, one-third of a program consists of branch instructions, so the performance of a processor running such software depends greatly on whether instructions before and after a branch can be executed in parallel. To provide a high-performance processor with parallel processing, the disclosed structure has a plurality of operating units and a plurality of registers in which a set of registers is specified by the same address. The selection sequence of the registers is held in a plurality of selection sequence storages, and whether the contents of each register are determined is recorded in a plurality of determination identification storages. A register selector specifies a register according to the contents of the selection sequence storages and also updates those contents. A determination identifier rewrites the contents of the determination identification storages when the contents of a register prove to be a correct result.
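
The mechanism is easier to see in software form. Below is a minimal Python sketch of the idea (class and method names are illustrative, not taken from the patent): several physical register copies share one architectural address, a selection sequence tracks which copy is newest, and a per-copy flag records whether its value has been determined.

    # Illustrative software model (not the patented hardware) of a register file
    # in which several physical copies share one architectural address.
    class SpeculativeRegisterFile:
        def __init__(self, num_addresses, copies_per_address=2):
            # physical register values: values[addr][copy]
            self.values = [[0] * copies_per_address for _ in range(num_addresses)]
            # determination identification storage: True once a copy's value is final
            self.determined = [[True] + [False] * (copies_per_address - 1)
                               for _ in range(num_addresses)]
            # selection sequence storage: order of copies, newest first
            self.sequence = [list(range(copies_per_address)) for _ in range(num_addresses)]

        def write_speculative(self, addr, value):
            """Write a result produced past an unresolved branch into a fresh copy.
            (The sketch assumes at most copies_per_address - 1 unconfirmed writes.)"""
            seq = self.sequence[addr]
            copy = seq[-1]                              # reuse the oldest copy
            self.values[addr][copy] = value
            self.determined[addr][copy] = False
            self.sequence[addr] = [copy] + seq[:-1]     # update the selection sequence

        def read(self, addr):
            """Return the newest determined value for an architectural address."""
            for copy in self.sequence[addr]:
                if self.determined[addr][copy]:
                    return self.values[addr][copy]
            raise RuntimeError("no determined value yet")

        def confirm(self, addr):
            """Determination identifier: mark the newest copy as a correct result."""
            copy = self.sequence[addr][0]
            self.determined[addr][copy] = True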


IEEE Computer Society International Conference | 1995

Microarchitecture of HaL's CPU

Niteen A. Patkar; Akira Katsuno; Simon Li; Tak Maruyama; Sunil Savkar; Mike Simone; Gene Shen; Ravi Swami; Deforest W Tovey

The HaL PM1 CPU is the first implementation of the 64-bit SPARC Version 9 instruction set architecture. The processor utilizes superscalar instruction issue, register renaming, and a dataflow model of execution. Instructions can complete out-of-order and are later committed in order. The PM1 CPU maintains precise state. The processor has a higher level of reliability than is currently available in desktop computers for the commercial marketplace.
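
The combination of out-of-order completion, in-order commit, and precise state is commonly implemented with a reorder buffer. The Python sketch below is a generic textbook illustration of that combination (an assumed structure, not HaL's actual design): results may arrive in any order, but architectural state is only updated from the head of the buffer, in program order.

    from collections import deque

    class ReorderBuffer:
        def __init__(self):
            self.entries = deque()          # instructions in program order
            self.arch_state = {}            # committed (precise) register state

        def dispatch(self, tag, dest_reg):
            self.entries.append({"tag": tag, "dest": dest_reg,
                                 "value": None, "done": False})

        def complete(self, tag, value):
            """Execution units report results in any (dataflow) order."""
            for e in self.entries:
                if e["tag"] == tag:
                    e["value"], e["done"] = value, True
                    return

        def commit(self):
            """Retire finished instructions strictly in program order."""
            while self.entries and self.entries[0]["done"]:
                e = self.entries.popleft()
                self.arch_state[e["dest"]] = e["value"]

    rob = ReorderBuffer()
    rob.dispatch("i1", "r1"); rob.dispatch("i2", "r2")
    rob.complete("i2", 7)      # i2 finishes first...
    rob.commit()               # ...but nothing retires until i1 is done
    rob.complete("i1", 3)
    rob.commit()               # now both retire, in program order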


High-Performance Computer Architecture | 2003

Microarchitecture and performance analysis of a SPARC-V9 microprocessor for enterprise server systems

Mariko Sakamoto; Akira Katsuno; Aiichiro Inoue; Takeo Asakawa; Haruhiko Ueno; Kuniki Morita; Yasunori Kimura

We developed a 1.3-GHz SPARC-V9 processor, the SPARC64 V, designed to address the requirements of enterprise servers and high-performance computing. Processing speed under multiuser interactive workloads is very sensitive to system balance because of the large number of memory requests involved. Drawing on many years of experience with such workloads in mainframe system development, we placed importance on designing a well-balanced communication structure. To accomplish this, a system-level performance study must begin at an early phase, so we developed a performance model, consisting of a detailed processor model and a detailed memory model, before hardware design started, and we updated it continuously. Once a logic simulator became available, we used it to verify the performance model and improve its accuracy. The model was quite effective in enabling us to achieve our performance goals and finish development quickly. This paper describes the SPARC64 V microarchitecture and the performance analyses used for hardware design.
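
A minimal sketch of the kind of combined processor/memory performance model the abstract describes is shown below; the formula and all parameter values are generic textbook choices, not the authors' actual model or measurements.

    def cycles_per_instruction(core_cpi, accesses_per_instr, miss_rate, miss_penalty):
        """Core pipeline CPI plus average memory stall cycles per instruction."""
        memory_stall = accesses_per_instr * miss_rate * miss_penalty
        return core_cpi + memory_stall

    # Example: a well-balanced system keeps the memory term from dominating.
    # Hypothetical numbers: 0.4 memory accesses/instruction, 2% miss rate,
    # 200-cycle miss penalty.
    cpi = cycles_per_instruction(core_cpi=1.0, accesses_per_instr=0.4,
                                 miss_rate=0.02, miss_penalty=200)
    print(cpi)   # 1.0 + 0.4 * 0.02 * 200 = 2.6 cycles per instruction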


International Conference on Computer Design | 1990

A 64-bit floating-point processing unit with a horizontal instruction code for parallel operations

Akira Katsuno; Hiromasa Takahashi; Hajime Kubosawa; Tomio Sato; Atsuhiro Suga; Gensuke Goto

A full 64-bit floating-point processing unit (FPU) with a long horizontal instruction code for parallel operations without pipeline interlock is described. The FPU is implemented on a 1.0-µm CMOS chip containing 300 K transistors and operating at 25 MHz. It runs at a peak rate of 50 MFLOPS and a sustained rate of 15.4 MFLOPS. The register-to-register latencies of double- and single-precision addition, subtraction, and multiplication are 120 ns each. The latency of double-precision division is 640 ns, and that of square root is 880 ns.
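
A quick arithmetic check on the figures quoted in the abstract (my own back-of-the-envelope reading, not from the paper) shows how they relate to the 25-MHz clock:

    clock_hz = 25e6                       # 25 MHz clock
    cycle_ns = 1e9 / clock_hz             # 40 ns per cycle

    peak_mflops = 50                      # quoted peak rate
    ops_per_cycle = peak_mflops * 1e6 / clock_hz
    print(ops_per_cycle)                  # 2.0 -> two parallel operations per cycle,
                                          #        consistent with the horizontal code

    print(120 / cycle_ns)                 # add/sub/mul latency: 120 ns = 3 cycles
    print(640 / cycle_ns)                 # double-precision divide: 640 ns = 16 cycles
    print(880 / cycle_ns)                 # square root: 880 ns = 22 cycles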


Proceedings Euro ASIC '92 | 1992

A 64-bit floating point processing unit for a RISC microprocessor

Hajime Kubosawa; Akira Katsuno; Hiromasa Takahashi; Tomio Sato; Atsuhiro Suga; Gensuke Goto

This paper describes the architecture, layout, and simulation methodology of a high-performance 64-bit floating-point processing unit (FPU) applicable to a RISC microprocessor. The FPU contains a floating-point execution unit and a floating-point controller for the SPARC S-25 microprocessor, and supports the SPARC floating-point instructions based on the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985). The operating frequency is 25 MHz, and peak floating-point performance is 12.5 MFLOPS when used with the S-25 SPARC microprocessor. The chip was designed using 0.8-µm CMOS standard-cell technology; it measures 16.4 × 16.4 mm, is packaged in a 179-pin PGA, and contains approximately 330,000 transistors.
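
The FPU implements IEEE 754 binary arithmetic. As a reminder of the binary64 format such a unit operates on, the standard-library snippet below (unrelated to the chip itself) decodes a double into its sign, exponent, and fraction fields:

    import struct

    def decode_binary64(x):
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        sign     = bits >> 63
        exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
        fraction = bits & ((1 << 52) - 1)        # 52-bit fraction
        return sign, exponent, fraction

    print(decode_binary64(-1.5))   # (1, 1023, 2251799813685248)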


IEICE Transactions on Electronics | 2007

A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

Mariko Sakamoto; Akira Katsuno; Go Sugizaki; Toshio Yoshida; Aiichiro Inoue; Koji Inoue; Kazuaki Murakami

Conventional large-scale snoop-based SMP systems use broadcast and synchronization techniques for cache coherence control, and the synchronization penalty grows in direct proportion to system size. Meanwhile, advances in LSI technology now make it possible to place a memory controller on the CPU die, drastically reducing the latency of accesses to directly attached memory. Building an enterprise server system from such CPUs is therefore an opportunity to achieve higher performance, but because the synchronization penalty is paid on every cache miss, the coherence method itself must be improved to receive the full benefit. In this paper, we present a coherence directory organization suited to DSM enterprise server systems. Directory-based methods were originally adopted in high-performance computing systems because they scale far better than snoop-based methods; their main problems are directory capacity misses and long directory access latency. The more relaxed scalability requirements of enterprise servers, together with advanced LSI technology, allow us to solve both problems by implementing a full-bit-vector map of the coherence directory on an LSI chip. Our experimental results show that a system controlled by the proposed directory can surpass a snoop-based system in performance on an online transaction processing (OLTP) workload, even without data localization optimization.
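
The sketch below illustrates a generic full-bit-vector coherence directory in the spirit of the paper (a textbook organization, not the authors' chip design): one presence bit per node and a dirty flag per memory block, so misses involve only the recorded sharers rather than a system-wide broadcast.

    class DirectoryEntry:
        def __init__(self, num_nodes):
            self.presence = [False] * num_nodes   # full bit vector: which caches hold the block
            self.dirty = False                    # True if one cache holds it modified

    class Directory:
        def __init__(self, num_blocks, num_nodes):
            self.entries = [DirectoryEntry(num_nodes) for _ in range(num_blocks)]

        def read_miss(self, block, node):
            """Only sharers recorded in the vector are involved; no global broadcast."""
            e = self.entries[block]
            if e.dirty:
                owner = e.presence.index(True)    # fetch latest data from the single owner
                e.dirty = False
            e.presence[node] = True               # record the new sharer

        def write_miss(self, block, node):
            e = self.entries[block]
            invalidate = [n for n, p in enumerate(e.presence) if p and n != node]
            e.presence = [False] * len(e.presence)
            e.presence[node] = True
            e.dirty = True
            return invalidate                     # point-to-point invalidations, not a broadcast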


Archive | 2006

Autonomous control apparatus, autonomous control method, and autonomous control program

Kenji Morimoto; Motomitsu Adachi; Youji Kohda; Satoshi Tsuchiya; Akira Katsuno


Archive | 1989

Floating point operation unit in division and square root operations

Akira Katsuno


Archive | 2004

System layout design program, system layout design apparatus, and system layout design method for automatically configuring systems

Kenji Morimoto; Akira Katsuno; Satoshi Kuroyanagi; Junichi Kawano; Shugo Ono


Archive | 1994

Semiconductor memory device having a plurality of writing and reading ports for decreasing hardware amount

Akira Katsuno

Collaboration


Dive into Akira Katsuno's collaborations.

Top Co-Authors


Yuji Wada

St. Marianna University School of Medicine
