Kisaburo Nakazawa
University of Tsukuba
Publications
Featured research published by Kisaburo Nakazawa.
conference on high performance computing (supercomputing) | 1992
Kisaburo Nakazawa; Hiroshi Nakamura; Hiromitsu Imori; Shun Kawabe
The authors present a novel architecture for a high-speed pseudo vector processor based on a superscalar pipeline. Without using cache memory, the proposed architecture overcomes the penalty of memory access latency by introducing register windows with register preloading and a pipelined memory. One outstanding feature of the proposed architecture is that it is upwardly compatible with existing scalar architectures. Performance evaluation using the Livermore Loop Kernels shows more than 6 times higher performance than a conventional superscalar processor, and 1.2 times higher performance than a hypothetical extended model with cache prefetching, at a memory access latency of 20 CPU clock cycles. List vectors are also handled effectively in a similar architecture.
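The key mechanism here, issuing loads well ahead of their use so that a pipelined memory can return operands just in time, can be sketched in plain C. This is only an illustration under assumed names and an assumed preload distance; the actual design does this with architectural register windows and preload instructions rather than source-level temporaries.

```c
/* Minimal sketch, not the paper's design: mimic register preloading by
 * fetching operands D iterations ahead into local "window" variables that a
 * compiler would keep in registers. D and all names are illustrative. */
#define D 4   /* assumed preload distance: roughly latency / loop iteration time */

void axpy_preload(long n, double a, const double *x, double *y)
{
    double xw[D], yw[D];                 /* stand-ins for a preloaded register window */
    long prime = (n < D) ? n : D;

    for (long i = 0; i < prime; i++) {   /* prime the window */
        xw[i] = x[i];
        yw[i] = y[i];
    }
    for (long i = 0; i < n; i++) {
        double xi = xw[i % D];           /* operand preloaded D iterations ago */
        double yi = yw[i % D];
        if (i + D < n) {                 /* reuse the slot: preload element i+D */
            xw[i % D] = x[i + D];
            yw[i % D] = y[i + D];
        }
        y[i] = a * xi + yi;              /* computation overlaps with outstanding preloads */
    }
}
```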
international conference on supercomputing | 1997
Taisuke Boku; K. Itakura; Hiroshi Nakamura; Kisaburo Nakazawa
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processor with 2048 processing units built at the Center for Computational Physics, University of Tsukuba. It has an MIMD architecture with a distributed memory system. The node processor of CP-PACS is a RISC microprocessor enhanced with a Pseudo Vector Processing feature, which realizes high-performance vector processing. The interconnection network is a 3-dimensional Hyper-Crossbar Network, which offers high flexibility and embeddability for various network topologies and communication patterns. The theoretical peak performance of the whole system is 614.4 GFLOPS. In this paper, we give an overview of the CP-PACS architecture and several of its special architectural characteristics. We then describe performance evaluations of both the single node processor and the parallel system, based on LINPACK and the Kernel CG of the NAS Parallel Benchmarks. Through these evaluations, the effectiveness of Pseudo Vector Processing and the Hyper-Crossbar Network is shown.
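As a point of reference, the quoted system figures imply a per-node peak of

$$\frac{614.4\ \text{GFLOPS}}{2048\ \text{processing units}} = 300\ \text{MFLOPS per node}.$$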
international conference on supercomputing | 1993
Hiroshi Nakamura; Taisuke Boku; Hideo Wada; Hiromitsu Imori; Ikuo Nakata; Yasuhiro Inagami; Kisaburo Nakazawa; Yoshiyuki Yamashita
In this paper, we present a new scalar architecture for high-speed vector processing. Without using cache memory, the proposed architecture tolerates main memory access latency by introducing slide-windowed floating-point registers with a data preloading feature and a pipelined memory. The architecture maintains upward compatibility with existing scalar architectures. In the new architecture, software can control the window structure; this is an advantage over our previous register-window design. Because of this flexibility, registers are utilized more effectively and computational efficiency is greatly enhanced. Furthermore, it helps the compiler generate efficient object code easily. We have evaluated its performance on the Livermore Fortran Kernels. The results show that the proposed architecture reduces the penalty of main memory access better than an ordinary scalar processor or a processor with cache prefetching. The proposed architecture with 64 registers tolerates a memory access latency of 30 CPU cycles. Compared with our previous work, it hides a longer memory access latency with fewer registers.
parallel computing | 1999
Kisaburo Nakazawa; Hiroshi Nakamura; Taisuke Boku; Ikuo Nakata; Yoshiyuki Yamashita
Computational Physics by Parallel Array Computer System (CP-PACS) is a massively parallel processor developed and in full operation at the Center for Computational Physics at the University of Tsukuba. It is an MIMD machine with distributed memory, equipped with 2048 processing units and 128 GB of main memory. The theoretical peak performance of CP-PACS is 614.4 Gflops. CP-PACS achieved 368.2 Gflops on the Linpack benchmark in 1996, at that time the fastest rating in the world. CP-PACS has two remarkable features: the Pseudo Vector Processing feature (PVP-SW) on each node processor, which performs high-speed vector processing on a single-chip superscalar microprocessor, and the 3-dimensional Hyper-Crossbar (3-D HXB) interconnection network, which provides high-speed, flexible communication among node processors. In this article, we present an overview of CP-PACS, the architectural topics, some details of the hardware and support software, and several performance results.
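A quick check of the quoted figures puts the Linpack result at roughly 60% of the theoretical peak:

$$\frac{368.2\ \text{Gflops}}{614.4\ \text{Gflops}} \approx 0.599 \approx 60\%.$$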
asia and south pacific design automation conference | 1997
Takayuki Morimoto; Kazushi Saito; Hiroshi Nakamura; Taisuke Boku; Kisaburo Nakazawa
In order to design advanced processors in a short time, designers must simulate their designs and reflect the results back into the designs at the very early stages. However, conventional hardware description languages (HDLs) lack the ability to describe designs easily and accurately at these stages. We have therefore proposed a new HDL called AIDL (Architecture- and Implementation-level Description Language). In this paper, in order to evaluate the effectiveness of AIDL, we describe and compare three processors in both AIDL and VHDL.
Proceedings of IEEE International Symposium on Parallel Algorithms Architecture Synthesis | 1997
Taisuke Boku; Hiroshi Nakamura; Kisaburo Nakazawa; Y. Iwasaki
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processor with 2048 processing units, built at the Center for Computational Physics, University of Tsukuba, Japan. The node processor of CP-PACS is a RISC microprocessor enhanced by pseudo vector processing, which realizes high-performance vector processing. The interconnection network is a 3-dimensional Hyper-Crossbar Network, which has high flexibility and embeddability for various network topologies and communication patterns. The theoretical peak performance of the whole system is 614.4 GFLOPS. We present an overview of the CP-PACS architecture and several of its special architectural characteristics. A performance evaluation on the parallel LINPACK benchmark is also shown.
hawaii international conference on system sciences | 1994
Hiroshi Nakamura; Hiromitsu Imori; Yoshiyuki Yamashita; Kisaburo Nakazawa; Taisuke Boku; H. Li; Ikuo Nakata
We present a new scalar processor for high-speed vector processing and its evaluation. The proposed processor hides long main memory access latency by introducing slide-windowed floating-point registers with a data preloading feature and a pipelined memory. Owing to the slide-window structure, the proposed processor can utilize more floating-point registers while maintaining upward compatibility with existing scalar architectures. We have evaluated its performance on the Livermore Fortran Kernels. The results show that the proposed processor drastically reduces the penalty of main memory access compared with an ordinary scalar processor. For example, with 96 registers it hides a memory access latency of 70 CPU cycles when the main memory throughput is 8 bytes/cycle. From these results, we conclude that the proposed architecture is well suited to high-speed vector processing.
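As a back-of-envelope reading of this result (our estimate, not a figure from the paper): at a throughput of 8 bytes/cycle, the memory can supply one 8-byte operand per cycle, so hiding a 70-cycle latency requires roughly

$$70\ \text{cycles} \times 1\ \text{operand/cycle} = 70\ \text{operands in flight},$$

which is consistent with most of the 96 floating-point registers holding preloaded data while the remainder serve the computation.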
Proceedings Innovative Architecture for Future Generation High-Performance Processors and Systems | 1997
Hiroshi Nakamura; K. Itakura; Masazumi Matsubara; Taisuke Boku; Kisaburo Nakazawa
CP-PACS is a massively parallel processor (MPP) for large-scale scientific computations. In September 1996, CP-PACS, equipped with 2048 processors, began operation at the University of Tsukuba. At that time, CP-PACS was the fastest MPP in the world on the LINPACK benchmark. CP-PACS was designed to achieve very high performance in large scientific/engineering applications. It is well known that an ordinary data cache is not effective in such applications, because the data size is much larger than the cache size and there is little temporal locality. Thus, a special mechanism for hiding long memory access latency is indispensable. Cache prefetching is a well-known technique for this purpose. In addition to cache prefetching, the CP-PACS node processors implement a register preloading mechanism. This mechanism enables the processor to transfer the required floating-point data directly between main memory and the floating-point registers (not via the data cache) in a pipelined way. We compare register preloading with cache prefetching by measuring the real performance of the CP-PACS processor and the HP PA-8000 processor, which implement cache prefetching and/or register preloading.
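For contrast with the register-preloading sketch shown after the 1992 entry above, a cache-prefetching version of the same loop might look roughly as follows. `__builtin_prefetch` is a GCC/Clang builtin used here as a generic stand-in for a prefetch instruction, and the distance PF_DIST is an assumption, not a value from the paper; the essential difference is that prefetched data still flows through the data cache, whereas register preloading bypasses it.

```c
/* Hedged sketch: cache prefetching for an axpy-style loop. Data is hinted
 * into the cache PF_DIST iterations ahead; it still arrives via the cache,
 * unlike register preloading. Names and the distance are illustrative. */
#define PF_DIST 8

void axpy_prefetch(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n) {                          /* stay within the arrays */
            __builtin_prefetch(&x[i + PF_DIST], 0, 0);  /* read hint, low temporal locality */
            __builtin_prefetch(&y[i + PF_DIST], 1, 0);  /* write hint, low temporal locality */
        }
        y[i] = a * x[i] + y[i];
    }
}
```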
ieee region 10 conference | 1994
Hiroshi Nakamura; T. Wakabayashi; Kisaburo Nakazawa; Taisuke Boku; Hideo Wada; Yasuhiro Inagami
We present two scalar processors, called PVP-SWPC and PVP-SWSW, for high-speed list vector processing. Memory access latency must be tolerated for this purpose. PVP-SWPC tolerates the latency by introducing slide-windowed floating-point registers and a prefetch-to-cache instruction; PVP-SWSW tolerates it by introducing slide-windowed general and floating-point registers. Owing to the slide-window structure, both processors can utilize more registers while maintaining upward compatibility with existing scalar architectures. The evaluation shows that these processors successfully hide memory latency and realize fast list vector processing.
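"List vector" here refers to indirectly addressed vector accesses, where element addresses come from an index list, so each data load depends on a preceding index load. A minimal illustration of the access pattern (names are ours, not the paper's):

```c
/* Hedged sketch of a list vector (gather) loop: x is accessed through the
 * index array "list", so the address of x[list[i]] is unknown until list[i]
 * itself has been loaded. Hiding latency therefore needs two stages of
 * preloading or prefetching: indices first, then data. */
void gather_axpy(long n, double a, const int *list, const double *x, double *y)
{
    for (long i = 0; i < n; i++) {
        y[i] += a * x[list[i]];   /* dependent loads: list[i], then x[list[i]] */
    }
}
```

On our reading of the abstract, PVP-SWPC would prefetch the indirectly addressed x elements into the cache, while PVP-SWSW would preload indices into general registers and data into floating-point registers; this interpretation is not stated explicitly in the paper.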
ieee international conference on high performance computing data and analytics | 1997
Y. Abei; K. Itakura; Taisuke Boku; Hiroshi Nakamura; Kisaburo Nakazawa
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processing system with 2048 node processors for large-scale scientific calculations. Each node processor of CP-PACS has a special hardware feature called PVP-SW (Pseudo Vector Processor based on Slide Window), which realizes efficient vector processing on a superscalar processor without depending on the cache. The authors demonstrate the effectiveness of PVP-SW through performance measurements on a single node processor for the LINPACK benchmark. Utilizing loop unrolling techniques and the block-TLB feature, the PVP-SW function improves the basic performance by up to a factor of 3.5 for 1000×1000 LINPACK. This performance corresponds to 73% of the theoretical peak.
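Read against the per-node peak implied by the system figures elsewhere in this list (614.4 GFLOPS / 2048 = 300 MFLOPS), 73% of the theoretical peak corresponds to roughly

$$0.73 \times 300\ \text{MFLOPS} \approx 219\ \text{MFLOPS}$$

per node on 1000×1000 LINPACK; this figure is our arithmetic on the quoted percentages, not a number stated in the abstract.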