Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Koki Okabe is active.

Publication


Featured research published by Koki Okabe.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Performance evaluation of NEC SX-9 using real science and engineering applications

Takashi Soga; Akihiro Musa; Youichi Shimomura; Ryusuke Egawa; Ken’ichi Itakura; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

This paper describes a new-generation vector parallel supercomputer, the NEC SX-9 system. The SX-9 processor has an outstanding core achieving over 100 Gflop/s, and a software-controllable on-chip cache to keep the ratio of memory bandwidth to floating-point operation rate high. Moreover, its large SMP nodes of 16 vector processors, with 1.6 Tflop/s performance and 1 TB of memory, are connected by dedicated network switches that achieve inter-node communication at 128 GB/s per direction. The sustained performance of the SX-9 processor is evaluated using six practical applications, in comparison with conventional vector processors and the latest scalar processors such as Nehalem-EP. Based on the results, this paper discusses performance tuning strategies for new-generation vector systems. An SX-9 system of 16 nodes is also evaluated using the HPC Challenge benchmark suite and a CFD code. These evaluation results demonstrate the high sustained performance and scalability of the SX-9 system.
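To make the bandwidth-to-flop ratio concrete, consider a triad kernel of the kind such evaluations stress. This is a minimal illustrative sketch, not code from the paper; it counts the bytes and flops per iteration to show why a vectorizable loop like this is limited by memory bandwidth rather than by peak arithmetic rate.

```c
#include <stddef.h>

/* Illustrative triad kernel: a[i] = b[i] + s * c[i].
 * Per iteration: 2 flops (one multiply, one add) and 24 bytes of
 * traffic (two 8-byte loads, one 8-byte store), i.e. a demand of
 * 12 B/FLOP. On a machine supplying only 2-4 B/FLOP, sustained
 * performance on such loops is set by memory bandwidth, not by the
 * peak flop/s of the vector pipes. */
void triad(double *a, const double *b, const double *c,
           double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```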


International Symposium on Parallel and Distributed Processing and Applications | 2008

Effects of MSHR and Prefetch Mechanisms on an On-Chip Cache of the Vector Architecture

Akihiro Musa; Yoshiei Sato; Takashi Soga; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

Vector supercomputers have been encountering the memory wall problem, and their memory bandwidth per flop/s has decreased. To compensate for the insufficient memory bandwidth per flop/s, an on-chip vector cache has been proposed for vector processors. Although vector caching is effective in increasing sustained performance to a certain degree, it still needs supporting mechanisms in software and hardware to realize its potential. To this end, we propose miss status handling registers (MSHRs) and a prefetch mechanism. This paper evaluates the performance of the vector cache with the MSHRs and the prefetch mechanism on the vector supercomputer across three leading scientific applications. The MSHR is an effective mechanism for handling subsequent vector loads of the same data, which frequently appear in difference schemes. The experimental results indicate that the MSHRs can improve the computational performance of scientific applications by 1.45×. Moreover, we examine the performance of the prefetch mechanism on the vector cache. The prefetch mechanism increases computational performance by 1.6×. Accordingly, MSHRs and prefetching are very effective optimization options for vector caching in future vector supercomputers, even if those supercomputers cannot maintain the current memory bandwidth per flop/s rate.
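As a rough illustration of what MSHRs do, the sketch below models miss handling for cache lines: a load that misses on a line already being fetched is merged into the existing entry instead of issuing a second off-chip request. The table structure and all names are hypothetical simplifications of the mechanism described in the abstract.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR 16

/* Hypothetical MSHR table: each entry tracks one in-flight
 * cache-line fill. */
typedef struct {
    bool     valid;
    uint64_t line_addr;   /* address of the line being fetched */
    int      merged;      /* loads merged into this fill        */
} mshr_entry;

static mshr_entry mshr[NUM_MSHR];

/* Returns true if a new off-chip request must be issued for the
 * missed line; false if the miss was merged with an in-flight fill
 * (the case that accelerates repeated loads of the same data). */
bool handle_miss(uint64_t line_addr)
{
    for (int i = 0; i < NUM_MSHR; i++) {
        if (mshr[i].valid && mshr[i].line_addr == line_addr) {
            mshr[i].merged++;     /* secondary miss: no new request */
            return false;
        }
    }
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].valid) {     /* primary miss: allocate an entry */
            mshr[i] = (mshr_entry){ true, line_addr, 0 };
            return true;
        }
    }
    return true; /* table full: in hardware the access would stall */
}
```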


Memory Performance: Dealing with Applications, Systems and Architecture | 2007

An on-chip cache design for vector processors

Akihiro Musa; Yoshiei Sato; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

This paper discusses the potential of an on-chip cache memory for modern vector supercomputers. Vector supercomputers can achieve high computational efficiency for compute-intensive scientific applications. The most important factor affecting computational performance is a memory bandwidth high enough to provide a sufficient amount of data to the rich arithmetic units in time; modern vector supercomputers such as the NEC SX-7 and SX-8 provide 4 bytes per flop (4 B/FLOP) as the ratio of memory bandwidth to floating-point operation rate. However, the performance gap between memory and processors in high performance computing has widened year by year, so it is getting harder to keep the 4 B/FLOP memory bandwidth in the design of future vector supercomputers. As a promising way to compensate for the reduced memory bandwidth of the vector load/store units of future vector supercomputers, we design an on-chip vector cache for the NEC SX vector processor architecture. This paper evaluates the performance of the on-chip cache memory system on the SX-7 system with 2 B/FLOP or lower memory bandwidth across two kernel loops and five leading scientific applications. The results of the kernel loops demonstrate that a 2 B/FLOP memory system with an on-chip cache whose hit ratio is 50% can achieve performance comparable to that of a 4 B/FLOP system without the cache. The results for four of the applications indicate that the on-chip cache can improve their sustained performance by 20% to 98%. The experimental results for the last application show a conflicting effect of loop unrolling with vector caching, resulting in a poor hit rate. However, when loop unrolling is disabled, the cache hit rate improves, and sustained performance comparable to that of the 4 B/FLOP memory bandwidth without loop unrolling is obtained. In addition, selective caching, in which only data with high locality of reference are cached, is also effective for efficient use of the limited cache capacity.
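The 50% hit ratio result can be checked with a simple traffic argument (a back-of-the-envelope sketch, assuming the on-chip cache itself can supply data at the full 4 B/FLOP demand):

$$ B_{\mathrm{offchip}} = (1 - h)\, B_{\mathrm{demand}}, \qquad h = 0.5,\; B_{\mathrm{demand}} = 4\,\mathrm{B/FLOP} \;\Rightarrow\; B_{\mathrm{offchip}} = 2\,\mathrm{B/FLOP}. $$

That is, with half of the references served on chip, a 2 B/FLOP memory system sees the same off-chip traffic per flop as a cacheless 4 B/FLOP system, which matches the kernel-loop result reported above.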


Archive | 2008

The Potential of On-Chip Memory Systems for Future Vector Architectures

Hiroaki Kobayashi; Akihiro Musa; Yoshiei Sato; Hiroyuki Takizawa; Koki Okabe

The most advantageous feature of modern vector systems is their outstanding memory performance compared to scalar systems. This feature gives them their high sustained system performance when executing real application codes, which are extensively used in the fields of advanced science and engineering [9],[10],[1]. However, recent trends in semiconductor technology generate a strong headwind for vector systems. Thanks to the historical growth rate in on-chip silicon budget known as Moore's law, processor performance in terms of flop/s has increased remarkably, but memory performance has not kept up [2]. For vector systems, the bytes/flop rate, which expresses the balance between flop/s performance and memory bandwidth, has gone down from 8 B/flop in 1998 to 4 in 2003 and to 2 in 2007. We have pointed out that reducing the memory bandwidth seriously affects sustained system performance even in the case of vector systems [3], although their absolute performance increases to a certain degree. Memory performance is definitely one of the key points in the design of future highly efficient vector architectures if they are to survive in the era of multi-core processors.
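For reference, the bytes/flop rate quoted here is simply the ratio of a processor's memory bandwidth to its peak floating-point rate:

$$ \mathrm{B/F} = \frac{\text{memory bandwidth } [\mathrm{bytes/s}]}{\text{peak performance } [\mathrm{flop/s}]} $$

The cited halving from 8 to 4 to 2 B/flop thus means that peak flop/s grew roughly twice as fast as memory bandwidth over each of those intervals.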


Archive | 2009

First Experiences with NEC SX-9

Hiroaki Kobayashi; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Akihiro Musa; Takashi Soga; Youichi Shimomura

This paper presents the new supercomputer system NEC SX-9 that was installed at Tohoku University in March 2008. The performance of the system is evaluated using six real application codes. The experimental results indicate that the SX-9 system achieves a speedup of up to 7 over our previous NEC SX-7 system in single-CPU sustained performance. In addition, the paper examines the effects of an on-chip vector cache named ADB on the performance, and confirms performance increases of between 20% and 70% through selective caching on the ADB.
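Selective caching on the ADB is driven by compiler directives that mark which arrays should be buffered on chip. The fragment below is a hypothetical C sketch in the style of NEC's cdir directives; the exact pragma spelling and the loop are assumptions for illustration, not code from the paper.

```c
/* Hypothetical example of selective caching on the SX-9 ADB.
 * Only the coefficient array, which is re-read on every row, is
 * marked for the on-chip buffer; the streaming arrays bypass it so
 * they do not evict the reusable data. The pragma spelling follows
 * NEC's cdir-style directives but should be checked against the
 * compiler manual. */
void apply_rows(double *out, const double *in, const double *coef,
                int rows, int n)
{
#pragma cdir on_adb(coef)
    for (int j = 0; j < rows; j++)
        for (int i = 0; i < n; i++)      /* coef[] reused each row */
            out[j * n + i] = coef[i] * in[j * n + i];
}
```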


Archive | 2010

Large Scaled Computation of Incompressible Flows on Cartesian Mesh Using a Vector-Parallel Supercomputer

Shun Takahashi; Takashi Ishida; Kazuhiro Nakahashi; Hiroaki Kobayashi; Koki Okabe; Youichi Shimomura; Takashi Soga; Akihiro Musa

The present incompressible Navier-Stokes flow solver is developed in the framework of the Building-Cube Method (BCM), which is based on a block-structured, high-density Cartesian mesh method. In this study, a flow simulation around a Formula 1 car, on a mesh consisting of 200 million cells, was conducted on the vector-parallel supercomputer NEC SX-9. To exploit the performance of the SX-9, the present flow solver was highly optimized for vector and parallel computation. In this paper, the computational results of the large-scale simulation and the parallel efficiency of flat MPI and hybrid MPI are discussed.
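The flat-MPI versus hybrid-MPI comparison boils down to how the BCM cubes are distributed: one MPI rank per core, or one rank per node with threads working on the node's cubes. A minimal hybrid sketch, with the cube count, partitioning, and solver stub all assumed for illustration:

```c
#include <mpi.h>

#define TOTAL_CUBES 4096   /* assumed total number of BCM cubes */

/* Stand-in for the per-cube flow update (vectorized inner loops
 * in the real solver). */
static void solve_cube(int cube_id) { (void)cube_id; }

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Uniform block partition of cubes across ranks: because every
     * BCM cube holds the same number of cells, this simple split
     * already balances the load. */
    int per_rank = TOTAL_CUBES / nprocs;
    int first    = rank * per_rank;

    /* Hybrid variant: threads share the rank's cubes. In flat MPI,
     * each core runs its own rank and this pragma is dropped. */
#pragma omp parallel for schedule(static)
    for (int c = first; c < first + per_rank; c++)
        solve_cube(c);

    MPI_Barrier(MPI_COMM_WORLD);  /* inter-cube halo exchange omitted */
    MPI_Finalize();
    return 0;
}
```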


Memory Performance: Dealing with Applications, Systems and Architecture | 2008

A shared cache for a chip multi vector processor

Akihiro Musa; Yoshiei Sato; Takashi Soga; Koki Okabe; Ryusuke Egawa; Hiroyuki Takizawa; Hiroaki Kobayashi

This paper discusses the design of a chip multi vector processor (CMVP), especially examining the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become the mainstream in commodity scalar processors, the CMP architecture will be adopted in the design of vector processors in the near future to harness the large number of transistors on a chip. To maintain high sustained performance when executing scientific and engineering applications, a vector processor (core) generally requires a ratio of memory bandwidth to arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been encountering the memory wall problem due to the limited pin bandwidth. Therefore, we propose an on-chip shared cache to maintain the effective memory bandwidth of a CMVP. We evaluate the performance of the CMVP, based on the NEC SX vector architecture, using real scientific applications. In particular, we examine the caching effect on sustained performance as the B/FLOP rate is decreased. The experimental results indicate that an 8 MB on-chip shared cache can improve the performance of a four-core CMVP by 15% to 40% compared with the same processor without the cache, because the shared cache increases the cache hit rates of multiple threads. The shared cache also employs miss status handling registers, which have the potential to accelerate difference schemes in scientific and engineering applications. Moreover, we show that 2 B/FLOP is enough for the CMVP to achieve high scalability when the on-chip cache is employed.
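The claim that a shared cache raises the hit rates of multiple threads is easiest to see in a difference scheme, where cores working on adjacent blocks of rows read each other's boundary rows. A hypothetical sketch of such an update (the decomposition and names are illustrative, not from the paper):

```c
/* Hypothetical 1-D diffusion update with rows split across the
 * cores of a CMVP; the caller assigns each core a block [r0, r1)
 * with 1 <= r0 < r1 <= rows - 1. The core updating [r0, r1) also
 * reads rows r0 - 1 and r1, which belong to its neighbors. With a
 * shared on-chip cache, those boundary rows are often already
 * resident, loaded by the neighboring core, so the second access
 * hits on chip instead of going to off-chip memory. */
void update_rows(double *u_new, const double *u,
                 int r0, int r1, int n)
{
    for (int j = r0; j < r1; j++)
        for (int i = 0; i < n; i++)
            u_new[j * n + i] = 0.25 * (u[(j - 1) * n + i]
                                     + u[(j + 1) * n + i])
                             + 0.5  *  u[j * n + i];
}
```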


Memory Performance: Dealing with Applications, Systems and Architecture | 2009

Performance tuning and analysis of future vector processors based on the roofline model

Yoshiei Sato; Ryuichi Nagaoka; Akihiro Musa; Ryusuke Egawa; Hiroyuki Takizawa; Koki Okabe; Hiroaki Kobayashi

Because of the recent steep drop in the ratio of memory bandwidth to computational performance (B/F) of vector processors, their advantage over scalar processors in sustained performance is eroding. To compensate for the insufficient B/F rate, an on-chip vector cache mechanism is promising for vector processors. Although the effectiveness of the vector cache has been evaluated, cache-conscious tuning of vector codes and the analysis of the obtained performance have not yet been discussed. Given this situation, the purpose of this paper is to establish a strategy for performance tuning of a vector processor with a cache so as to exploit its potential. To analyze its sustained performance, this paper uses the roofline model. Several optimization techniques are applied to real scientific and engineering applications, and their effects are assessed with the model. We confirm that the model can guide users to effective tuning that maximizes the gain. We also discuss the energy efficiency of the on-chip vector cache.
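The roofline model used here bounds attainable performance by the smaller of the peak arithmetic rate and the product of memory bandwidth and the code's arithmetic intensity (flops per byte of traffic):

$$ P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; B \times I\right), \qquad I = \frac{\text{flops executed}}{\text{bytes moved}}. $$

Cache-conscious tuning acts on both terms: raising the hit ratio of the vector cache increases the effective bandwidth B, while optimizations that reduce memory traffic raise I, moving an application out from under the bandwidth roof.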


International Symposium on Parallel and Distributed Processing and Applications | 2006

Implications of memory performance for highly efficient supercomputing of scientific applications

Akihiro Musa; Hiroyuki Takizawa; Koki Okabe; Takashi Soga; Hiroaki Kobayashi

This paper examines the memory performance of vector-parallel and scalar-parallel computing platforms across five applications from three scientific areas: electromagnetic analysis, CFD/heat analysis, and seismology. Our evaluation results show that the vector platforms can achieve high computational efficiency and hence significantly outperform the scalar platforms in these application areas. We conducted exhaustive experiments and quantitatively evaluated representative scalar and vector platforms using real applications from the viewpoint of system designers and developers. The results demonstrate that the ratio of memory bandwidth to floating-point operation rate needs to reach 4 bytes/flop for the vector platforms to preserve computational performance while hiding memory access latencies with pipelined vector operations. We also confirm that a sufficient number of memory banks to handle strided memory accesses leads to an increase in execution efficiency. On the scalar platforms, the cache hit rate needs to be almost 100% to achieve high computational efficiency.
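The remark about memory banks and strided accesses can be illustrated with a small model (hypothetical bank mapping, purely for intuition): consecutive words rotate across the banks, so a stride that shares a factor with the bank count concentrates traffic on a few banks and serializes the accesses.

```c
#include <stdio.h>

#define NBANKS 32   /* assumed bank count, for illustration only */

/* Counts how many distinct banks a strided access pattern touches
 * under a simple interleaved mapping: bank = word index mod NBANKS.
 * With stride 1 all 32 banks work in parallel; with stride 32 every
 * access lands in one bank and effective bandwidth collapses. */
int banks_touched(int stride, int nelems)
{
    int used[NBANKS] = {0}, count = 0;
    for (int i = 0; i < nelems; i++) {
        int bank = (i * stride) % NBANKS;
        if (!used[bank]) { used[bank] = 1; count++; }
    }
    return count;
}

int main(void)
{
    for (int s = 1; s <= 32; s *= 2)
        printf("stride %2d -> %2d banks\n", s, banks_touched(s, 1024));
    return 0;
}
```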


Archive | 2010

Large-Scale Flow Computation of Complex Geometries by Building-Cube Method

Daisuke Sasaki; Shun Takahashi; Takashi Ishida; Kazuhiro Nakahashi; Hiroaki Kobayashi; Koki Okabe; Youichi Shimomura; Takashi Soga; Akihiro Musa

A three-dimensional large-scale incompressible flow simulation was conducted with the Building-Cube Method (BCM), which is based on an equally-spaced Cartesian mesh method. To exploit the expected near-future high performance computers with massive numbers of processors, simple algorithms have been implemented in the BCM for mesh generation and the flow solver. In this study, the capability of the BCM for large-scale computation was demonstrated by solving a Formula 1 car model with around 200 million cells. The computation was conducted on the vector-parallel supercomputer NEC SX-9 at the Cyberscience Center of Tohoku University. The parallel efficiency of the BCM with flat MPI and hybrid MPI on a vector-parallel system was also investigated.
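The "simple algorithms" that let the BCM scale come from its data layout: the domain is covered by cubes of different physical sizes, but every cube holds an identical, equally spaced Cartesian block of cells. A minimal sketch of such a structure (field names and the edge size are assumptions for illustration):

```c
#define N 16   /* cells per cube edge; identical for every cube */

/* Hypothetical BCM cube: cubes differ in physical size, but each
 * stores the same N x N x N block of equally spaced cells. Uniform
 * cube contents make vectorization of the inner loops, load
 * balancing (one cube = one unit of work), and mesh generation
 * around complex geometry straightforward. */
typedef struct {
    double origin[3];   /* position of the cube's corner     */
    double spacing;     /* cell size = cube edge length / N  */
    double u[N][N][N];  /* a flow variable on the cube       */
} Cube;
```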

Collaboration


Dive into Koki Okabe's collaborations.

Top Co-Authors

Shun Takahashi

Tokyo University of Agriculture and Technology
