Kenichi Miura | Researchain

Publication

Featured research published by Kenichi Miura.

Computer Physics Communications | 1987

EGS4V: Vectorization of the Monte Carlo cascade shower simulation code EGS4

Kenichi Miura

This paper describes the vectorization method for the Electromagnetic Cascade Shower Simulation Code EGS4. A new control scheme and a new data structure suitable for vectorization of transport Monte Carlo simulations are discussed. It is shown that a vectorized version can achieve a performance improvement of more than 8 times over the original scalar EGS4 code on the AMDAHL 1200 Vector Processor System for a sample problem.
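The abstract does not spell out the control scheme or the data structure, so the following is only a generic sketch of the idea and not EGS4V itself. It assumes a toy one-dimensional slab model; the names `run_scalar`, `run_batched`, `sigma`, and `p_absorb` are all hypothetical. The sketch contrasts the particle-at-a-time loop with a batched loop that keeps particle state in parallel arrays and applies each step to every live particle, which is the loop shape a vector machine can exploit.

```python
import math
import random


def run_scalar(n, seed, thickness, sigma, p_absorb):
    """Follow one particle at a time from birth to death (sequential style)."""
    transmitted = 0
    for i in range(n):
        rng = random.Random(seed + i)  # one private random stream per particle
        x = 0.0
        while True:
            # Free flight: exponentially distributed path length.
            x += -math.log(1.0 - rng.random()) / sigma
            if x >= thickness:
                transmitted += 1  # particle left the slab
                break
            if rng.random() < p_absorb:
                break  # particle absorbed in a collision
    return transmitted


def run_batched(n, seed, thickness, sigma, p_absorb):
    """Apply each step to all live particles at once (vectorizable style)."""
    rngs = [random.Random(seed + i) for i in range(n)]
    x = [0.0] * n              # structure-of-arrays: one array per attribute
    alive = list(range(n))     # index list of particles still in flight
    transmitted = 0
    while alive:
        # Step 1: free flight for every live particle.
        for i in alive:
            x[i] += -math.log(1.0 - rngs[i].random()) / sigma
        # Step 2: gather the particles that are still inside the slab.
        inside = [i for i in alive if x[i] < thickness]
        transmitted += len(alive) - len(inside)
        # Step 3: collision test; survivors form the next batch.
        alive = [i for i in inside if rngs[i].random() >= p_absorb]
    return transmitted
```

Because each particle draws from its own random stream, both orderings give the same count, for example `run_scalar(500, 7, 2.0, 1.0, 0.3) == run_batched(500, 7, 2.0, 1.0, 0.3)`. In pure Python the batched form is not faster; it only illustrates the restructured control flow.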

Computer Science - Research and Development | 2013

The design of ultra scalable MPI collective communication on the K computer

Tomoya Adachi; Naoyuki Shida; Kenichi Miura; Shinji Sumimoto; Atsuya Uno; Motoyoshi Kurokawa; Fumiyoshi Shoji; Mitsuo Yokokawa

This paper proposes the design of ultra scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world’s first system over 10 PFLOPS. The nodes are connected by a Tofu interconnect that introduces a six-dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network system since they assume typical cluster environments. Thus, we design collective algorithms optimized for the K computer. In designing the algorithms, we place importance on collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and consist of neighbor communication in order to gain high bandwidth and avoid message collisions. On the other hand, the short-message algorithms are designed to reduce software overhead, which comes from the number of relaying nodes. The evaluation results on up to 55,296 nodes of the K computer show the new implementation outperforms the existing one for long messages by a factor of 4 to 11. It also shows the short-message algorithms complement the long-message ones.

conference on high performance computing (supercomputing) | 1994

A high performance linear equation solver on the VPP500 parallel supercomputer

Makoto Nakanishi; Hiroshi Ina; Kenichi Miura

This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of the VPP500: scalability to an arbitrary number of processors, up to 222; flexible data transfer among processors provided by a crossbar interconnection network; vector processing capability on each processor; and overlapped computation and data transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS with 100 processors in the LINPACK Highly Parallel Computing benchmark.
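As a rough illustration of the blocked LU decomposition mentioned above, here is a minimal single-process sketch of a right-looking blocked factorization. It is not the VPP500 solver: it omits pivoting (so it is only safe for matrices such as diagonally dominant ones) and does no data distribution or overlap of transfer and computation. The function names `blocked_lu` and `reconstruct` are hypothetical.

```python
def blocked_lu(a, block):
    """Return L and U packed in one matrix (unit diagonal of L implied)."""
    n = len(a)
    a = [[float(v) for v in row] for row in a]  # work on a copy
    for k in range(0, n, block):
        e = min(k + block, n)
        # 1. Unblocked LU of the diagonal block A11.
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # 2. Row panel: U12 = inverse(L11) * A12 by forward substitution.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # 3. Column panel: L21 = A21 * inverse(U11).
        for i in range(e, n):
            for j in range(k, e):
                for c in range(k, j):
                    a[i][j] -= a[i][c] * a[c][j]
                a[i][j] /= a[j][j]
        # 4. Trailing update: A22 -= L21 * U12 (the matrix-multiply bulk).
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for j in range(k, e):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a


def reconstruct(lu):
    """Multiply the packed factors back together to check L * U == A."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out
```

Step 4 is where most of the arithmetic lands, which is why blocking suits vector and parallel hardware: the trailing update is a matrix-matrix product that can be split across processors.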

parallel computing | 1988

Tradeoffs in granularity and parallelization for a Monte Carlo shower simulation code

Kenichi Miura; Robert G. Babb

The EGS4 code, developed at Stanford Linear Accelerator Center, simulates electron-photon cascading phenomena. The original code is inherently sequential, processing one particle at a time. This paper reports on a series of experiments in parallelizing different versions of EGS4. Our parallel experiments were run on a 30-processor Sequent Balance B21 and a 6-processor Symmetry S27. We have considered the following approaches for parallel execution of this application code: (1) Original sequential version modified for parallel processing: 1 processor; (2) Version 1 run multiprocessed: 1 to 29 processors; (3) Sequential version modified for large-grain parallel processing: 1 processor; (4) Version 3 run using the Sequent Microtasking Library: 1 to 29 processors. For each approach, we discuss the relative advantages and disadvantages in the areas of coding effort, understandability and portability, as well as performance, and outline a new parallelization approach we are currently pursuing based on Large-Grain Data Flow techniques.

international conference on supercomputing | 2014

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect

Yuichiro Ajima; Tomohiro Inoue; Shunji Uno; Shinji Sumimoto; Kenichi Miura; Naoyuki Shida; Takahiro Kawashima; Takayuki Okamoto; Osamu Moriyama; Yoshiro Ikeda; Takekazu Tabata; Takahide Yoshikawa; Ken Seki; Toshiyuki Shimizu

The Tofu Interconnect 2 (Tofu2) is a system interconnect designed for Fujitsu's next-generation successor to the PRIMEHPC FX10 supercomputer. Tofu2 inherited the 6-dimensional mesh/torus network topology from its predecessor, and it increases the link throughput by two and a half times. It is integrated into a newly developed SPARC64™ processor chip and takes advantage of system-on-chip implementation by removing off-chip I/O between a processor chip and an interconnect controller. Tofu2 also introduces new features such as the atomic read-modify-write communication functions, the session-mode control queue for the offloading of collective communications, and a harmless cache injection technique to reduce communication latency.

Digest of Papers. Compcon Spring | 1993

Overview of the Fujitsu VPP500 supercomputer

Kenichi Miura; Moriyuki Takamura; Yoshinori Sakamoto; Shin Okada

The authors present an overview of the Fujitsu VPP500 vector parallel processor. The VPP500 is a high-performance, highly parallel distributed memory system. A crossbar network interconnects 4 to 222 processing elements, which gives a maximum system performance of up to 355 GFLOPS and an aggregate memory capacity of at most 55 Gbyte. The UNIX SVR4-based operating system, modified for the VPP500's distributed memory environment, supports the FORTRAN77 compiler to present the programmer with a high-performance, shared memory paradigm.

winter simulation conference | 1990

Vectorization and parallelization of transport Monte Carlo simulation codes

Kenichi Miura

Discusses the computational techniques, the coding methodology, and the performance for transport Monte Carlo simulation on vector supercomputers and on shared-memory parallel processors. A cascade shower simulation code EGS4 is used as an example. For vector processing, a more than 10-fold increase in performance has been obtained by treating the problem in a different manner from conventional sequential processing in such a way as to exploit the vector architecture of current supercomputers. For parallel processing, a more than 25-fold increase in performance has been obtained over sequential processing by using 29 processors. The authors also discuss an analytical performance model for parallel processing, issues in the parallel processing of the transport Monte Carlo codes, and comparisons between vector and parallel approaches.
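To put the parallel figure in context, the short sketch below works out what a 25-fold speedup on 29 processors implies. It uses Amdahl's law as a stand-in model, which is an assumption: the abstract does not say what the authors' analytical performance model is. The function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def amdahl_speedup(serial_fraction, processors):
    """Amdahl's law: S = 1 / (f + (1 - f) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def amdahl_serial_fraction(speedup, processors):
    """Invert Amdahl's law to get the serial fraction f implied by S on p."""
    return (processors / speedup - 1.0) / (processors - 1.0)


# Figures quoted in the abstract: more than 25-fold on 29 processors.
efficiency = parallel_efficiency(25.0, 29)      # about 0.862
serial = amdahl_serial_fraction(25.0, 29)       # about 0.0057
```

Since the reported speedup is "more than 25-fold", the efficiency is at least about 86%, and under Amdahl's law the serial fraction would be at most roughly 0.6% of the run time.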

Computer Physics Communications | 1985

Supervector performance without toil: FORTRAN implemented vector algorithms on the VP-100/200

Toshihiko Matsuura; Kenichi Miura; Mitsuhiro Makino

The advanced architecture and software of Fujitsu's new vector machines, the FACOM VP-100/200, allow one to extract maximum performance from a program by FORTRAN coding. Three examples of basic algorithms, i.e., triangularization of a symmetric matrix, a radix-2 FFT, and a random number generator, are implemented on the VP-200 to exemplify the effectiveness of the well-balanced compiler capability against the VP architecture.

parallel computing | 1995

Hardware Performance of the VPP500 Parallel Supercomputer

Moriyuki Takamura; Kenichi Miura; Akira Nodomi; Masayuki Ikeda

This paper describes the performance of the Fujitsu VPP500, at both the hardware and application levels. The VPP500 is a distributed memory parallel supercomputer that is based on high performance vector processing elements interconnected by a crossbar network. First, we measured the performance of basic aspects of the VPP500 and confirmed its vector performance and its high data transfer performance among processing elements. The replicated functional units in each of the vector pipelines give the VPP500 its high vector performance. The interprocessor communication hardware achieves bandwidth that approaches the peak rate (400 Mbytes/s) for representative transfer patterns and provides good throughput even for relatively small-sized data. We also measured the performance of the VPP500 using the NAS Parallel Benchmark Suite.

international conference on supercomputing | 1995

Scalar processor of the VPP500 parallel supercomputer

Yasuhiko Nakashima; Toshiaki Kitamura; Hideo Tamura; Masaaki Takiuchi; Kenichi Miura

This paper describes the scalar processor of the VPP500. The 64-bit long instruction word (LIW) architecture allows the issue of up to three operations every clock cycle. Notable features of the architecture are the PC-relative conditional branch operations, asynchronous operations to allow out-of-order execution, and an interruption facility which can handle multiple exceptions. The VPP500 implementation has a three-stage pipeline which reduces the hardware cost and branch penalty. The two integer ALUs and the pipelined floating-point units allow up to two integer operations or two floating-point operations to be issued per cycle. The wide data path from the cache enables the loading of a pair of either general-purpose or floating-point registers per cycle.
