Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Naoya Maruyama is active.

Publication


Featured research published by Naoya Maruyama.


IEEE International Conference on High Performance Computing Data and Analytics | 2011

FTI: high performance fault tolerance interface for hybrid systems

Leonardo Bautista-Gomez; Seiji Tsuboi; Dimitri Komatitsch; Franck Cappello; Naoya Maruyama; Satoshi Matsuoka

Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding into a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1-petaflops runs (1,152 GPUs) while checkpointing at high frequency.
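To make the checkpointing workflow concrete, the following is a minimal usage sketch based on FTI's published C API (FTI_Init, FTI_Protect, FTI_Snapshot); exact signatures and constants may differ across FTI versions, and the protected arrays here are placeholders.

/* Minimal FTI usage sketch: protect application state and let FTI decide
 * when to take a multi-level checkpoint.  Based on FTI's published C API;
 * exact signatures may vary across versions. */
#include <mpi.h>
#include <fti.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* FTI reserves one process per node for fault tolerance and hands the
     * application a reduced communicator, FTI_COMM_WORLD. */
    FTI_Init("config.fti", MPI_COMM_WORLD);

    long    n     = 1 << 20;
    double *field = malloc(n * sizeof(double));
    int     step  = 0;

    /* Register the data that must survive a failure. */
    FTI_Protect(0, &step, 1, FTI_INTG);
    FTI_Protect(1, field, n, FTI_DBLE);

    for (; step < 10000; step++) {
        /* Restores protected data after a restart; otherwise takes an
         * L1/L2/L3/L4 checkpoint when the configured interval elapses. */
        FTI_Snapshot();

        /* ... one simulation step over `field` using FTI_COMM_WORLD ... */
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}

The application communicates over FTI_COMM_WORLD because FTI keeps a dedicated per-node fault-tolerance process that performs the Reed-Solomon encoding in the background, which is how the encoding time is hidden from the application.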


IEEE International Conference on High Performance Computing Data and Analytics | 2011

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Naoya Maruyama; Kento Sato; Tatsuo Nomura; Satoshi Matsuoka

This paper proposes a compiler-based programming framework that automatically translates user-written structured-grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translation, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into actual implementation code in CUDA for GPU acceleration and MPI for node-level parallelization, with automatic optimizations such as overlapping of computation and communication. We demonstrate the feasibility of such automatic translation by implementing several structured-grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show performance comparable to hand-written code and good strong and weak scalability up to 256 GPUs.
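As a concrete reference for the class of computations targeted, the self-contained C code below is a sequential 7-point 3-D diffusion stencil, a standard example of a structured-grid kernel. Under the proposed model the user would express only the per-point update as a declarative kernel, and the translator would generate the CUDA kernels, the MPI halo exchanges, and the computation/communication overlap that this plain sequential version lacks; the grid sizes and coefficient are illustrative.

/* Sequential reference of a 7-point 3-D diffusion stencil.  In the proposed
 * framework, only the per-point update would be written (declaratively);
 * everything else is generated by the translator. */
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64
#define IDX(x, y, z) ((x) + NX * ((y) + NY * (z)))

static void diffusion_step(const float *in, float *out, float c)
{
    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                out[IDX(x, y, z)] = in[IDX(x, y, z)]
                    + c * (in[IDX(x + 1, y, z)] + in[IDX(x - 1, y, z)]
                         + in[IDX(x, y + 1, z)] + in[IDX(x, y - 1, z)]
                         + in[IDX(x, y, z + 1)] + in[IDX(x, y, z - 1)]
                         - 6.0f * in[IDX(x, y, z)]);
}

int main(void)
{
    float *a = calloc(NX * NY * NZ, sizeof(float));
    float *b = calloc(NX * NY * NZ, sizeof(float));
    for (int t = 0; t < 100; t++) {   /* double-buffered time stepping */
        diffusion_step(a, b, 0.1f);
        float *tmp = a; a = b; b = tmp;
    }
    free(a); free(b);
    return 0;
}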


IEEE International Conference on High Performance Computing Data and Analytics | 2011

Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer

Takashi Shimokawabe; Takayuki Aoki; Tomohiro Takaki; Toshio Endo; Akinori Yamanaka; Naoya Maruyama; Akira Nukada; Satoshi Matsuoka

The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting the patterns of solidified metals is indispensable. Phase-field simulation is the most powerful method known for simulating micro-scale dendritic growth during solidification in a binary alloy. For a realistic description of solidification, however, phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to such heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only in describing simple shapes. Our new simulation techniques achieved unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology demonstrated good weak scaling and achieved 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.


International Conference on Green Computing | 2010

Statistical power modeling of GPU kernels using performance counters

Hitoshi Nagasaka; Naoya Maruyama; Akira Nukada; Toshio Endo; Satoshi Matsuoka

We present a statistical approach for estimating the power consumption of GPU kernels. We use the GPU performance counters exposed to CUDA applications and train a linear regression model in which the performance counters are the independent variables and power consumption is the dependent variable. For model training and evaluation, we use publicly available CUDA applications: 49 kernels from the CUDA SDK and the Rodinia benchmark suite. Our regression model achieves highly accurate estimates for many of the tested kernels, with an average error of 4.7%. However, we also find that it fails to yield accurate estimates for kernels with texture reads because of the lack of performance counters for monitoring texture accesses, resulting in significant underestimation for such kernels.
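Schematically, such a linear power model takes the form below (the choice of counters and the coefficients are illustrative, not the paper's exact set):

% Linear power model fitted by ordinary least squares over the training kernels
P_{\mathrm{est}} \;=\; \beta_0 + \sum_{i=1}^{k} \beta_i\, c_i ,
\qquad
\hat{\beta} \;=\; \arg\min_{\beta} \sum_{j} \Bigl( P_j - \beta_0 - \sum_{i=1}^{k} \beta_i\, c_{ij} \Bigr)^{2}

where the c_i are per-kernel performance-counter values (for example memory transactions or instructions issued, normalized by kernel runtime), P_j is the measured average power of training kernel j, and the fitted coefficients are then applied to the counters of unseen kernels to estimate their power.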


IEEE International Conference on High Performance Computing Data and Analytics | 2010

An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code

Takashi Shimokawabe; Takayuki Aoki; Chiashi Muroi; Junichi Ishida; Kohei Kawano; Toshio Endo; Akira Nukada; Naoya Maruyama; Satoshi Matsuoka

Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. Early work on accelerating the mainstream weather code WRF with GPUs and their high memory performance, however, resulted in only minor speedups because only part of the huge code was ported to the GPU. Our full CUDA port of the high-resolution weather prediction model ASUCA is the first such effort we know of to date; ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in its underlying physics (non-hydrostatic model). Benchmarks on the 528-GPU (NVIDIA GT200 Tesla) TSUBAME supercomputer at the Tokyo Institute of Technology demonstrated over an 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision on a 6956 x 6052 x 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4,000 NVIDIA Fermi GPUs and be deployed in October 2010, will also be presented.


Cluster Computing and the Grid | 2007

Virtual Clusters on the Fly - Fast, Scalable, and Flexible Installation

Hideo Nishimura; Naoya Maruyama; Satoshi Matsuoka

One of the advantages of virtualized computing clusters over traditional shared HPC environments is their ability to accommodate user-specific system customization. However, past attempts to provide virtual clusters do not scale with an increasing number of VMs, nor do they allow fine-grained customization of VMs, assuming instead that preconfigured VM images are always available on the grid. We propose a new virtual cluster installation technique that achieves efficiency and scalability while simultaneously offering fine-grained customizability. It allows the user to create VMs on the fly for fine-grained customization, and uses pipelined data transfer for scalable installation as the number of VMs increases. To achieve efficiency in the presence of such full customization, it automatically caches frequently constructed virtual disk images to save software installation time in common cases. Our experimental studies using a prototype implementation show that installation of a 190-node virtual cluster can be done in 40 seconds. From this result, along with a scalability study, we estimate that installation of a 1,000-node virtual cluster could be done in less than two minutes.
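The disk-image caching idea can be sketched as below: the requested installation specification is reduced to a cache key, and a previously built image is reused on a hit instead of rerunning the full installation. The key derivation (FNV-1a over a normalized spec string) and the path layout are assumptions for illustration, not the paper's implementation.

/* Illustrative cache lookup for built virtual disk images: identical
 * installation specs hash to the same key, so the image is built once and
 * cloned on later requests. */
#include <stdint.h>
#include <stdio.h>

/* FNV-1a hash of the normalized installation spec (OS, package list, ...). */
static uint64_t cache_key(const char *spec)
{
    uint64_t h = 14695981039346656037ULL;
    for (const unsigned char *p = (const unsigned char *)spec; *p; p++) {
        h ^= *p;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    const char *spec = "os=linux;pkgs=gcc,mpich;extra-pkgs=none";
    char image[128];
    snprintf(image, sizeof image, "/cache/vmimg-%016llx.img",
             (unsigned long long)cache_key(spec));
    /* If `image` exists, clone it for the new VMs (fast path); otherwise run
     * the full installation once and store the result under `image`. */
    printf("cache entry: %s\n", image);
    return 0;
}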


Conference on High Performance Computing (Supercomputing) | 2006

Problem diagnosis in large-scale computing environments

Alexander V. Mirgorodskiy; Naoya Maruyama; Barton P. Miller

We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behavior. However, it can make use of such data, when available, to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
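The trace-comparison step can be illustrated with a deliberately simplified sketch: each process is summarized as a normalized "time spent per function" profile, and the process whose profile is farthest from its nearest neighbor is flagged as the anomaly. The paper's actual distance metric and ranking are more elaborate; the numbers below are made up.

/* Simplified sketch of unsupervised outlier detection over per-process
 * function-time profiles. */
#include <math.h>
#include <stdio.h>

#define NPROC 4   /* number of traced processes   */
#define NFUNC 3   /* number of profiled functions */

static double dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int i = 0; i < NFUNC; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(d);
}

int main(void)
{
    /* Fraction of execution time each process spent in each function. */
    double prof[NPROC][NFUNC] = {
        {0.70, 0.20, 0.10},
        {0.68, 0.22, 0.10},
        {0.71, 0.19, 0.10},
        {0.10, 0.15, 0.75},   /* behaves differently: likely anomalous */
    };

    int worst = -1;
    double worst_score = -1.0;
    for (int p = 0; p < NPROC; p++) {
        /* Suspect score = distance to the most similar other process. */
        double score = 1e300;
        for (int q = 0; q < NPROC; q++)
            if (q != p && dist(prof[p], prof[q]) < score)
                score = dist(prof[p], prof[q]);
        if (score > worst_score) { worst_score = score; worst = p; }
    }
    printf("most anomalous process: %d (score %.3f)\n", worst, worst_score);
    return 0;
}

With reference data from known-good runs, the same scoring can be applied against the reference profiles instead of (or in addition to) the peers, which reduces false positives when all processes drift together.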


IEEE International Conference on High Performance Computing Data and Analytics | 2012

Design and modeling of a non-blocking checkpointing system

Kento Sato; Kathryn Mohror; Adam Moody; Todd Gamblin; B.R. de Supinski; Naoya Maruyama; Satoshi Matsuoka

As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders-of-magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1 to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
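The central idea of non-blocking checkpointing, overlapping the slow transfer to the PFS with continued computation, can be illustrated with a deliberately simplified sketch. The system described in the paper uses multi-level checkpoints and dedicated staging resources rather than an application-side helper thread, so the structure and paths below are assumptions for illustration only.

/* Simplified illustration of non-blocking checkpointing: write a fast local
 * checkpoint, then drain it to the parallel file system in the background
 * while computation continues. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *drain_to_pfs(void *arg)
{
    const char *local_ckpt = arg;
    /* ... copy local_ckpt to the PFS ... */
    fprintf(stderr, "draining %s to PFS in the background\n", local_ckpt);
    sleep(1);  /* stand-in for the slow PFS transfer */
    return NULL;
}

int main(void)
{
    pthread_t drain;
    for (int step = 1; step <= 3; step++) {
        /* ... compute ... */

        /* Fast local (level-1) checkpoint to node-local SSD or RAM disk. */
        char path[64];
        snprintf(path, sizeof path, "/tmp/ckpt_%d.bin", step);
        /* ... write application state to `path` ... */

        /* Hand the slow PFS transfer to a helper thread and keep computing. */
        pthread_create(&drain, NULL, drain_to_pfs, path);
        /* ... more computation overlapped with the drain ... */
        pthread_join(drain, NULL);
    }
    return 0;
}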


Grid Computing | 2010

Distributed Diskless Checkpoint for Large Scale Systems

Leonardo Bautista Gomez; Naoya Maruyama; Franck Cappello; Satoshi Matsuoka

In high performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Nowadays, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if the checkpoint frequency increases along with the fault frequency. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerance model able to tolerate up to 50% of process failures with low checkpointing overhead; 2) our fault-tolerance model works without spare nodes, while still guaranteeing high reliability; 3) we use solid-state drives to significantly increase checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
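The encoding in the paper is a general scheme that tolerates up to half of the processes failing without spare nodes; the simplest instance of the same diskless principle is a single XOR parity per group, sketched below with MPI. This tolerates only one failure per group and is meant purely to illustrate the idea, not the paper's encoding.

/* Diskless-checkpoint principle in its simplest form: ranks keep their
 * checkpoints in memory (or on local SSD) and jointly compute a parity so
 * that one lost checkpoint can be rebuilt from the survivors. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Encode: XOR-reduce all checkpoints onto rank 0 of the group.  Recovery of
 * a failed rank's checkpoint is the XOR of the parity with the surviving
 * ranks' checkpoints. */
static void encode_parity(unsigned char *ckpt, unsigned char *parity,
                          int bytes, MPI_Comm group)
{
    MPI_Reduce(ckpt, parity, bytes, MPI_UNSIGNED_CHAR, MPI_BXOR, 0, group);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, bytes = 1 << 20;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned char *ckpt   = malloc(bytes);   /* this rank's checkpoint     */
    unsigned char *parity = malloc(bytes);   /* group parity (rank 0 only) */
    memset(ckpt, rank + 1, bytes);           /* stand-in for real state    */

    encode_parity(ckpt, parity, bytes, MPI_COMM_WORLD);
    if (rank == 0)
        printf("parity over %d checkpoints computed (first byte: 0x%02x)\n",
               size, parity[0]);

    free(ckpt); free(parity);
    MPI_Finalize();
    return 0;
}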


International Parallel and Distributed Processing Symposium | 2008

An efficient, model-based CPU-GPU heterogeneous FFT library

Yasuhito Ogata; Toshio Endo; Naoya Maruyama; Satoshi Matsuoka

General-purpose computing on graphics processing units (GPGPU) is becoming popular in HPC because of its high peak performance. However, in spite of the potential performance improvements and recent promising results in scientific computing applications, its real performance is not necessarily higher than that of current high-performance CPUs, especially given recent trends towards increasing the number of cores on a single die. This is because GPU performance can be severely limited by restrictions such as memory size and bandwidth, as well as by programming through graphics-specific APIs. To overcome this problem, we propose a model-based, adaptive library for 2D FFT that automatically achieves optimal performance using the available heterogeneous CPU-GPU computing resources. To find optimal load distribution ratios between CPUs and GPUs, we construct a performance model that captures the respective contributions of the CPU and the GPU and predicts the total execution time of the 2D FFT for arbitrary problem sizes and load distributions. The performance model divides the FFT computation into several small sub-steps and predicts the execution time of each step using profiling results. Preliminary evaluation with our prototype shows that the performance model can predict the execution time of problem sizes 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement from using both CPUs and GPUs can be as high as 50% compared to using either a CPU core or a GPU alone.
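One illustrative form such a load-partitioning model can take (the symbols below are assumptions, not the paper's exact formulation): assign a fraction r of the 2D FFT work to the CPU and the rest to the GPU, predict each side's time from profiled sub-step rates, and pick the ratio that minimizes the slower side.

% Illustrative CPU--GPU load-partitioning model for a problem of size N
T(r) \;=\; \max\Bigl( T_{\mathrm{cpu}}(rN),\;
                      T_{\mathrm{xfer}}\bigl((1-r)N\bigr) + T_{\mathrm{gpu}}\bigl((1-r)N\bigr) \Bigr),
\qquad
r^{*} \;=\; \arg\min_{0 \le r \le 1} T(r)

where each T term is predicted from the profiled execution times of the corresponding sub-steps, so r* can be chosen analytically or by a one-dimensional search before the full-size run.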

Collaboration


Dive into Naoya Maruyama's collaborations.

Top Co-Authors

Satoshi Matsuoka
Tokyo Institute of Technology

Akira Nukada
Tokyo Institute of Technology

Toshio Endo
Tokyo Institute of Technology

Franck Cappello
Argonne National Laboratory

Kento Sato
Tokyo Institute of Technology

Adam Moody
Lawrence Livermore National Laboratory

Kathryn Mohror
Lawrence Livermore National Laboratory