Publication


Featured research published by Rio Yokota.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

Tsuyoshi Hamada; Tetsu Narumi; Rio Yokota; Kenji Yasuoka; Keigo Nitadori; Makoto Taiji

As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs that scale as O(N²), the present method calculates the O(N log N) treecode and O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method by choosing one standard application, a gravitational N-body simulation, and one non-standard application, the simulation of turbulence using vortex particles. The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops. The vortex particle simulation of homogeneous isotropic turbulence using the periodic FMM with 16,777,216 particles showed a sustained performance of 20.2 TFlops. The overall cost of the hardware was 228,912 dollars. The maximum corrected performance is 28.1 TFlops for the gravitational simulation, which results in a cost performance of 124 MFlops/$. This correction is performed by counting the Flops based on the most efficient CPU algorithm. Any extra Flops that arise from the GPU implementation and parameter differences are not included in the 124 MFlops/$.
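
To make the complexity argument concrete, the sketch below contrasts the direct pairwise sum with the far-field cell approximation that treecodes and the FMM exploit. This is an illustrative Python toy, not the authors' GPU code; the particle counts, the single distant "cell", and the monopole-only expansion are assumptions chosen for brevity.

```python
import numpy as np

def direct_potential(target, pos, mass):
    """Potential at one target by direct summation over all N sources;
    sweeping all N targets this way is the O(N^2) cost the treecode/FMM avoid."""
    r = np.linalg.norm(pos - target, axis=1)
    return -np.sum(mass / r)

def monopole_potential(target, pos, mass):
    """Approximate a distant cell by its total mass placed at its center of
    mass -- the lowest-order expansion a treecode applies recursively."""
    m_total = mass.sum()
    com = (mass[:, None] * pos).sum(axis=0) / m_total
    return -m_total / np.linalg.norm(target - com)

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[10.0, 0.0, 0.0], scale=0.1, size=(1000, 3))  # far "cell"
masses = np.full(1000, 1.0 / 1000)
target = np.zeros(3)

print(direct_potential(target, cluster, masses),    # exact, O(N) per target
      monopole_potential(target, cluster, masses))  # O(1) per target, roughly 4-digit match
```

A real treecode applies this cell approximation recursively through an octree and falls back to the direct sum only for nearby cells, which is what brings the total cost down to O(N log N), or O(N) with the full multipole-to-local machinery of the FMM.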


Computer Physics Communications | 2011

Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns

Rio Yokota; Jaydeep P. Bardhan; Matthew G. Knepley; Lorena A. Barba; Tsuyoshi Hamada

We present teraflop-scale calculations of biomolecular electrostatics enabled by the combination of algorithmic and hardware acceleration. The algorithmic acceleration is achieved with the fast multipole method (FMM) in conjunction with a boundary element method (BEM) formulation of the continuum electrostatic model, as well as the BIBEE approximation to the BEM. The hardware acceleration is achieved through graphics processors (GPUs). We demonstrate the power of our algorithms and software for the calculation of the electrostatic interactions between biological molecules in solution. The applications demonstrated include the electrostatics of protein–drug binding and several multi-million atom systems consisting of hundreds to thousands of copies of lysozyme molecules. The parallel scalability of the software was studied in a cluster at the Nagasaki Advanced Computing Center, using 128 nodes, each with 4 GPUs. Delicate tuning has resulted in strong scaling with parallel efficiency of 0.8 for 256 GPUs and 0.5 for 512 GPUs. The largest application run, with over 20 million atoms and one billion unknowns, required only one minute on 512 GPUs. We are currently adapting our BEM software to solve the linearized Poisson–Boltzmann equation for dilute ionic solutions, and it is also designed to be flexible enough to be extended for a variety of integral equation problems, ranging from Poisson problems to Helmholtz problems in electromagnetics and acoustics to high Reynolds number flow.
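
As a rough illustration of where the FMM enters such a BEM solver, the toy below performs only the dense matrix-vector product with a 1/(4πr) kernel that each Krylov iteration requires; this O(N²) product is the operation the FMM replaces with an O(N) approximation. The random sphere discretization, equal panel areas, and crude self-term handling are placeholder assumptions, not the paper's molecular surfaces or quadrature.

```python
import numpy as np

def single_layer_matvec(centroids, areas, sigma):
    """y_i = sum_j G(x_i, x_j) * area_j * sigma_j with G(r) = 1/(4*pi*r).
    The singular self terms are skipped for brevity; a real BEM integrates them."""
    y = np.zeros(len(centroids))
    for i in range(len(centroids)):
        r = np.linalg.norm(centroids - centroids[i], axis=1)
        far = r > 0.0
        y[i] = np.sum(areas[far] * sigma[far] / (4.0 * np.pi * r[far]))
    return y

rng = np.random.default_rng(1)
pts = rng.normal(size=(2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # panel centroids on a unit sphere
areas = np.full(2000, 4.0 * np.pi / 2000)           # equal-area stand-in values
sigma = np.ones(2000)                               # a trial surface density

# Values cluster around 1, the analytic surface potential of a uniform density
# on the unit sphere, up to the error of this crude quadrature.
print(single_layer_matvec(pts, areas, sigma)[:3])
```

In the solver described in the abstract, a Krylov method calls this kind of matvec repeatedly, so accelerating it with the FMM accelerates the entire BEM solve.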


Computer Physics Communications | 2013

Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

Rio Yokota; Lorena A. Barba; Tetsu Narumi; Kenji Yasuoka

This paper reports large-scale direct numerical simulations of homogeneous isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier–Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096³ computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME 2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
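
The core N-body operation behind such a vortex particle method is a regularized Biot–Savart sum, written below as a brute-force Python sketch; accelerating exactly this kind of evaluation is the FMM's role in the paper. The algebraic smoothing kernel, smoothing radius, and particle counts here are illustrative assumptions, not the paper's discretization.

```python
import numpy as np

def biot_savart(pos, gamma, sigma=0.1):
    """Velocity induced at every particle by all vortex particles.

    pos   : (N, 3) particle positions
    gamma : (N, 3) vector-valued vortex strengths
    sigma : smoothing radius of the regularized kernel
    """
    vel = np.zeros_like(pos)
    for i in range(len(pos)):
        dx = pos[i] - pos                                  # x_i - x_j for all j
        r2 = np.sum(dx * dx, axis=1)
        k = 1.0 / (4.0 * np.pi * (r2 + sigma**2) ** 1.5)   # regularized 1/r^3 kernel
        vel[i] = np.sum(k[:, None] * np.cross(gamma, dx), axis=0)
    return vel

rng = np.random.default_rng(2)
x = rng.uniform(size=(1024, 3))              # particle positions in a unit box
g = rng.normal(scale=1e-3, size=(1024, 3))   # random vortex strengths
print(biot_savart(x, g)[:2])                 # induced velocities, O(N^2) to evaluate
```

In the petascale runs this double loop is replaced by the FMM, which changes the cost per time step from O(N²) toward O(N) and is what makes 69 billion particles tractable.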


Computer Physics Communications | 2009

Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence

Rio Yokota; Tetsu Narumi; Ryuji Sakamaki; Shun Kameoka; Shinnosuke Obi; Kenji Yasuoka

Recent advances in the parallelizability of fast N-body algorithms, and the programmability of graphics processing units (GPUs), have opened a new path for particle-based simulations. For the simulation of turbulence, vortex methods can now be considered as an interesting alternative to finite difference and spectral methods. The present study focuses on the efficient implementation of the fast multipole method and pseudo-particle method on a cluster of NVIDIA GeForce 8800 GT GPUs, and applies this to a vortex method calculation of homogeneous isotropic turbulence. The results of the present vortex method agree quantitatively with those of the reference calculation using a spectral method. We achieved a maximum speed of 7.48 TFlops using 64 GPUs, and the cost performance was near $9.4/GFlops. The calculation of the present vortex method on 64 GPUs took 4120 s, while the spectral method on 32 CPUs took 4910 s.


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems

Rio Yokota; Lorena A. Barba

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes of the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10⁷ particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using single instruction multiple data (SIMD) instructions resulted in a 4× speed-up of the overall algorithm on single-core tests with 10³–10⁷ particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10⁸ particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2× faster). The weak scaling test used 10⁶ particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that the FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape. The code is open for unrestricted use under the MIT license.
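
To illustrate the kind of kernel the SIMD tuning targets, the sketch below writes the particle-to-particle (P2P) Laplace interaction twice: once as a scalar loop and once with the inner loop expressed as array-wide operations, standing in for the hand-vectorized inner loop. This is a hedged Python analogy, not the tuned code run on Kraken; the source counts and charges are made up.

```python
import numpy as np

def p2p_scalar(target, sources, q):
    """Potential at one target from all sources, with a scalar inner loop."""
    phi = 0.0
    for j in range(len(sources)):
        d = target - sources[j]
        r = (d[0] ** 2 + d[1] ** 2 + d[2] ** 2) ** 0.5
        if r > 0.0:
            phi += q[j] / r
    return phi

def p2p_vectorized(target, sources, q):
    """Same kernel with the inner loop replaced by array operations,
    playing the role that SIMD instructions play in the tuned implementation."""
    r = np.linalg.norm(sources - target, axis=1)
    keep = r > 0.0
    return np.sum(q[keep] / r[keep])

rng = np.random.default_rng(3)
src = rng.uniform(size=(10_000, 3))
q = rng.uniform(size=10_000)

print(p2p_scalar(src[0], src, q), p2p_vectorized(src[0], src, q))  # agree to rounding
```

Because the P2P kernel dominates the near-field cost of the FMM, speeding up this inner loop is what produced the overall 4× gain reported in the abstract.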


Computing in Science and Engineering | 2012

Hierarchical N-body Simulations with Autotuning for Heterogeneous Systems

Rio Yokota; Lorena A. Barba

Algorithms designed to efficiently solve the classical N-body problem of mechanics fit well on GPU hardware and exhibit excellent scalability on many GPUs. Their computational intensity makes them a promising approach for other applications amenable to an N-body formulation. Adding features such as autotuning makes multipole-type algorithms ideal for heterogeneous computing environments.


Computer Methods in Applied Mechanics and Engineering | 2010

PetRBF — A parallel O(N) algorithm for radial basis function interpolation with Gaussians

Rio Yokota; Lorena A. Barba; Matthew G. Knepley

We have developed a parallel algorithm for radial basis function (RBF) interpolation that exhibits O(N) complexity, requires O(N) storage, and scales excellently up to a thousand processes. The algorithm uses a GMRES iterative solver with a restricted additive Schwarz method (RASM) as a preconditioner and a fast matrix-vector algorithm. Previous fast RBF methods, achieving at most O(N log N) complexity, were developed using multiquadric and polyharmonic basis functions. In contrast, the present method uses Gaussians with a small variance with respect to the domain, but with sufficient overlap. This is a common choice in particle methods for fluid simulation, our main target application. The fast decay of the Gaussian basis function allows rapid convergence of the iterative solver even when the subdomains in the RASM are very small. At the same time we show that the accuracy of the interpolation can achieve machine precision. The present method was implemented in parallel using the PETSc library (developer version). Numerical experiments demonstrate its capability in problems of RBF interpolation with more than 50 million data points, timing at 106 s (19 iterations for an error tolerance of 10⁻¹⁵) on 1024 processors of a Blue Gene/L (700 MHz PowerPC processors). The parallel code is freely available in the open-source model.
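
A minimal dense sketch of the linear system PetRBF solves, assuming SciPy is available: it builds the Gaussian RBF matrix for scattered 2-D data and hands it to GMRES. The restricted additive Schwarz preconditioner and the fast O(N) matrix-vector product are the paper's contributions and are deliberately omitted, so this toy remains O(N²) and only shows the interpolation problem itself; the data sites, field, and Gaussian width are made-up values.

```python
import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(4)
x = rng.uniform(size=(500, 2))                                  # scattered data sites
f = np.sin(2 * np.pi * x[:, 0]) * np.cos(2 * np.pi * x[:, 1])   # sampled field values
s = 0.05                                                        # Gaussian std. deviation

# Dense Gaussian RBF matrix: A[i, j] = exp(-|x_i - x_j|^2 / (2 s^2))
d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
A = np.exp(-d2 / (2.0 * s**2))

w, info = gmres(A, f)                     # RBF weights; info == 0 on convergence
print(info, np.max(np.abs(A @ w - f)))    # small residual of the interpolation system
```

Swapping the Gaussian for a multiquadric here would keep the code running but lose the locality that lets very small Schwarz subdomains still precondition the system well, which is the design choice the abstract highlights.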


Journal of Computational Physics | 2007

Calculation of isotropic turbulence using a pure Lagrangian vortex method

Rio Yokota; Tarun Kumar Sheel; Shinnosuke Obi



Journal of Algorithms & Computational Technology | 2013

An FMM Based on Dual Tree Traversal for Many-core Architectures

Rio Yokota



arXiv: Computational Physics | 2011

Treecode and fast multipole method for N-body simulation with CUDA

Rio Yokota; Lorena A. Barba


Collaboration


Dive into Rio Yokota's collaboration.

Top Co-Authors

David E. Keyes
King Abdullah University of Science and Technology

Huda Ibeid
King Abdullah University of Science and Technology

Tetsu Narumi
University of Electro-Communications