Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Reiji Suda is active.

Publication


Featured research published by Reiji Suda.


Mathematics of Computation | 2002

A fast spherical harmonics transform algorithm

Reiji Suda; Masayasu Takami

The spectral method with discrete spherical harmonics transform plays an important role in many applications. In spite of its advantages, the spherical harmonics transform has a drawback of high computational complexity, which is determined by that of the associated Legendre transform, and the direct computation requires time of O(N^3) for cut-off frequency N. In this paper, we propose a fast approximate algorithm for the associated Legendre transform. Our algorithm evaluates the transform by means of polynomial interpolation accelerated by the Fast Multipole Method (FMM). The divide-and-conquer approach with split Legendre functions gives computational complexity O(N^2 log N). Experimental results show that our algorithm is stable and is faster than the direct computation for N ≥ 511.
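The O(N^3) cost of the direct computation can be read off from its loop structure. A minimal sketch (our illustration of the direct method only, not the paper's FMM-accelerated algorithm), assuming a hypothetical precomputed table P[m][l-m][j] of associated Legendre values P_l^m(x_j):

```python
# Direct associated Legendre transform: for each Fourier order m, a dense
# sum over degrees l and grid latitudes j. Three nested O(N) loops give
# the O(N^3) total cost mentioned in the abstract.
def direct_legendre_transform(P, g):
    """P[m][l - m][j] = P_l^m(x_j); g[m][j] = order-m Fourier coefficient
    at latitude j. Returns spectral coefficients a[m][l - m]."""
    N = len(P)
    a = []
    for m in range(N):                      # O(N) orders
        row = []
        for l in range(m, N):               # O(N) degrees per order
            s = 0.0
            for j in range(len(g[m])):      # O(N) latitudes
                s += P[m][l - m][j] * g[m][j]
            row.append(s)
        a.append(row)
    return a
```

The paper's contribution replaces the inner sums with FMM-accelerated polynomial interpolation, reducing the total to O(N^2 log N).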


IEEE Transactions on Applied Superconductivity | 1991

Quantum flux parametron: a single quantum flux device for Josephson supercomputer

Mutsumi Hosoya; Willy Hioe; Juan Casas; Ryotaro Kamikawai; Yutaka Harada; Yasou Wada; Hideaki Nakane; Reiji Suda; Eiichi Goto

A quantum flux parametron (QFP), a single quantum flux superconductive device that has a potential of up to 100-GHz switching with nW-order power dissipation, is considered. The potential of the QFP and key technologies when QFPs are applied to a Josephson supercomputer are described. Switching speed, stability, and power dissipation of a QFP are discussed. QFP gates, circuits, and systems are next described. Then, ultra-fast clock distribution using a standing wave is explained. High-speed operation at more than 10 GHz and 10^14 error-free operations per QFP have been demonstrated. Finally described is a high-density packaging scheme by three-dimensional integration, which is very important for ultra-high speed circuits because the propagation delay becomes dominant in such circuits.


Parallel and Distributed Computing: Applications and Technologies | 2009

Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing

Reiji Suda; Da Qi Ren

Power dissipation is one of the most pressing factors limiting the development of High Performance Computing (HPC). Toward power-efficient HPC on CPU-GPU hybrid platforms, we are investigating software methodologies to achieve optimized power utilization through algorithm design and programming techniques. In this paper we discuss power measurements of GPUs, propose a method for automatically extracting the power data of CUDA kernels from a long measurement sequence, and carry out an accurate and effective power analysis of CUDA kernels. Using the proposed method, we measured a sample kernel that performs single-precision floating-point additions on a GeForce 8800 GTS. Our results suggest that the power consumed by a non-working thread in an under-occupied half-warp is 71% of the power consumed by a working thread.
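The extraction step can be pictured with a toy version (simple thresholding against an idle baseline is our simplification here, not necessarily the paper's exact method): find the contiguous spans of a sampled power trace that rise above idle power, which mark kernel executions within a long measurement sequence.

```python
# Locate kernel executions in a power trace by finding contiguous spans
# where the sampled power exceeds the idle baseline plus a noise margin.
def extract_kernel_spans(trace, idle_watts, margin=0.5):
    """Return (start, end) sample-index pairs of above-idle activity."""
    spans, start = [], None
    for i, w in enumerate(trace):
        active = w > idle_watts + margin
        if active and start is None:
            start = i                       # span begins
        elif not active and start is not None:
            spans.append((start, i))        # span ends before sample i
            start = None
    if start is not None:
        spans.append((start, len(trace)))   # trace ends mid-span
    return spans
```

Each extracted span can then be integrated to obtain per-kernel energy.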


Cluster Computing and the Grid | 2012

Accelerating 2-opt and 3-opt Local Search Using GPU in the Travelling Salesman Problem

Kamil Rocki; Reiji Suda

In this paper we present high-performance GPU implementations of the 2-opt and 3-opt local search algorithms used to solve the Traveling Salesman Problem. The main idea behind them is to take a route that crosses over itself and reorder it so that it does not; this is a very important local search technique. GPU usage greatly decreases the time needed to find the best edges to be swapped in a route. Our results show that at least 90% of the time during Iterated Local Search is spent on the local search itself. We used 13 TSPLIB problem instances with sizes ranging from 100 to 4461 cities for testing. Our results show that by using our GPU algorithm, the time needed to find optimal swaps can be decreased approximately 3 to 26 times compared to parallel CPU code using 32 cores. Additionally, we point out the memory bandwidth limitation of current parallel architectures: we show that recomputing data is usually faster than reading it from memory on multi-core systems, and we propose this approach as a solution.
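The 2-opt move itself can be sketched sequentially (the paper's contribution is the massively parallel GPU search for the best swap, not this basic loop): reversing the route segment between two positions removes a crossing whenever the new pair of edges is shorter than the old pair.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# One 2-opt pass: apply the first improving move found, if any.
# route is a list of city indices; pts maps index -> (x, y).
def two_opt_pass(route, pts):
    n = len(route)
    for i in range(n - 1):
        # skip j values that would pair two edges sharing a city
        for j in range(i + 2, n - (1 if i == 0 else 0)):
            a, b = pts[route[i]], pts[route[i + 1]]
            c, d = pts[route[j]], pts[route[(j + 1) % n]]
            if dist(a, b) + dist(c, d) > dist(a, c) + dist(b, d) + 1e-12:
                route[i + 1:j + 1] = reversed(route[i + 1:j + 1])
                return True
    return False
```

On the GPU, the O(n^2) candidate pairs above are what get evaluated in parallel to find the best swap per iteration.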


Asia and South Pacific Design Automation Conference | 2009

Aspects of GPU for general purpose high performance computing

Reiji Suda; Takayuki Aoki; Shoichi Hirasawa; Akira Nukada; Hiroki Honda; Satoshi Matsuoka

We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoint of parallel computing. The major weak points of GPUs relative to the newest supercomputers are identified and summarized as four points: large SIMD vector length, small memory, absence of a fast L2 cache, and a high register spill penalty. On the software side, we derive an optimal scheduling algorithm for latency hiding of host-device data transfers, and discuss SPMD parallelism on GPUs.
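The latency-hiding idea can be illustrated with a toy time model (our simplification, not the paper's derived optimal schedule): splitting the input into k chunks lets chunk i's computation overlap chunk i+1's host-to-device copy.

```python
# Pipelined transfer/compute time with k equal chunks and perfect
# copy/compute overlap: only the first copy and last compute are
# exposed; the slower stage paces the middle of the pipeline.
def pipelined_time(transfer, compute, k):
    t, c = transfer / k, compute / k
    return t + (k - 1) * max(t, c) + c
```

With equal transfer and compute times of 10 units each, k=1 gives 20 units, while k=10 brings the total down to 11, approaching max(transfer, compute) as k grows.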


IEEE International Conference on Dependable, Autonomic and Secure Computing | 2011

A Performance and Energy Consumption Analytical Model for GPU

Cheng Luo; Reiji Suda

Even with powerful hardware for parallel execution, it is still difficult to improve application performance and reduce energy consumption without understanding the performance bottlenecks of parallel programs on GPU architectures. To help programmers gain better insight into the performance and energy-saving bottlenecks of parallel applications on GPU architectures, we propose two models: an execution time prediction model and an energy consumption prediction model. The execution time prediction model (ETPM) can estimate the execution time of massively parallel programs, taking instruction-level and thread-level parallelism into consideration. ETPM contains two components: a memory sub-model and a computation sub-model. The memory sub-model estimates the cost of memory instructions by considering the number of active threads and the GPU memory bandwidth. Correspondingly, the computation sub-model estimates the cost of computation instructions by considering the number of active threads and the application's arithmetic intensity. We use Ocelot to analyze PTX code and obtain several input parameters for the two sub-models, such as the number of memory transactions and the data size. Based on the two sub-models, the analytical model estimates the cost of each instruction while considering instruction-level and thread-level parallelism, thereby estimating the overall execution time of an application. The energy consumption prediction model (ECPM) can estimate the total energy consumption based on the data from ETPM. We compare the outcomes of the models with actual executions on a GTX 260 and a Tesla C2050. The results show that the models reach almost 90% accuracy on average for the benchmarks we used.
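A drastically reduced sketch in the spirit of the two sub-models (the function names, parameters, and formulas here are our illustration, not ETPM/ECPM themselves): a memory cost from bytes over bandwidth, a computation cost from operation count over throughput, combined by assuming the dominant one hides the other, with the energy estimate following from the predicted time.

```python
# Toy execution-time estimate: memory and compute sub-costs, scaled by
# occupancy, combined by taking the dominant (i.e. non-hidden) one.
def estimate_time(mem_bytes, bandwidth, n_ops, throughput,
                  active_threads, max_threads):
    occupancy = min(1.0, active_threads / max_threads)
    mem_cost = mem_bytes / (bandwidth * occupancy)
    comp_cost = n_ops / (throughput * occupancy)
    return max(mem_cost, comp_cost)

# Toy energy estimate: energy follows from predicted time and mean power.
def estimate_energy(time, avg_power):
    return time * avg_power
```

For a kernel moving 1 GB over a 100 GB/s bus while issuing 10^9 operations at 1 Tops/s, the memory term dominates and the model predicts 10 ms.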


International Workshop on OpenMP | 2005

Performance evaluation of parallel sparse matrix-vector products on SGI Altix3700

Hisashi Kotakemori; Hidehiko Hasegawa; Tamito Kajiyama; Akira Nukada; Reiji Suda; Akira Nishida

The present paper discusses scalable implementations of sparse matrix-vector products, which are crucial for high-performance solutions of large-scale linear equations, on the cc-NUMA machine SGI Altix3700. Three storage formats for sparse matrices are evaluated, and scalability is attained by implementations that take the page allocation mechanism of the NUMA machine into account. The influence of the cache/memory bus architecture on the optimum choice of storage format is examined, and scalable converters between storage formats are shown to facilitate exploitation of storage formats with higher performance.
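For concreteness, here is the sparse matrix-vector product in the common CRS (compressed row storage) format; the abstract does not name its three formats, so this is our example, not necessarily one of the paper's. NUMA-aware implementations parallelize the outer loop by rows so each thread reads pages it allocated itself.

```python
# CRS/CSR sparse matrix-vector product y = A x.
# row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
# col_idx and vals hold their column indices and values.
def csr_spmv(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y
```

The irregular, row-dependent inner loop is why the choice of storage format interacts with the cache and memory-bus architecture.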


International Conference on Green Computing | 2010

Investigation on the power efficiency of multi-core and GPU Processing Element in large scale SIMD computation with CUDA

Da Qi Ren; Reiji Suda

The CPU-GPU Processing Element (PE) has become a very popular architecture for constructing modern multiprocessing systems because of its high performance on massively parallel processing and vector computations. Power dissipation is one of the important factors influencing the development of High Performance Computing (HPC), since a large-scale scientific computation may use thousands of processors and hundreds of hours of continuous execution, resulting in enormous energy demands. Enhancing the utilization of an individual PE to reach its best computation capability and power efficiency is valuable for reducing the overall power cost of large multiprocessing systems. The power performance of a CUDA PE depends on the electrical features of its hardware components and their interconnections, as well as on the high-level applications and parallel algorithms executed on it. Based on measurements and experimental evaluations, in this work we provide a load sharing method that adjusts the workload assignment between the CPU and GPU components of a CUDA PE in order to optimize the overall power efficiency. The improvements in computation time and power consumption have been validated by examining program executions with the above method applied on real systems.
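The load-sharing idea can be sketched as follows (the proportional split rule is our illustration based on measured throughputs, not the paper's exact method): give each processing element a share of the work proportional to its measured rate so that CPU and GPU finish at roughly the same time, leaving neither idle and burning power.

```python
# Split total_items between CPU and GPU in proportion to their measured
# throughputs (items per second), equalizing the two finish times.
def split_workload(total_items, cpu_rate, gpu_rate):
    cpu_items = round(total_items * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_items, total_items - cpu_items
```

For example, a GPU four times faster than the CPU receives 80% of the items, so both sides complete in about the same wall-clock time.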


Computational Science and Engineering | 2009

Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA

Da Qi Ren; Reiji Suda

Power efficiency is one of the most important issues in high performance computing (HPC), interrelated with both software and hardware. The power dissipation of a program depends on the algorithm design and the power characteristics of the computer components on which the program runs. In this work, we measure and model the power consumption of large matrix multiplication on a multi-core CPU and GPU platform. By incorporating the major physical power constraints of the hardware components with an analysis of program execution behavior, we reduce the overall power consumption by using a multithreaded CPU to control two GPU devices computing synchronously in parallel. By implementing the above method on a real system, we show that it saves 22% of the energy and speeds up kernel execution by 71%, compared with solving the same large matrix multiplication using a single CPU and GPU combination.
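The host-side control structure can be sketched like this (plain Python threads and a pure-Python matmul stand in for the two CUDA device contexts; names here are our illustration): one thread per device, each computing half of the result rows, joined synchronously.

```python
import threading

# Worker: compute the given result rows of out = A @ B.
def matmul_rows(A, B, rows, out):
    for i in rows:
        out[i] = [sum(A[i][k] * B[k][j] for k in range(len(B)))
                  for j in range(len(B[0]))]

# Host: one control thread per "device", each owning half of the rows,
# started together and joined before the result is returned.
def dual_device_matmul(A, B):
    n = len(A)
    out = [None] * n
    halves = [range(0, n // 2), range(n // 2, n)]
    threads = [threading.Thread(target=matmul_rows, args=(A, B, r, out))
               for r in halves]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

In a real CUDA version each thread would bind its own device and stream; the row partition is what lets the two devices proceed without synchronizing mid-computation.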


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

High Performance GPU Accelerated Local Optimization in TSP

Kamil Rocki; Reiji Suda

This paper presents a high-performance GPU-accelerated implementation of the 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage significantly decreases the execution time needed for tour optimization; however, it also requires a complicated and well-tuned implementation. As the problem size grows, the time spent on local optimization comparing the graph edges grows significantly. According to our results based on instances from the TSPLIB library, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to a corresponding parallel CPU implementation using 6 cores. The code has been implemented in both OpenCL and CUDA and tested on AMD and NVIDIA devices. The experimental studies show that the optimization algorithm using GPU local search converges up to 300 times faster than the sequential CPU version on average, depending on the problem size. The main contributions of this paper are a problem division scheme exploiting data locality, which allows arbitrarily large problem instances to be solved using the GPU, and the parallel implementation of the algorithm itself.

Collaboration


Dive into Reiji Suda's collaborations.

Top Co-Authors

Tamito Kajiyama

Universidade Nova de Lisboa
