Publication


Featured research published by Taisuke Boku.


International Parallel and Distributed Processing Symposium | 2006

Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster

Yoshihiko Hotta; Mitsuhisa Sato; Hideaki Kimura; Satoshi Matsuoka; Taisuke Boku; Daisuke Takahashi

Currently, several of the high-performance processors used in PC clusters have a DVS (dynamic voltage scaling) architecture that can dynamically scale processor voltage and frequency. Adaptive scheduling of the voltage and frequency enables us to reduce power dissipation without a performance slowdown during communication and memory access. In this paper, we propose a method of profile-based power-performance optimization by DVS scheduling in a high-performance PC cluster. We divide the program execution into several regions and select the best gear for power efficiency. Selecting the best gear is not straightforward, since the DVS transition is not free. We propose an optimization algorithm that selects a gear using the execution and power profiles, taking the transition overhead into account. We have designed and built a power-profiling system, PowerWatch. With this system we examined the effectiveness of our optimization algorithm on two types of power-scalable clusters (Crusoe and Turion). According to the results of benchmark tests, we achieved an almost 40% reduction in EDP (energy-delay product) with little performance impact (less than 5%) compared to results at the standard clock frequency.
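The core trade-off of the abstract above (a lower gear only pays off in regions that are not CPU-bound, and switching gears has its own cost) can be sketched as follows. This is an illustrative greedy selection with made-up gear numbers, not the paper's algorithm:

```python
GEARS = {                      # gear -> (relative speed, power in watts); illustrative numbers
    "high": (1.00, 30.0),
    "mid":  (0.75, 18.0),
    "low":  (0.50, 10.0),
}
TRANSITION_COST = 0.05         # assumed fixed penalty per gear switch

def region_time(t_high, cpu_frac, speed):
    """Region time at a gear: the CPU-bound fraction stretches as 1/speed;
    the memory/communication-bound part does not slow down."""
    return t_high * (cpu_frac / speed + (1.0 - cpu_frac))

def select_gears(regions):
    """For each region (time at top gear, CPU-bound fraction), greedily
    pick the gear with the lowest EDP, charging a cost on gear changes."""
    schedule, prev = [], "high"
    for t_high, cpu_frac in regions:
        best, best_cost = None, float("inf")
        for gear, (speed, power) in GEARS.items():
            t = region_time(t_high, cpu_frac, speed)
            cost = power * t * t           # EDP = energy x delay = (P*t)*t
            if gear != prev:
                cost += TRANSITION_COST    # DVS transition overhead
            if cost < best_cost:
                best, best_cost = gear, cost
        schedule.append(best)
        prev = best
    return schedule
```

With these numbers, a purely CPU-bound region stays at the top gear, while a memory-bound region (e.g. 20% CPU-bound) drops to the lowest gear, which is the behavior the paper exploits.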


International Conference on Cluster Computing | 2006

Empirical Study on Reducing Energy of Parallel Programs Using Slack Reclamation by DVFS in a Power-Scalable High-Performance Cluster

Hideaki Kimura; Mitsuhisa Sato; Yoshihiko Hotta; Taisuke Boku; Daisuke Takahashi

It has become important to improve the energy efficiency of high-performance PC clusters. In PC clusters, high-performance microprocessors have a dynamic voltage and frequency scaling (DVFS) mechanism, which allows the voltage and frequency to be lowered to reduce energy consumption. In this paper, we propose a new algorithm that reduces the energy consumption of a parallel program executed on a power-scalable cluster using DVFS. Whenever the computational load is not balanced, parallel programs encounter slack time, that is, they must wait for synchronization of the tasks. Our algorithm reclaims slack time by changing the voltage and frequency, which allows a reduction in energy consumption without impacting the performance of the program. Our algorithm can be applied to parallel programs represented by a directed acyclic task graph (DAG). It selects an appropriate set of voltages and frequencies (called the gear) that allows each task to execute at the lowest frequency that does not increase the overall execution time, while keeping the tasks' frequencies as uniform as possible. We built two different types of power-scalable clusters using AMD Turion and Transmeta Crusoe. For the empirical study of energy reduction in PC clusters, we designed a toolkit called PowerWatch that includes power-monitoring tools and the DVFS control library. This toolkit precisely measures the power consumption of the entire cluster in real time. The experimental results using benchmark problems show that our algorithm reduces energy consumption by 25% with only a 1% loss in performance.
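The slack-reclamation idea described above reduces to a simple frequency choice per task. A minimal sketch (assumed gear list, not the paper's exact gear-selection algorithm): pick the lowest frequency at which the stretched task still fits within its original time plus the slack.

```python
FREQS_GHZ = [2.0, 1.6, 1.2, 0.8]   # assumed available DVFS gears

def reclaim_slack(task_time_s, slack_s, f_max=2.0):
    """Pick the lowest frequency f such that the stretched task time
    task_time * (f_max / f) still fits within task_time + slack."""
    deadline = task_time_s + slack_s
    for f in sorted(FREQS_GHZ):            # try the lowest frequency first
        if task_time_s * (f_max / f) <= deadline:
            return f
    return f_max
```

A task with slack equal to its own run time can drop from 2.0 GHz to 1.2 GHz without extending the critical path; a task with no slack must stay at full frequency.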


Journal of Computational Physics | 2010

A massively-parallel electronic-structure calculations based on real-space density functional theory

Jun-Ichi Iwata; Daisuke Takahashi; Atsushi Oshiyama; Taisuke Boku; Kenji Shiraishi; Susumu Okada; Kazuhiro Yabana

Based on the real-space finite-difference method, we have developed a first-principles density functional program that efficiently performs large-scale calculations on massively-parallel computers. In addition to efficient parallel implementation, we also implemented several computational improvements, substantially reducing the computational costs of O(N^3) operations such as the Gram–Schmidt procedure and subspace diagonalization. Using the program on a massively-parallel computer cluster with a theoretical peak performance of several TFLOPS, we perform electronic-structure calculations for a system consisting of over 10,000 Si atoms, and obtain a self-consistent electronic structure in a few hundred hours. We analyze in detail the costs of the program in terms of computation and inter-node communication to clarify the efficiency, the applicability, and the possibility of further improvements.
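To see why the Gram–Schmidt procedure mentioned above is an O(N^3) bottleneck: orthonormalizing N vectors of length N requires up to N projections per vector, each costing O(N). A minimal pure-Python sketch of the procedure:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors.
    For N vectors of length N this is O(N^3): N vectors, up to N
    projections each, O(N) work per projection."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:                       # subtract projections onto earlier basis vectors
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis
```

In a DFT code the "vectors" are wavefunctions on a real-space grid, so N is large and this cubic cost dominates, which is why the paper targets it for optimization.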


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer

Yukihiro Hasegawa; Jun-Ichi Iwata; Miwako Tsuji; Daisuke Takahashi; Atsushi Oshiyama; Kazuo Minami; Taisuke Boku; Fumiyoshi Shoji; Atsuya Uno; Motoyoshi Kurokawa; Hikaru Inoue; Ikuo Miyoshi; Mitsuo Yokokawa

Real-space DFT (RSDFT) is a simulation technique well suited to massively parallel architectures for performing first-principles electronic-structure calculations based on density functional theory. We here report unprecedented simulations of the electron states of silicon nanowires with up to 107,292 atoms, carried out during the initial performance evaluation phase of the K computer being developed at RIKEN. The RSDFT code has been parallelized and optimized to make effective use of the various capabilities of the K computer. Simulation results for the self-consistent electron states of a silicon nanowire with 10,000 atoms were obtained in a run lasting about 24 hours and using 6,144 cores of the K computer. A sustained performance of 3.08 petaflops was measured for one iteration of the SCF calculation in a 107,292-atom Si nanowire calculation using 442,368 cores, which is 43.63% of the peak performance of 7.07 petaflops.


Cluster Computing and the Grid | 2012

Productivity and Performance of Global-View Programming with XcalableMP PGAS Language

Masahiro Nakao; Jinpil Lee; Taisuke Boku; Mitsuhisa Sato

XcalableMP (XMP) is a PGAS parallel language defined as a directive-based extension of C and Fortran. While it supports coarrays as a local-view programming model, the XMP global-view programming model is useful for parallelizing data-parallel programs by adding directives with minimal code modification. This paper considers the productivity and performance of the XMP global-view programming model. In the global-view programming model, a programmer describes data distributions and work mapping so that computations are mapped to the nodes where the computed data are located. Global-view communication directives are used to move parts of the distributed data globally and to maintain consistency in the shadow area. The rich set of XMP global-view directives can significantly reduce the cost of parallelization, and "privatization" optimizations become unnecessary. For the productivity and performance study, the Omni XMP compiler and the Berkeley Unified Parallel C compiler are used. Experimental results show that XMP can implement the benchmarks with a smaller programming cost than UPC. Furthermore, XMP has higher access performance than UPC for global data that have affinity with the accessing process. In addition, the XMP coarray feature can be used to tune application performance effectively.
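What the global-view directives automate can be sketched in plain Python (this illustrates the underlying block distribution concept, not XMP syntax): a global index range is split into contiguous blocks, one per node, and each node computes only the iterations it owns.

```python
def block_ranges(n, nodes):
    """Block-distribute global indices 0..n-1 over `nodes` nodes,
    the way an XMP-style 'distribute ... block' mapping would:
    contiguous, near-equal chunks, one per node."""
    base, rem = divmod(n, nodes)
    ranges, start = [], 0
    for p in range(nodes):
        size = base + (1 if p < rem else 0)   # spread the remainder over the first nodes
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each node would then loop only over its own range (owner computes), with boundary "shadow" elements exchanged between neighbors before stencil-style accesses; in XMP both steps come from directives rather than hand-written code.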


International Parallel and Distributed Processing Symposium | 2004

Parallel implementation of Strassen's matrix multiplication algorithm for heterogeneous clusters

Yuhsuke Ohtaki; Daisuke Takahashi; Taisuke Boku; Mitsuhisa Sato

Summary form only given. We propose a new distribution scheme for a parallel Strassen's matrix multiplication algorithm on heterogeneous clusters. In a heterogeneous clustering environment, appropriate data distribution is the most important factor in achieving maximum overall performance. However, Strassen's algorithm reduces the total operation count by a factor of about 7/8 per recursion, and hence the recursion level affects the total operation count. Thus, we need to consider not only load balancing but also the recursion level in Strassen's algorithm. Our scheme achieves both load balancing and a reduction of the total operation count. As a result, we achieve a speedup of nearly 21.7% compared to the conventional parallel Strassen's algorithm in a heterogeneous clustering environment.
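The 7/8 factor comes from Strassen's trick of computing a half-size block product with 7 multiplications instead of the classical 8. A quick sketch of the resulting operation count as a function of the recursion level (for power-of-two n):

```python
def strassen_mults(n, levels):
    """Scalar multiplications for an n x n product using `levels`
    Strassen recursion levels, with classical multiplication (n^3
    multiplications) below the cutoff."""
    if levels == 0:
        return n ** 3                        # classical base case
    return 7 * strassen_mults(n // 2, levels - 1)   # 7 half-size products per level
```

After k levels the multiplication count is (7/8)^k of the classical n^3, which is why the choice of recursion level interacts with load balancing in the scheme above.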


International Conference on Computer Design | 2000

SCIMA: Software controlled integrated memory architecture for high performance computing

Masaaki Kondo; Hideki Okawara; Hiroshi Nakamura; Taisuke Boku

Processor performance has been improved by clock acceleration and ILP extraction techniques. The performance of main memory, however, has not improved as much, and the performance gap between processor and memory will keep growing. This is a very serious problem in high-performance computing, because effective performance is limited by memory performance in most cases. To overcome this problem, we propose a new VLSI architecture called SCIMA, which integrates software-controllable memory into the processor chip. Most data accesses in high-performance computing are regular, and software-controllable memory is better suited to exploiting this regularity than a conventional cache. This paper presents the architecture and its performance evaluation. The evaluation results reveal the superiority of SCIMA over conventional cache-based architectures.
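The software-controlled-memory idea can be illustrated with a trivial sketch (conceptual only; the capacity and staging are hypothetical): instead of relying on a hardware cache's replacement policy, the program itself stages one tile of a regularly accessed array into fast on-chip memory, computes on it, and moves on.

```python
SCRATCHPAD_WORDS = 4   # assumed tiny on-chip capacity, for illustration

def staged_sum(data):
    """Process `data` tile by tile through an explicitly managed buffer,
    mimicking software-controlled staging of regular accesses."""
    total = 0
    for start in range(0, len(data), SCRATCHPAD_WORDS):
        tile = data[start:start + SCRATCHPAD_WORDS]   # explicit "load" into the scratchpad
        total += sum(tile)                            # compute only on on-chip data
    return total
```

Because the access pattern is known in advance, the software can guarantee that every word staged is used, something a cache can only approximate for regular scientific kernels.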


International Conference on Supercomputing | 1997

CP-PACS: a massively parallel processor for large scale scientific calculations

Taisuke Boku; K. Itakura; Hiroshi Nakamura; Kisaburo Nakazawa

CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processor with 2048 processing units built at the Center for Computational Physics, University of Tsukuba. It has an MIMD architecture with a distributed memory system. The node processor of CP-PACS is a RISC microprocessor enhanced with a Pseudo Vector Processing feature, which realizes high-performance vector processing. The interconnection network is a 3-dimensional Hyper-Crossbar Network, which has high flexibility and embeddability for various network topologies and communication patterns. The theoretical peak performance of the whole system is 614.4 GFLOPS. In this paper, we give an overview of the CP-PACS architecture and several of its special architectural characteristics. We then describe performance evaluations, both for a single node processor and for the parallel system, based on LINPACK and the Kernel CG of the NAS Parallel Benchmarks. Through these evaluations, the effectiveness of Pseudo Vector Processing and the Hyper-Crossbar Network is shown.
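The quoted system peak is easy to reconstruct, assuming the commonly cited per-node figure of 300 MFLOPS for the CP-PACS node processor (an assumption stated here, not taken from the abstract):

```python
def peak_gflops(units, mflops_per_unit):
    """System peak performance = per-node peak x node count."""
    return units * mflops_per_unit / 1000.0

# 2048 processing units x 300 MFLOPS/node -> 614.4 GFLOPS, matching the text
```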


International Conference on Supercomputing | 1993

A scalar architecture for pseudo vector processing based on slide-windowed registers

Hiroshi Nakamura; Taisuke Boku; Hideo Wada; Hiromitsu Imori; Ikuo Nakata; Yasuhiro Inagami; Kisaburo Nakazawa; Yoshiyuki Yamashita

In this paper, we present a new scalar architecture for high-speed vector processing. Without using cache memory, the proposed architecture tolerates main memory access latency by introducing slide-windowed floating-point registers with a data-preloading feature and pipelined memory. The architecture maintains upward compatibility with existing scalar architectures. In the new architecture, software can control the window structure. This is the advantage over our previous register-window work: registers are utilized more flexibly and computational efficiency is greatly enhanced. Furthermore, this flexibility helps the compiler generate efficient object code easily. We have evaluated its performance on the Livermore Fortran Kernels. The evaluation results show that the proposed architecture reduces the penalty of main memory access better than an ordinary scalar processor or a processor with cache prefetching. The proposed architecture with 64 registers tolerates a memory access latency of 30 CPU cycles. Compared with our previous work, the proposed architecture hides longer memory access latency with fewer registers.


High Performance Interconnects | 2013

Interconnection Network for Tightly Coupled Accelerators Architecture

Toshihiro Hanawa; Yuetsu Kodama; Taisuke Boku; Mitsuhisa Sato

In recent years, heterogeneous clusters using accelerators have entered widespread use in high-performance computing systems. In such clusters, inter-node communication between accelerators normally requires several memory copies via CPU memory, and the resulting communication latency causes severe performance degradation. To address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture, which reduces the communication latency between accelerators on different nodes. In the TCA architecture, PCI Express (PCIe) packets are used for direct inter-node communication between accelerators. In addition, we designed a communication chip, named PCI Express Adaptive Communication Hub Version 2 (PEACH2), to realize the proposed TCA architecture. In this paper, we introduce the design and implementation of the PEACH2 chip using a field-programmable gate array (FPGA), and present a PEACH2 board designed for use as a PCIe extension board. The results of evaluations using ping-pong programs on an eight-node TCA cluster demonstrate that the PEACH2 chip achieves 95% of the theoretical peak performance and a latency of 0.96 μsec.
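Latency figures like the one above are conventionally derived from a timed ping-pong loop: the one-way latency is half the average round-trip time. A minimal bookkeeping sketch (illustrative; the numbers in the test are made up to match the quoted 0.96 μs, not measured data):

```python
def one_way_latency_us(total_elapsed_us, iterations):
    """One-way latency from a ping-pong benchmark: the total elapsed time
    covers `iterations` round trips, and each round trip is two one-way
    message transfers."""
    return total_elapsed_us / (2.0 * iterations)
```

In a real benchmark the loop would send a small message back and forth many times and time the whole loop, so per-call timer overhead is amortized away.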
