Kazuhiko Komatsu
Tohoku University
Publications
Featured research published by Kazuhiko Komatsu.
Parallel and Distributed Computing: Applications and Technologies | 2009
Hiroyuki Takizawa; Katsuto Sato; Kazuhiko Komatsu; Hiroaki Kobayashi
In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a subset of the basic CUDA driver API calls in order to record the status changes in main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data from video memory to main memory, and then disables the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on the GPU so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs, which also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications, which is required especially on heterogeneous GPU cluster systems. The paper also reports the timing overhead of checkpointing.
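The record-and-replay idea behind CheCUDA can be sketched as follows. This is a minimal Python illustration, not CheCUDA's implementation: the names (`GpuRuntime`, `CheckpointingRuntime`, `checkpoint`, `restore`) are hypothetical stand-ins for the hooked CUDA driver API and the stored status.

```python
# Hypothetical sketch: state-changing "API calls" are logged so the GPU
# state can be rebuilt after a restart by replaying the log.

class GpuRuntime:
    """Stand-in for the CUDA runtime: allocations live in 'video memory'."""
    def __init__(self):
        self.vram = {}
        self.next_handle = 0

    def malloc(self, nbytes):
        h = self.next_handle
        self.next_handle += 1
        self.vram[h] = bytearray(nbytes)
        return h

    def memcpy_h2d(self, handle, data):
        self.vram[handle][:len(data)] = data


class CheckpointingRuntime:
    """Hooks state-changing calls and records them in a log."""
    def __init__(self, runtime):
        self.rt = runtime
        self.log = []  # recorded status changes

    def malloc(self, nbytes):
        h = self.rt.malloc(nbytes)
        self.log.append(("malloc", nbytes))
        return h

    def memcpy_h2d(self, handle, data):
        self.rt.memcpy_h2d(handle, data)
        self.log.append(("memcpy_h2d", handle, bytes(data)))

    def checkpoint(self):
        # Copy video-memory contents to the host and save them with the log.
        return {"log": list(self.log),
                "vram": {h: bytes(b) for h, b in self.rt.vram.items()}}

    @staticmethod
    def restore(snapshot):
        # Re-initialize a fresh runtime and replay the recorded calls.
        fresh = CheckpointingRuntime(GpuRuntime())
        for entry in snapshot["log"]:
            if entry[0] == "malloc":
                fresh.malloc(entry[1])
            elif entry[0] == "memcpy_h2d":
                fresh.memcpy_h2d(entry[1], entry[2])
        return fresh
```

The key property mirrored here is that restart does not need the old process: a fresh runtime plus the recorded status changes suffices to recover the GPU-side state.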
International Parallel and Distributed Processing Symposium | 2011
Hiroyuki Takizawa; Kentaro Koyama; Katsuto Sato; Kazuhiko Komatsu; Hiroaki Kobayashi
In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application without any modification or recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to a separate process called an API proxy, which invokes the actual API function; two processes, an application process and an API proxy, are thus launched for each OpenCL application. As the application process is then a standard process rather than an OpenCL process, it can be safely checkpointed. While intercepting all API calls, CheCL records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but instead holds CheCL handles that keep such information. These handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs, including MPI applications, and quantitatively evaluates the runtime overheads. It also discusses how CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as CPUs and GPUs.
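The API-proxy and handle-translation scheme can be sketched as follows. This is an illustrative Python model, not CheCL's code; all names (`ApiProxy`, `CheCLRuntime`, `create_buffer`, `restart`) are assumptions.

```python
# Hypothetical sketch: the application only ever sees CheCL handles; a
# proxy owns the real OpenCL objects, so the application process stays
# free of OpenCL state and remains safely checkpointable.

class ApiProxy:
    """Stand-in for the proxy process that owns the real OpenCL objects."""
    def __init__(self):
        self.real_objects = {}
        self.counter = 0

    def create_buffer(self, size):
        self.counter += 1
        real_handle = ("cl_mem", self.counter)  # pretend OpenCL handle
        self.real_objects[real_handle] = bytearray(size)
        return real_handle


class CheCLRuntime:
    """Intercepts API calls, records creation info, translates handles."""
    def __init__(self, proxy):
        self.proxy = proxy
        self.handle_map = {}    # CheCL handle -> real OpenCL handle
        self.creation_log = []  # info needed to restore objects on restart
        self.next_id = 0

    def create_buffer(self, size):
        real = self.proxy.create_buffer(size)
        checl_handle = self.next_id
        self.next_id += 1
        self.handle_map[checl_handle] = real
        self.creation_log.append(("buffer", checl_handle, size))
        return checl_handle  # the application only ever sees this

    def restart(self, fresh_proxy):
        # After restart, rebuild every object from the recorded log; the
        # application's CheCL handles stay valid because only the
        # handle-to-object map is updated.
        self.proxy = fresh_proxy
        for kind, checl_handle, size in self.creation_log:
            self.handle_map[checl_handle] = fresh_proxy.create_buffer(size)
```

The design point this mirrors is indirection: because the application never holds a raw OpenCL handle, those handles can be invalidated and recreated without the application noticing.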
International Symposium on Parallel and Distributed Processing and Applications | 2011
Katsuto Sato; Kazuhiko Komatsu; Hiroyuki Takizawa; Hiroaki Kobayashi
In this paper, we propose a runtime performance prediction model for automatic selection of accelerators to execute OpenCL kernels. The proposed method is a history-based approach that uses profile data for performance prediction. The profile data are classified into groups, and a separate performance model is derived from each group. As the execution time of a kernel depends on runtime parameters such as kernel arguments, the proposed method first identifies the parameters affecting the execution time by calculating the correlation between each parameter and the execution time. A parameter with weak correlation is used for classifying the profile data and selecting the performance prediction model, whereas a parameter with strong correlation is used for building a linear model that predicts the kernel execution time from the classified profile data alone. Experimental results clearly indicate that the proposed method achieves more accurate performance prediction than conventional history-based approaches because of the profile data classification.
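The two roles the parameters play can be sketched as follows. This is a hedged Python illustration, not the paper's implementation: `pearson`, `fit_linear`, and `predict` are assumed names, and the profile layout is simplified to one weakly and one strongly correlated parameter.

```python
# Illustrative sketch: a weakly correlated parameter selects which
# profile group (hence which model) to use; a strongly correlated one
# feeds a least-squares linear model fit on that group alone.

def pearson(xs, ys):
    # Correlation coefficient, used to decide each parameter's role.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fit_linear(xs, ys):
    # Least-squares fit of y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict(profile, weak_value, strong_value):
    # profile: list of (weak_param, strong_param, exec_time) samples.
    # Classify by the weak parameter, then fit on that group only.
    group = [(s, t) for w, s, t in profile if w == weak_value]
    a, b = fit_linear([s for s, _ in group], [t for _, t in group])
    return a * strong_value + b
```

With two groups obeying different linear laws, a single global fit would blur them; classifying first is what restores accuracy, which is the point the abstract makes.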
IEEE Region 10 Conference | 2012
Alfian Amrizal; Shoichi Hirasawa; Kazuhiko Komatsu; Hiroyuki Takizawa; Hiroaki Kobayashi
As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to I/O bottlenecks and network congestion. To solve this problem, this paper proposes a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which uses the global file system, to ensure that a global checkpoint file is always available even in the case of a catastrophic failure. We discuss the performance of the proposed mechanism through an analysis based on a two-level checkpoint model.
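The trade-off can be illustrated with a toy cost model in Python; the function names and parameter values are assumptions, not the paper's model. Fast local checkpoints are taken often, with a slow global checkpoint interleaved once every few local ones.

```python
# Toy two-level checkpoint cost model (illustrative, not the paper's):
# compare always checkpointing globally against mostly-local checkpoints
# with periodic global ones for catastrophic-failure coverage.

def single_level_cost(c_global, n_checkpoints):
    """Every checkpoint goes to the global file system."""
    return n_checkpoints * c_global

def two_level_cost(c_local, c_global, n_checkpoints, global_every):
    """One in every `global_every` checkpoints is global; the rest are
    cheap local checkpoints on node-local storage."""
    n_global = n_checkpoints // global_every
    n_local = n_checkpoints - n_global
    return n_local * c_local + n_global * c_global
```

For example, with a local checkpoint four times cheaper than a global one (mirroring the reported speedup) and a global checkpoint every fourth time, the total checkpoint overhead drops well below the all-global scheme while a recent global file still always exists.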
Software Automatic Tuning, From Concepts to State-of-the-Art Results | 2011
Katsuto Sato; Hiroyuki Takizawa; Kazuhiko Komatsu; Hiroaki Kobayashi
Recently, the Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to fully exploit the GPU's computing power, a programmer has to carefully adjust several CUDA execution parameters, even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. The paper first discusses the scope of the parameter exploration space, which is determined by the GPU's architectural restrictions. To find the optimal execution parameters, performance models are created by profiling the execution times of a kernel with each promising parameter configuration, and the execution parameters are then determined using those models. The paper evaluates the performance improvement due to the proposed mechanism on two benchmark programs. The evaluation results clarify that the proposed mechanism can select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to that of the optimal one.
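The idea of pruning the exploration space by architectural limits and then choosing the best-profiled configuration can be sketched as follows. The limit values below are example figures, not any specific GPU's, and the function names are hypothetical.

```python
# Illustrative sketch: enumerate only CTA sizes that respect hardware
# restrictions (warp alignment, register-file capacity), then pick the
# configuration with the shortest profiled kernel time.

def candidate_cta_sizes(max_threads_per_cta=1024, warp_size=32,
                        regs_per_thread=32, regs_per_sm=65536):
    # Only warp-aligned sizes whose register demand fits the register
    # file are worth profiling; everything else is pruned up front.
    sizes = []
    for threads in range(warp_size, max_threads_per_cta + 1, warp_size):
        if threads * regs_per_thread <= regs_per_sm:
            sizes.append(threads)
    return sizes

def select_cta(profiled_times):
    # profiled_times: {cta_size: measured kernel execution time}.
    return min(profiled_times, key=profiled_times.get)
```

Pruning matters because profiling every configuration is expensive; restricting the search to architecturally feasible CTA sizes is what makes the profiling-based tuning practical.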
Archive | 2013
Kazuhiko Komatsu; Takashi Soga; Ryusuke Egawa; Hiroyuki Takizawa; Hiroaki Kobayashi
The Building-Cube Method (BCM) has been proposed as a new CFD method for efficient three-dimensional flow simulation on large-scale supercomputing systems, and is based on equally-spaced Cartesian meshes. As the flow domain can be divided into equally-partitioned cells thanks to the equally-spaced meshes, the flow computation can be divided into partial computations of the same computational cost. To achieve high sustained performance, architecture-aware implementations and optimizations that consider the characteristics of supercomputing systems are essential, because there are various types of supercomputing systems, such as scalar, vector, and accelerator types. This paper discusses architecture-aware implementations and optimizations for various supercomputing systems, namely an Intel Nehalem-EP cluster, an Intel Nehalem-EX cluster, Fujitsu FX-1, Hitachi SR16000 M1, NEC SX-9, and a GPU cluster, and analyzes their sustained performance for BCM. The performance analysis shows that memory and network capabilities affect the performance of BCM more than computational potential does.
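The equal-cost decomposition at the heart of BCM can be sketched as follows, assuming a domain whose cell counts divide evenly into cubes; the function name and shapes are illustrative, not taken from a BCM code.

```python
# Illustrative sketch: an equally-spaced Cartesian domain splits into
# cubes holding identical cell counts, so every cube carries the same
# computational cost and load balancing becomes trivial.

def split_into_cubes(domain_cells, cube_cells):
    """domain_cells, cube_cells: (nx, ny, nz) cell counts. The domain
    must divide evenly so all cubes have equal cost."""
    counts = []
    for d, c in zip(domain_cells, cube_cells):
        if d % c != 0:
            raise ValueError("domain must divide into equal cubes")
        counts.append(d // c)
    nxc, nyc, nzc = counts
    # Each cube is identified by its index triple; every cube holds
    # exactly cube_cells[0] * cube_cells[1] * cube_cells[2] cells.
    return [(i, j, k) for i in range(nxc)
                      for j in range(nyc)
                      for k in range(nzc)]
```

Because every cube represents the same amount of work, cubes can be distributed evenly across nodes, which is why the bottleneck shifts from compute balance to memory and network capability, as the analysis in the paper reports.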
Field-Programmable Logic and Applications | 2006
Yoshiyuki Kaeriyama; Daichi Zaitsu; Kazuhiko Komatsu; Ken-Ichi Suzuki; Tadao Nakamura; Nobuyuki Ohba
Ray tracing is a global-illumination-based rendering method widely used in computer graphics. Although it generates photo-realistic images, it requires a large number of computations. In ray tracing, the ray-object intersection test is one of the dominant factors in processing speed. To accelerate the intersection test, we propose a new method based on a plane-sphere intersection algorithm, and present a hardware system implemented on an FPGA. The computations in the method are highly pipelined and parallelized by optimizing the balance between computation speed and memory bandwidth. As a result, the prototype makes full use of the 512 DSP cores built into a Xilinx Virtex-4 SX FPGA, and the average utilization of the DSP cores is close to 90%. Simulation results show that the proposed system, running at 160 MHz, performs the intersection test a few hundred times faster than a commodity PC with a 3.4 GHz Pentium 4.
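The paper's plane-sphere-based algorithm is not reproduced here. As context for why this test dominates ray-tracing time, the conventional baseline it competes with — the standard quadratic ray-sphere intersection test, evaluated once per ray-object pair — can be written as:

```python
import math

def ray_sphere_hit(origin, direction, center, radius):
    """Standard quadratic ray-sphere test: returns the nearest positive
    hit distance along the ray, or None on a miss.
    `direction` must be a unit vector."""
    # Work in the sphere's local frame.
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    # Coefficients of t^2 + b*t + c = 0 (a = 1 for a unit direction).
    b = 2.0 * (direction[0] * ox + direction[1] * oy + direction[2] * oz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None  # ray misses the sphere entirely
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0.0 else None
```

A renderer evaluates this inner kernel millions of times per frame, which is why mapping the intersection test onto hundreds of pipelined DSP cores yields such a large speedup over a single CPU.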
Computers & Fluids | 2011
Takashi Soga; Akihiro Musa; Koki Okabe; Kazuhiko Komatsu; Ryusuke Egawa; Hiroyuki Takizawa; Hiroaki Kobayashi; Shun Takahashi; Daisuke Sasaki; Kazuhiro Nakahashi
IPSJ Online Transactions | 2008
Kazuhiko Komatsu; Yoshiyuki Kaeriyama; Ken-Ichi Suzuki; Hiroyuki Takizawa; Hiroaki Kobayashi
IEICE Transactions on Information and Systems | 2010
Ken-Ichi Suzuki; Yoshiyuki Kaeriyama; Kazuhiko Komatsu; Ryusuke Egawa; Nobuyuki Ohba; Hiroaki Kobayashi