Publication


Featured research published by Kohta Nakashima.


International Symposium on Computing and Networking | 2014

Unified Performance Profiling of an Entire Virtualized Environment

Masao Yamamoto; Miyuki Ono; Kohta Nakashima; Akira Hirai

Performance analysis and troubleshooting of cloud applications are challenging. In particular, identifying the root causes of performance problems is quite difficult, because profiling tools based on processor performance counters do not yet work well across an entire virtualized environment, the underlying infrastructure of cloud computing. In this work, we explore an approach to unified performance profiling of an entire virtualized environment by sampling only at the virtual machine monitor (VMM) level and applying common-time-based analysis across the whole environment, from the VMM to all guests on a host machine. Our approach involves three steps: centralized data sampling at the VMM level, generation of symbol maps for the programs running in guests, and unified analysis of the entire virtualized environment on a common host-time axis. We describe the design of unified profiling for an entire virtual machine (VM) environment and implement a unified VM profiler based on hardware performance counters. Finally, our results demonstrate accurate profiling, with lower overhead than a previous study because no additional context switches are caused by virtual interrupt injection into guests during measurement.
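The three steps of the abstract can be sketched conceptually. This is not the authors' tool, only a minimal illustration of the idea: samples from the VMM and all guests are stamped with host time, then resolved through per-guest symbol maps into one unified timeline. All names and addresses below are made up for illustration.

```python
# Illustrative sketch: merge VMM-level samples from several guests onto one
# host-time axis and resolve instruction pointers via per-guest symbol maps.

# Per-guest symbol maps: (start address, end address, symbol) for programs
# running in each domain, as step 2 of the approach would generate them.
symbol_maps = {
    "vmm":    [(0x1000, 0x2000, "vmexit_handler")],
    "guest1": [(0x4000, 0x5000, "app_main"), (0x5000, 0x6000, "libc_memcpy")],
    "guest2": [(0x4000, 0x5000, "db_query")],
}

# Step 1: samples collected centrally at the VMM, each stamped with host time.
samples = [
    {"host_time": 10, "domain": "guest1", "ip": 0x4100},
    {"host_time": 11, "domain": "vmm",    "ip": 0x1200},
    {"host_time": 12, "domain": "guest2", "ip": 0x4800},
    {"host_time": 13, "domain": "guest1", "ip": 0x5200},
]

def resolve(domain, ip):
    """Map an instruction pointer to a symbol using the domain's symbol map."""
    for start, end, name in symbol_maps.get(domain, []):
        if start <= ip < end:
            return name
    return "[unknown]"

# Step 3: one unified timeline across the VMM and all guests, in host time.
timeline = [(s["host_time"], s["domain"], resolve(s["domain"], s["ip"]))
            for s in sorted(samples, key=lambda s: s["host_time"])]
for t, dom, sym in timeline:
    print(t, dom, sym)
```

Because every sample carries a host timestamp, activity in one guest can be correlated with VMM events and other guests without any clock translation per guest.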


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

The Design of Seamless MPI Computing Environment for Commodity-Based Clusters

Shinji Sumimoto; Kohta Nakashima; Akira Naruse; Kouichi Kumon; Takashi Yasui; Yoshikazu Kamoshida; Hiroya Matsuba; Atsushi Hori; Yutaka Ishikawa

This paper describes the design and implementation of a seamless MPI runtime environment, called MPI-Adapter, that provides binary portability of MPI programs across different MPI runtime environments. MPI-Adapter enables an MPI binary to run on different MPI implementations. It is implemented as a dynamically loadable module that captures all MPI function calls and invokes the corresponding functions of a different MPI implementation using data-type translation techniques. A prototype system was implemented for Linux PC clusters to evaluate the effectiveness of MPI-Adapter. Results on a Xeon (3.8 GHz) based cluster show that the translation overhead of an MPI send (receive) is around 0.028 μs, and that the performance degradation of MPI-Adapter is negligibly small on the NAS Parallel Benchmark IS.
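The core of the translation idea can be illustrated with a toy sketch (the real MPI-Adapter is a dynamically loaded native library; the handle values below are invented, since actual values are implementation-specific): the same logical datatype has different handle values in different MPI ABIs, so an intercepted call must remap handles before forwarding.

```python
# Toy sketch of datatype-handle translation between two MPI ABIs.
# "MPI-A" is the implementation the binary was compiled against,
# "MPI-B" is the implementation actually loaded at run time.

MPIA_INT, MPIA_DOUBLE = 0x4C000405, 0x4C00080B   # invented MPI-A handles
MPIB_INT, MPIB_DOUBLE = 1, 2                     # invented MPI-B handles

DTYPE_MAP = {MPIA_INT: MPIB_INT, MPIA_DOUBLE: MPIB_DOUBLE}

def runtime_mpi_send(buf, count, dtype):
    # Stand-in for the runtime implementation's real MPI_Send.
    return ("sent", count, dtype)

def adapter_mpi_send(buf, count, dtype):
    """Intercepted MPI_Send: translate the datatype handle, then forward."""
    return runtime_mpi_send(buf, count, DTYPE_MAP[dtype])

# The application calls with MPI-A handles; the adapter forwards to MPI-B.
print(adapter_mpi_send(b"data", 4, MPIA_INT))
```

Because the remapping is a table lookup per argument, the per-call cost stays tiny, which is consistent with the sub-microsecond overhead the paper reports.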


International Conference on Performance Engineering | 2016

Execution Time Compensation for Cloud Applications by Subtracting Steal Time based on Host-Level Sampling

Masao Yamamoto; Kohta Nakashima

Accurate measurement of program execution time is indispensable for time-based charging and performance debugging in all computer systems. However, the execution time of a cloud application cannot be measured properly, because measurement inside a virtual machine (VM) includes additional time called steal time. The steal time of each program in a VM is invisible to standard operating system (OS) tools, so it is quite difficult for performance engineers to obtain the accurate execution time of each program in a VM. In this ongoing work, we highlight steal time in a broader sense and describe how to compensate the function-level execution time of each program in a VM. Our approach subtracts steal time based on time-series data from host-level sampling of each function. We implement it as a host-level kernel module based on hardware performance counters, plus some user-level analysis programs; our method therefore requires no modification of user applications, guest OSes, the virtual machine monitor (VMM), or the host OS. Finally, our results demonstrate accurate execution time of function-level guest programs, with an overhead below 1% for practical use.
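The compensation arithmetic can be sketched in a few lines. This is a simplified illustration, not the authors' kernel module: host-level samples record which guest function was current and whether the vCPU was actually running or stolen, and the per-function steal time is subtracted from the in-VM measurement. Function names and the sampling interval are invented.

```python
# Sketch of function-level steal-time subtraction from host-level samples.
# Each sample covers one 1-ms interval: (guest function, vCPU state).
samples = [
    ("encode", "running"), ("encode", "stolen"), ("encode", "running"),
    ("flush",  "running"), ("flush",  "stolen"), ("flush",  "stolen"),
]

def compensated_times(samples, interval_ms=1):
    measured, steal = {}, {}
    for func, state in samples:
        # In-VM measurement counts every interval, including stolen ones.
        measured[func] = measured.get(func, 0) + interval_ms
        if state == "stolen":
            steal[func] = steal.get(func, 0) + interval_ms
    # Compensated time = measured time minus steal time, per function.
    return {f: measured[f] - steal.get(f, 0) for f in measured}

print(compensated_times(samples))  # {'encode': 2, 'flush': 1}
```

The naive in-VM view would report 3 ms for each function; subtracting the per-function steal recovers the time each function actually ran.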


Parallel, Distributed and Network-Based Processing | 2015

Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Masahiro Miwa; Kohta Nakashima

MPI non-blocking collective operations offer a high-level interface to MPI library users and potentially allow communication to be overlapped with computation. Progression, which drives communication in the background of the computation, is the key factor in achieving efficient overlap. The most commonly used method is manual progression, in which a progression function is called from the main computation; users must then estimate communication timing to maximize the overlap effect and manage complex communication optimization. An alternative approach is a separate communication thread, which achieves communication-computation overlap simply. However, context switches between the computation thread and the communication thread degrade performance in the frequent case where all cores are used for computation. In this paper, we propose a novel threaded progression method using Hyper-Threading to maximize the overlap effect of non-blocking collective operations. We apply MONITOR/MWAIT instructions in the communication thread running on a hyper-thread so that it does not degrade the computation thread through shared core-resource conflicts. Evaluation on an 8-node InfiniBand-connected IA server cluster confirmed that latency stays small and that our approach outperforms manual progression in communication-computation overlap. On a real application, the CG benchmark, our method reduced execution time by 32% compared to blocking collective operations, which is nearly perfect overlap. Although manual progression also achieved perfect overlap, our method has the advantage that no per-application tuning of communication timing is required.
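The structure of threaded progression can be sketched with standard Python threading. This is only an analogy: the paper's mechanism is hardware MONITOR/MWAIT on a hyper-thread, and here a condition variable plays the role of "sleep until the watched location is written", which is the property that keeps the progression thread from competing for core resources. All class and message names are invented.

```python
# Analogy sketch: a background progression thread that sleeps until work is
# posted (condition variable standing in for MONITOR/MWAIT), instead of
# busy-polling and stealing the computation thread's core resources.
import threading

class ProgressionThread:
    def __init__(self):
        self.cv = threading.Condition()
        self.pending = []              # queued communication steps
        self.done = []
        self.stop = False
        self.t = threading.Thread(target=self._run, daemon=True)
        self.t.start()

    def _run(self):
        while True:
            with self.cv:
                # Wait until notified -- the "monitor a write, then wake" step.
                while not self.pending and not self.stop:
                    self.cv.wait()
                if self.stop and not self.pending:
                    return
                work = self.pending.pop(0)
            self.done.append(("progressed", work))  # do one communication step

    def submit(self, work):
        with self.cv:
            self.pending.append(work)
            self.cv.notify()           # the write that wakes the waiter

    def shutdown(self):
        with self.cv:
            self.stop = True
            self.cv.notify()
        self.t.join()

p = ProgressionThread()
for msg in ("bcast-part-1", "bcast-part-2"):
    p.submit(msg)                      # main thread keeps computing meanwhile
p.shutdown()
print(p.done)
```

The design point is that the main thread never calls a progression function itself, so no per-application timing tuning is needed, while the waiter consumes (almost) nothing until woken.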


International Conference on Supercomputing | 2014

Hardware-assisted scalable flow control of shared receive queue

Teruo Tanimoto; Takatsugu Ono; Kohta Nakashima; Takashi Miyoshi

The total number of processor cores in supercomputers is increasing while memory per core is decreasing, due to the adoption of multi-core processors. The Shared Receive Queue is a technique that effectively reduces the memory usage of buffers, but its lack of flow control results in oversized buffer pools. We propose hardware-assisted flow control that reduces flow-control latency by 95.1%, enabling scalable supercomputers with multi-core processors.
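The flow-control protocol being accelerated can be modeled abstractly. This is a generic credit-based flow-control sketch, not the paper's hardware design: a sender may inject a message only while it holds a credit (a guaranteed receive buffer), and the receiver returns a credit when it frees a buffer. Buffer counts are invented.

```python
# Generic credit-based flow-control model for a shared receive queue.

class Receiver:
    def __init__(self, buffers):
        self.free = buffers        # free receive buffers in the shared queue

    def deliver(self):
        """Consume one buffered message, freeing a buffer (credit returns)."""
        self.free += 1

class Sender:
    def __init__(self, receiver, credits):
        self.rx = receiver
        self.credits = credits     # initial credits = buffers reserved for us
        self.stalled = 0

    def send(self, msg):
        if self.credits == 0:
            self.stalled += 1      # must stall until a credit comes back
            return False
        self.credits -= 1
        self.rx.free -= 1
        return True

    def receive_credit(self):
        self.credits += 1

rx = Receiver(buffers=2)
tx = Sender(rx, credits=2)
sent = [tx.send(i) for i in range(3)]   # third send finds no credit
rx.deliver(); tx.receive_credit()       # buffer freed, credit returned
sent.append(tx.send(3))
print(sent)                             # [True, True, False, True]
```

Doing the credit-return round trip in software adds latency on every stall; performing it in hardware, as the paper proposes, is what shrinks that latency and lets the buffer pool stay small.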


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013

Interference-aware Incoming Message Detection for MPI Threaded Progression

Masahiro Miwa; Kohta Nakashima; Akira Naruse

To overlap computation and communication with non-blocking collective communication, a sequence of communications must be progressed asynchronously. One naive implementation uses a separate thread for communication, running in the background of the computation thread. However, if the total number of threads exceeds the number of physical cores, context switches degrade the performance of the computation thread. Simultaneous multithreading (SMT) can be used to avoid this problem, but the commonly used busy polling for incoming-message detection also degrades the computation thread. In this paper, we propose an incoming-message detection method using MONITOR/MWAIT instructions to reduce this degradation. Experimental results show that the performance of the computation thread improves greatly compared to the busy-polling method, while latency increases only slightly.
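The two detection strategies being compared can be contrasted with a deterministic model. This is an illustration only, not a measurement: "mwait" here stands in for MONITOR/MWAIT, where the detector does no work until the watched flag is written, whereas busy polling burns a check on every time step and competes with the computation thread for the shared core's resources.

```python
# Model of incoming-message detection: busy polling vs. MONITOR/MWAIT-style
# waiting, over a fixed trace of time steps.

def detect(trace, strategy):
    """trace: per-step flag values; returns (arrival step, checks made)."""
    checks = 0
    for step, flag_written in enumerate(trace):
        if strategy == "busy-poll":
            checks += 1            # polls the flag every step
        elif strategy == "mwait" and flag_written:
            checks += 1            # woken only by the write to the flag
        if flag_written:
            return step, checks

trace = [False] * 5 + [True]       # message arrives at step 5
print(detect(trace, "busy-poll"))  # (5, 6): six polls before detection
print(detect(trace, "mwait"))      # (5, 1): a single wake-up
```

Both strategies detect the message at the same step, which mirrors the paper's finding: resource consumption drops sharply while detection latency grows only slightly (the model ignores the small hardware wake-up cost).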


Archive | 2012

Network Apparatus and Network Managing Apparatus

Kohta Nakashima; Akira Naruse


Archive | 2011

Apparatus and Method for Storing a Port Number in Association with One or More Addresses

Kohta Nakashima; Akira Naruse


Archive | 2006

Device and method for input and output of data

Kohta Nakashima; Kouichi Kumon


Archive | 2012

Network Managing Device and Network Managing Method

Kohta Nakashima; Akira Naruse
