
Publications


Featured research published by Kazutomo Yoshii.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

ZOID: I/O-forwarding infrastructure for petascale architectures

Kamil Iskra; John W. Romein; Kazutomo Yoshii; Peter H. Beckman

The ZeptoOS project is developing an open-source alternative to the proprietary software stacks available on contemporary massively parallel architectures. The aim is to enable computer science research on these architectures, enhance community collaboration, and foster innovation. In this paper, we introduce a component of ZeptoOS called ZOID, an I/O-forwarding infrastructure for architectures such as IBM Blue Gene that decouple file and socket I/O from the compute nodes, shipping those functions to dedicated I/O nodes. Through the use of optimized network protocols and data paths, as well as a multithreaded daemon running on I/O nodes, ZOID provides greater performance than does the stock infrastructure. We present a set of benchmark results that highlight the improvements. Crucially, the flexibility of our infrastructure is a vast improvement over the stock infrastructure, allowing users to forward data using custom-designed application interfaces, through an easy-to-use plug-in mechanism. This capability is used for real-time telescope data transfers, extensively discussed in the paper. Plug-in-specific threads implement prefetching of data obtained over sockets from an input cluster and merge results from individual compute nodes before sending them out, significantly reducing required network bandwidth. This approach allows a ZOID version of the application to handle a larger number of subbands per I/O node, or even to bypass the input cluster altogether, plugging the input from remote receiver stations directly into the I/O nodes. Using the resources more efficiently can result in considerable savings.
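The ZOID plug-in API itself is not reproduced in this listing; the sketch below only illustrates the general shape of a function-pointer-based forwarding plug-in interface of the kind the abstract describes. All names (fwd_plugin, echo_*) are hypothetical and are not the real ZeptoOS/ZOID symbols.

```c
/* Hypothetical sketch of a function-pointer-based I/O-forwarding plug-in
 * interface; names and signatures are illustrative, not the real ZOID API. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct fwd_plugin {
    const char *name;
    int  (*init)(void);                              /* called once on the I/O node      */
    int  (*handle)(const void *req, size_t len,      /* service one request forwarded    */
                   void *reply, size_t *reply_len);  /* from a compute node, fill reply  */
    void (*fini)(void);
};

/* Example plug-in that simply echoes the request back (assumes the reply
 * buffer is large enough to hold the request). */
static int echo_init(void) { puts("echo plug-in loaded"); return 0; }

static int echo_handle(const void *req, size_t len, void *reply, size_t *reply_len)
{
    memcpy(reply, req, len);
    *reply_len = len;
    return 0;
}

static void echo_fini(void) { puts("echo plug-in unloaded"); }

struct fwd_plugin echo_plugin = {
    .name = "echo", .init = echo_init, .handle = echo_handle, .fini = echo_fini,
};
```

The forwarding daemon would dispatch incoming compute-node requests to the `handle` callback of the registered plug-in; application-specific plug-ins (such as the telescope data path described above) would replace the echo logic with prefetching and result merging.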


International Conference on Cluster Computing | 2006

The Influence of Operating Systems on the Performance of Collective Operations at Extreme Scale

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan

We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
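The exact microbenchmark used in the paper is not shown in this listing; the following is a minimal sketch of the general fixed-work approach to measuring OS noise, where iterations that take noticeably longer than the minimum are attributed to interference. The sample count, work loop, and timer choice are assumptions.

```c
/* Minimal sketch of a fixed-work noise microbenchmark: time many iterations
 * of identical work; the spread between the fastest and slowest iteration
 * approximates the largest single interruption seen by the application. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    enum { SAMPLES = 100000, WORK = 10000 };
    static uint64_t dt[SAMPLES];
    volatile double x = 1.0;

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t t0 = now_ns();
        for (int k = 0; k < WORK; k++)          /* fixed compute quantum */
            x = x * 1.000001 + 0.000001;
        dt[i] = now_ns() - t0;
    }

    uint64_t min = dt[0], max = dt[0];
    for (int i = 1; i < SAMPLES; i++) {
        if (dt[i] < min) min = dt[i];
        if (dt[i] > max) max = dt[i];
    }
    printf("min %lu ns  max %lu ns  worst detour %lu ns\n",
           (unsigned long)min, (unsigned long)max, (unsigned long)(max - min));
    return 0;
}
```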


Cluster Computing | 2008

Benchmarking the effects of operating system interference on extreme-scale parallel machines

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan; Aroon Nataraj

We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
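The noise-injection mechanism used in these experiments is not reproduced here; as a rough sketch of the idea, a periodic timer signal whose handler busy-waits for a fixed duration can steal CPU time from an application at a controlled frequency and amplitude. The period and duration below are illustrative values, not those used in the paper.

```c
/* Sketch of artificial noise injection: a periodic SIGALRM whose handler
 * busy-waits for a fixed duration, perturbing whatever the process runs. */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define NOISE_DURATION_US 200L     /* each interruption steals ~200 us */
#define NOISE_PERIOD_US   10000L   /* one interruption every 10 ms     */

static unsigned long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* async-signal-safe on Linux */
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void noise_handler(int sig)
{
    (void)sig;
    unsigned long long end = now_ns() + NOISE_DURATION_US * 1000ULL;
    while (now_ns() < end)
        ;                                  /* burn CPU for the noise duration */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noise_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval it = {
        .it_interval = { .tv_sec = 0, .tv_usec = NOISE_PERIOD_US },
        .it_value    = { .tv_sec = 0, .tv_usec = NOISE_PERIOD_US },
    };
    setitimer(ITIMER_REAL, &it, NULL);

    /* The parallel application's work would run here, perturbed by the
     * injected noise; this placeholder just sleeps. */
    for (;;)
        pause();
}
```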


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Accelerating I/O Forwarding in IBM Blue Gene/P Systems

Venkatram Vishwanath; Mark Hereld; Kamil Iskra; Dries Kimpe; Vitali A. Morozov; Michael E. Papka; Robert B. Ross; Kazutomo Yoshii

Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. I/O forwarding is a paradigm that attempts to bridge the increasing performance and scalability gap between the compute and I/O components of leadership-class machines to meet the requirements of data-intensive applications by shipping I/O calls from compute nodes to dedicated I/O nodes. I/O forwarding is a critical component of the I/O subsystem of the IBM Blue Gene/P supercomputer currently deployed at several leadership computing facilities. In this paper, we evaluate the performance of the existing I/O forwarding mechanisms for BG/P and identify the performance bottlenecks in the current design. We augment the I/O forwarding with two approaches: I/O scheduling using a work-queue model and asynchronous data staging. We evaluate the efficacy of our approaches using microbenchmarks and application-level benchmarks on leadership-class systems.
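The paper's work-queue scheduler is not spelled out in this listing; as a generic illustration of the model, the sketch below decouples request arrival from request servicing with a bounded queue and a pool of worker threads. The structure, names, and request record are assumptions, not the actual BG/P I/O-forwarding code.

```c
/* Generic sketch of a work-queue model for servicing forwarded I/O requests:
 * producers enqueue requests, worker threads dequeue and service them. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_LEN 128

struct request { int client; size_t nbytes; };   /* placeholder request record */

static struct request queue[QUEUE_LEN];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void enqueue(struct request r)
{
    pthread_mutex_lock(&lock);
    while (count == QUEUE_LEN)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = r;
    tail = (tail + 1) % QUEUE_LEN;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        struct request r = queue[head];
        head = (head + 1) % QUEUE_LEN;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);

        /* Service the request here, e.g. issue the file-system call on the
         * I/O node on behalf of compute-node client r.client. */
        printf("worker servicing %zu bytes for client %d\n", r.nbytes, r.client);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    for (int c = 0; c < 16; c++)                 /* pretend 16 requests arrive */
        enqueue((struct request){ .client = c, .nbytes = 4096 });

    sleep(1);                                    /* let workers drain the queue */
    return 0;
}
```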


Operating Systems Review | 2006

Operating system issues for petascale systems

Peter H. Beckman; Kamil Iskra; Kazutomo Yoshii; Susan Coghlan

Petascale supercomputers will be available by 2008. The largest of these complex leadership-class machines will probably have nearly 250K CPUs. These massively parallel systems pose a number of challenging operating system issues. In this paper, we focus on the issues most important for the system that will first breach the petaflop barrier: synchronization and collective operations, parallel I/O, and fault tolerance.


International Conference on Parallel Processing | 2009

Characterizing the Performance of Big Memory on Blue Gene Linux

Kazutomo Yoshii; Kamil Iskra; Harish Gapanati Naik; Pete Beckman; P. Chris Broekema

Efficient use of Linux for high-performance applications on Blue Gene/P (BG/P) compute nodes is challenging because of severe performance hits resulting from translation lookaside buffer (TLB) misses and a hard-to-program torus network DMA controller. To address these difficulties, we present the design and implementation of “Big Memory”, an alternative, transparent memory space for computational processes. Big Memory uses extremely large memory pages available on PowerPC CPUs to create a TLB-miss-free, flat memory area that can be used for application code and data and is easier to use for DMA operations. One of our single-node memory benchmarks shows that the performance gap between regular PowerPC Linux with 4KB pages and the IBM BG/P compute node kernel (CNK) is about 68% in the worst case. Big Memory narrows the worst-case performance gap to just 0.04%. We verify this result on 1024 nodes of Blue Gene/P using the NAS Parallel Benchmarks and find the performance under Linux with Big Memory to fluctuate within 0.7% of CNK. Originally intended exclusively for compute node tasks, our new memory subsystem turns out to dramatically improve the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray (LOFAR) radio telescope as an example.
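Big Memory itself relies on PowerPC-specific very large pages and is not reproduced here; as a loosely related illustration on stock Linux, the sketch below backs a flat region with huge pages via MAP_HUGETLB, which reduces the number of TLB entries needed to cover the region. The 1 GiB size is an arbitrary assumption and requires huge pages to be configured on the system.

```c
/* Sketch of backing a flat memory region with huge pages to reduce TLB
 * misses, in the spirit of (but not identical to) Big Memory. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;   /* 1 GiB region, assuming huge pages are reserved */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* e.g. no huge pages configured */
        return 1;
    }

    /* The region can now hold application data; far fewer TLB entries are
     * needed to cover it than with 4 KB pages. */
    ((char *)buf)[0] = 1;

    munmap(buf, len);
    return 0;
}
```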


International Conference on Cluster Computing | 2012

Evaluating Power-Monitoring Capabilities on IBM Blue Gene/P and Blue Gene/Q

Kazutomo Yoshii; Kamil Iskra; Rinku Gupta; Peter H. Beckman; Venkatram Vishwanath; Chenjie Yu; Susan Coghlan

Power consumption is becoming a critical factor as we continue our quest toward exascale computing. Yet, actual power utilization of a complete system is an insufficiently studied research area. Estimating the power consumption of a large-scale system is a nontrivial task because a large number of components are involved and because power requirements are affected by the (unpredictable) workloads. Clearly needed is a power-monitoring infrastructure that can provide timely and accurate feedback to system developers and application writers so that they can optimize the use of this precious resource. Many existing large-scale installations do feature power-monitoring sensors; however, those are part of environmental and health monitoring subsystems and were not designed with application-level power consumption measurements in mind. In this paper, we evaluate the existing power monitoring of IBM Blue Gene systems, with the goal of understanding what capabilities are available and how they fare with respect to spatial and temporal resolution, accuracy, latency, and other characteristics. We find that with a careful choice of dedicated microbenchmarks, we can obtain meaningful power consumption data even on Blue Gene/P, where the interval between available data points is measured in minutes. We next evaluate the monitoring subsystem on Blue Gene/Q and are able to study the power characteristics of its FPU and memory subsystems. We find the monitoring subsystem capable of providing second-scale resolution of power data, conveniently separated between node components, with seven-second latency. This represents a significant improvement in power-monitoring infrastructure, and we hope future systems will enable real-time power measurement in order to better understand application behavior at a finer granularity.
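The Blue Gene environmental-monitoring interfaces evaluated in the paper are not shown here; the sketch below is only a generic illustration of periodic power sampling around a code region, with a placeholder sensor path (not a real device file) and a one-second interval that mirrors the second-scale resolution discussed above.

```c
/* Generic sketch of periodic power sampling; the sensor path is hypothetical. */
#include <stdio.h>
#include <unistd.h>

/* Placeholder file assumed to expose instantaneous node power in watts. */
#define POWER_SENSOR "/sys/devices/hypothetical/node_power_watts"

static double read_power(void)
{
    double watts = -1.0;
    FILE *f = fopen(POWER_SENSOR, "r");
    if (f) {
        if (fscanf(f, "%lf", &watts) != 1)
            watts = -1.0;
        fclose(f);
    }
    return watts;                      /* -1.0 means the sensor was unreadable */
}

int main(void)
{
    for (int i = 0; i < 10; i++) {     /* sample once per second for 10 s */
        printf("t=%2d s  power=%.1f W\n", i, read_power());
        sleep(1);
    }
    return 0;
}
```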


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Performance and Scalability Evaluation of 'Big Memory' on Blue Gene Linux

Kazutomo Yoshii; Kamil Iskra; Harish Gapanati Naik; Peter H. Beckman; P. Chris Broekema

We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of ‘Big Memory’, an alternative, transparent memory space introduced to eliminate these issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.
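The custom memory benchmarks themselves are not included in this listing; the sketch below shows the flavor of such a benchmark: strided touches over a large array, where large strides hit a new page on almost every access and thus expose TLB behavior, which is what Big Memory's large pages are meant to remove. Sizes and strides are arbitrary choices.

```c
/* Sketch of a strided-access memory microbenchmark: larger strides touch
 * more pages per byte accessed, so their per-access cost reflects TLB and
 * page-table overhead rather than raw memory bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    const size_t len = 256UL << 20;                /* 256 MiB working set */
    volatile char *buf = malloc(len);
    if (!buf) return 1;

    for (size_t stride = 64; stride <= 65536; stride *= 8) {
        uint64_t t0 = now_ns();
        for (size_t i = 0; i < len; i += stride)
            buf[i]++;                              /* one touch per stride */
        uint64_t dt = now_ns() - t0;
        printf("stride %6zu B: %8.1f ns per access\n",
               stride, (double)dt / (double)(len / stride));
    }
    free((void *)buf);
    return 0;
}
```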


IEEE International Conference on Cloud Engineering | 2015

A Container-Based Approach to OS Specialization for Exascale Computing

Judicael Zounmevo; Swann Perarnau; Kamil Iskra; Kazutomo Yoshii; Roberto Gioiosa; Brian Van Essen; Maya Gokhale; Edgar A. León

Future exascale systems will impose several conflicting challenges on the operating system (OS) running on the compute nodes of such machines. On the one hand, the targeted extreme scale requires the kind of high resource usage efficiency that is best provided by lightweight OSes. At the same time, substantial changes in hardware are expected for exascale systems. Compute nodes are expected to host a mix of general-purpose and special-purpose processors or accelerators tailored for serial, parallel, compute-intensive, or I/O-intensive workloads. Similarly, the deeper and more complex memory hierarchy will expose multiple coherence domains and NUMA nodes in addition to incorporating nonvolatile RAM. That expected workload and hardware heterogeneity and complexity is not compatible with the simplicity that characterizes high-performance lightweight kernels. In this work, we describe the Argo Exascale node OS, which is our approach to providing in a single kernel the required OS environments for the two aforementioned conflicting goals. We resort to multiple OS specializations on top of a single Linux kernel coupled with multiple containers.
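The Argo container mechanism itself is not reproduced here; the sketch below only illustrates the underlying partitioning idea on plain Linux by confining a process (and its children) to a dedicated set of cores, much as a cpuset-backed container would. The core layout is a hypothetical example.

```c
/* Sketch of node-level partitioning: restrict a process tree to a dedicated
 * set of cores, as a container runtime built on cgroups/cpusets would. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Reserve cores 2..15 for the "compute" partition; cores 0..1 are left
     * for OS services in this hypothetical layout. */
    for (int cpu = 2; cpu < 16; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d confined to the compute partition\n", (int)getpid());
    /* exec or fork the HPC application here; children inherit the mask */
    return 0;
}
```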


International Parallel and Distributed Processing Symposium | 2016

Systemwide Power Management with Argo

Daniel A. Ellsworth; Tapasya Patki; Swann Perarnau; Sangmin Seo; Abdelhalim Amer; Judicael Zounmevo; Rinku Gupta; Kazutomo Yoshii; Henry Hoffmann; Allen D. Malony; Martin Schulz; Pete Beckman

The Argo project is a DOE initiative for designing a modular operating system/runtime for the next generation of supercomputers. A key focus area in this project is power management, which is one of the main challenges on the path to exascale. In this paper, we discuss ideas for systemwide power management in the Argo project. We present a hierarchical and scalable approach to maintain a power bound at scale, and we highlight some early results.
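The paper's hierarchical scheme is only summarized above; as a toy illustration of the idea, the sketch below splits a power budget among children in proportion to their reported demand, a policy each level of the hierarchy could apply to its own subtree. The proportional policy, wattages, and fan-out are assumptions, not the Argo implementation.

```c
/* Toy sketch of a hierarchical power bound: a parent divides its budget
 * among children proportionally to their reported demand; each child would
 * recursively apply the same policy and enforce its leaf allocation with a
 * node-level power cap. */
#include <stdio.h>

#define NCHILD 4

static void distribute(double budget_w, const double demand_w[NCHILD],
                       double alloc_w[NCHILD])
{
    double total = 0.0;
    for (int i = 0; i < NCHILD; i++)
        total += demand_w[i];

    for (int i = 0; i < NCHILD; i++) {
        double share = total > 0.0 ? demand_w[i] / total : 1.0 / NCHILD;
        alloc_w[i] = budget_w * share;    /* never exceeds the parent's bound */
    }
}

int main(void)
{
    double demand[NCHILD] = { 180.0, 250.0, 90.0, 230.0 };  /* watts requested */
    double alloc[NCHILD];

    distribute(600.0, demand, alloc);     /* enclosure-level bound of 600 W */
    for (int i = 0; i < NCHILD; i++)
        printf("child %d: demand %.0f W -> allocation %.1f W\n",
               i, demand[i], alloc[i]);
    return 0;
}
```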

Collaboration


Dive into Kazutomo Yoshii's collaborations.

Top Co-Authors

Kamil Iskra (Argonne National Laboratory)
Hal Finkel (Argonne National Laboratory)
Peter H. Beckman (Argonne National Laboratory)
Franck Cappello (Argonne National Laboratory)
Zheming Jin (Argonne National Laboratory)
Pete Beckman (Argonne National Laboratory)
Gokhan Memik (Northwestern University)
Susan Coghlan (Argonne National Laboratory)