Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hiroko Midorikawa is active.

Publication


Featured research published by Hiroko Midorikawa.


international conference on cluster computing | 2008

DLM: A distributed Large Memory System using remote memory swapping over cluster nodes

Hiroko Midorikawa; Motoyoshi Kurokawa; Ryutaro Himeno; Mitsuhisa Sato

Emerging 64-bit OSs supply a huge amount of memory address space that is essential for new applications using very large data. It is expected that the memory in connected nodes can be used to store swapped pages efficiently, especially in a dedicated cluster with a high-speed network such as 10 GbE or InfiniBand. In this paper, we propose the distributed large memory system (DLM), which provides very large virtual memory by using remote memory distributed over the nodes in a cluster. The performance of DLM programs using remote memory is compared to ordinary programs using local memory. The results of the STREAM, NPB, and Himeno benchmarks show that the DLM achieves better performance than other remote paging schemes that use a block swap device to access remote memory. In addition to performance, the DLM offers the advantages of easy availability and high portability, because it is user-level software that needs no special hardware. To obtain high performance, the DLM can tune its parameters independently of kernel swap parameters. We also found that the DLM's independence from kernel swapping provides more stable behavior.
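
To make the user-level approach concrete, here is a minimal sketch of demand paging done entirely in user space, assuming a simplified setting: a SIGSEGV handler pages data in on first touch, and a local buffer stands in for the remote memory that DLM would fetch over the cluster network. All names are illustrative and not taken from the DLM source.

/* Minimal user-level demand-paging sketch (illustrative only).
 * A PROT_NONE region stands in for DLM's large virtual memory; the
 * "remote" store is simulated by a local buffer instead of pages
 * fetched from other cluster nodes over the network. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 256
static long page_size;
static char *region;        /* large "virtual memory" region       */
static char *remote_store;  /* stand-in for memory on remote nodes */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    size_t off = (size_t)(addr - region) / page_size * page_size;

    /* "Swap in": make the page accessible and copy it from the remote store. */
    mprotect(region + off, page_size, PROT_READ | PROT_WRITE);
    memcpy(region + off, remote_store + off, page_size);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)REGION_PAGES * page_size;

    remote_store = malloc(len);
    memset(remote_store, 42, len);

    /* Reserve address space with no access rights; touching it faults. */
    region = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    printf("first byte of page 10: %d\n", region[10 * page_size]);
    return 0;
}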


high performance computing systems and applications | 2014

An evaluation of the potential of flash SSD as large and slow memory for stencil computations

Hiroko Midorikawa; Hideyuki Tan; Toshio Endo

This paper investigates the potential of flash as a large and slow memory behind dynamic random-access memory (DRAM) for stencil computation, one of the most common and important computation kernels in scientific and engineering simulations. We evaluate the performance of the fastswap kernel, recently incorporated into Linux, for stencil computation using flash as a swap device. Moreover, we propose a locality-aware, hierarchical out-of-core computation algorithm that employs data-structure blocking techniques in stencil computations to bridge the DRAM-flash latency divide. We find that 7-point and 27-point stencil computations for a 1-TiB problem size (32 times the DRAM capacity), using only 32 GiB of DRAM and a flash solid-state drive (SSD), attain 24% and 47%, respectively, of the Mflops performance achieved when executing with DRAM alone.
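
The locality-aware, hierarchical out-of-core idea can be illustrated with a much-reduced sketch: a 1-D 3-point stencil whose array lives in a file (standing in for the flash SSD) and is updated block by block with one-cell halos, so each element moves between "flash" and DRAM only a constant number of times per sweep. The geometry and file names are assumptions for illustration.

/* Out-of-core blocking sketch (illustrative): one Jacobi sweep of a
 * 1-D 3-point stencil over an array kept in a file. Each DRAM-sized
 * block is read with its halo, updated, and written to a second file. */
#include <stdio.h>
#include <stdlib.h>

#define N      (1L << 20)   /* total problem size (elements)              */
#define BLOCK  (1L << 16)   /* elements processed per block (DRAM budget) */

static void sweep(FILE *fin, FILE *fout)
{
    double *in  = malloc((BLOCK + 2) * sizeof(double)); /* block + halos */
    double *out = malloc(BLOCK * sizeof(double));

    for (long start = 0; start < N; start += BLOCK) {
        long end = start + BLOCK < N ? start + BLOCK : N;
        long lo  = start > 0 ? start - 1 : 0;   /* left halo  */
        long hi  = end < N ? end + 1 : N;       /* right halo */

        fseek(fin, lo * sizeof(double), SEEK_SET);
        fread(in, sizeof(double), hi - lo, fin);

        for (long i = start; i < end; i++) {
            if (i == 0 || i == N - 1)           /* keep boundary values */
                out[i - start] = in[i - lo];
            else
                out[i - start] = (in[i - 1 - lo] + in[i - lo] + in[i + 1 - lo]) / 3.0;
        }
        fseek(fout, start * sizeof(double), SEEK_SET);
        fwrite(out, sizeof(double), end - start, fout);
    }
    free(in); free(out);
}

int main(void)
{
    FILE *fa = fopen("grid_a.bin", "w+b");
    FILE *fb = fopen("grid_b.bin", "w+b");

    /* Initialize the "flash-resident" array: a spike in the middle. */
    for (long i = 0; i < N; i++) {
        double v = (i == N / 2) ? 1.0 : 0.0;
        fwrite(&v, sizeof(double), 1, fa);
    }

    sweep(fa, fb);            /* one time step: grid_a -> grid_b */
    fclose(fa); fclose(fb);
    return 0;
}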


international conference on cluster computing | 2009

Using a cluster as a memory resource: A fast and large virtual memory on MPI

Hiroko Midorikawa; Kazuhiro Saito; Mitsuhisa Sato; Taisuke Boku

The 64-bit OS provides ample memory address space, which is beneficial for applications using a large amount of data. This paper proposes using a cluster as a memory resource for sequential applications requiring a large amount of memory. The system is an extension of our previously proposed socket-based Distributed Large Memory System (DLM), which offers large virtual memory by using remote memory distributed over nodes in a cluster. The newly designed DLM is based on MPI (Message Passing Interface) for higher portability. MPI-based DLM provides fast and large virtual memory on widely available open clusters managed with an MPI batch queuing system. To access remote memory, we use swap protocols suited to the available MPI thread-support levels. In experiments, we confirmed that the system achieves 493 MB/s and 613 MB/s of remote memory bandwidth with the STREAM benchmark on 2.5 GB/s and 5 GB/s links (Myri-10G x2, x4) and high application performance with the NPB and Himeno benchmarks. Additionally, this system enables users unfamiliar with parallel programming to use a cluster.
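
A conceptual sketch of the memory-server arrangement, assuming a simplified protocol: rank 0 runs the application and swaps pages to and from server ranks, which hold slices of memory and answer fetch/store requests. The message tags and protocol below are invented for illustration and are not DLM's actual swap protocol.

/* Conceptual MPI sketch of a remote-memory server (needs >= 2 ranks).
 * Rank 0 is the application node; ranks > 0 each hold a slice of pages. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES    4096
#define PAGES_PER_SRV 1024
enum { TAG_FETCH = 1, TAG_STORE = 2, TAG_DATA = 3, TAG_QUIT = 4 };

static void memory_server(void)
{
    char (*store)[PAGE_BYTES] = calloc(PAGES_PER_SRV, PAGE_BYTES);
    for (;;) {
        long page; MPI_Status st;
        MPI_Recv(&page, 1, MPI_LONG, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_QUIT) break;
        if (st.MPI_TAG == TAG_FETCH)          /* send the requested page back */
            MPI_Send(store[page], PAGE_BYTES, MPI_BYTE, 0, TAG_DATA, MPI_COMM_WORLD);
        else                                   /* TAG_STORE: receive evicted page */
            MPI_Recv(store[page], PAGE_BYTES, MPI_BYTE, 0, TAG_DATA,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    free(store);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        memory_server();
    } else {
        char page[PAGE_BYTES];
        long pageno = 7;
        memset(page, 1, PAGE_BYTES);
        /* Swap out page 7 to server rank 1, then fetch it back. */
        MPI_Send(&pageno, 1, MPI_LONG, 1, TAG_STORE, MPI_COMM_WORLD);
        MPI_Send(page, PAGE_BYTES, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD);
        MPI_Send(&pageno, 1, MPI_LONG, 1, TAG_FETCH, MPI_COMM_WORLD);
        MPI_Recv(page, PAGE_BYTES, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int s = 1; s < size; s++)         /* shut the servers down */
            MPI_Send(&pageno, 1, MPI_LONG, s, TAG_QUIT, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}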


ieee/acm international symposium cluster, cloud and grid computing | 2015

Locality-Aware Stencil Computations Using Flash SSDs as Main Memory Extension

Hiroko Midorikawa; Hideyuki Tan

This paper investigates the performance of flash solid-state drives (SSDs) as an extension to main memory, using a locality-aware algorithm for stencil computations. We propose three different configurations for accessing the flash media, swap, mmap, and aio, combined with data-structure blocking techniques. Our results indicate that hierarchical blocking optimizations for the three tiers, flash SSD, DRAM, and cache, perform satisfactorily in bridging the DRAM-flash latency divide. Using only 32 GiB of DRAM and a flash SSD, 7-point stencil computations for a 512-GiB problem (16 times the DRAM capacity) attained 87% of the Mflops performance achieved with DRAM alone.
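
Of the three access paths, the asynchronous-I/O one can be sketched with POSIX AIO: prefetch the next block from the SSD-resident file while computing on the current one. This is a hedged illustration; the paper's aio implementation may use a different asynchronous I/O interface, and the file name and block size are assumptions.

/* Sketch of the "aio" access path: overlap computation on the current
 * block with a POSIX AIO prefetch of the next block from a file on the
 * flash SSD. Link with -lrt on older glibc. Illustrative only. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_BYTES (1 << 22)   /* 4 MiB per block */

int main(void)
{
    int fd = open("flash_data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t fsize = lseek(fd, 0, SEEK_END);
    long nblocks = (fsize + BLOCK_BYTES - 1) / BLOCK_BYTES;

    char *cur  = malloc(BLOCK_BYTES);
    char *next = malloc(BLOCK_BYTES);

    pread(fd, cur, BLOCK_BYTES, 0);    /* read block 0 synchronously */

    for (long b = 0; b < nblocks; b++) {
        struct aiocb cb = {0};
        const struct aiocb *const list[1] = { &cb };
        int inflight = 0;

        if (b + 1 < nblocks) {         /* start prefetch of block b+1 */
            cb.aio_fildes = fd;
            cb.aio_buf    = next;
            cb.aio_nbytes = BLOCK_BYTES;
            cb.aio_offset = (b + 1) * (off_t)BLOCK_BYTES;
            aio_read(&cb);
            inflight = 1;
        }

        /* ... compute on 'cur' here (e.g., one stencil block update) ... */

        if (inflight) {                /* wait for the prefetch, swap buffers */
            aio_suspend(list, 1, NULL);
            char *t = cur; cur = next; next = t;
        }
    }
    free(cur); free(next); close(fd);
    return 0;
}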


pacific rim conference on communications, computers and signal processing | 2001

The design and implementation of user-level software distributed shared memory system: SMS - implicit binding entry consistency model

Hiroko Midorikawa; Yusuke Ohashi; Hajime Iizuka

SMS is a user-level software distributed shared memory system. It provides a virtual shared memory environment on a cluster of computers connected by a communication network. Although SMS requires only commodity hardware and software, it enables users to write parallel programs under a shared memory programming model.


international parallel and distributed processing symposium | 2016

Blk-Tune: Blocking Parameter Auto-Tuning to Minimize Input-Output Traffic for Flash-Based Out-of-Core Stencil Computations

Hiroko Midorikawa

This paper proposes Blk-Tune, a runtime blocking-parameter auto-tuning system designed for flash-based out-of-core stencil computations, which enables flash memory to be used as an extension of main memory. It retrieves hardware information automatically using Portable Hardware Locality (hwloc) and minimizes the amount of data transferred between the flash device and DRAM, the dominant factor in the performance of out-of-core algorithms using flash. The use of explicit, highly parallel asynchronous I/O to the flash device together with this auto-tuning offers great advantages over the mmap method, in which a flash file is memory-mapped. Blk-Tune allows users to easily achieve maximum performance for large-scale stencil computations across different hardware and application settings.
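
The kind of calculation such a tuner performs can be sketched under a deliberately simplified cost model: detect the DRAM budget, pick the largest z-block of the grid that fits, and report the resulting flash-DRAM traffic per sweep. Blk-Tune's real model is more elaborate and detects hardware via hwloc; sysconf is used here only to keep the sketch self-contained, and the grid dimensions are invented.

/* Illustrative blocking-parameter selection in the spirit of Blk-Tune:
 * larger blocks mean fewer halo re-reads and less flash<->DRAM traffic,
 * so choose the largest z-block that fits in the DRAM budget. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long nx = 1024, ny = 1024, nz = 8192;       /* out-of-core grid (assumed) */
    double elem = sizeof(double);

    /* DRAM budget: half of physical memory (the real system uses hwloc). */
    long pages = sysconf(_SC_PHYS_PAGES), psz = sysconf(_SC_PAGE_SIZE);
    double budget = 0.5 * (double)pages * psz;

    /* Two buffers (in + out) per block, minus room for halo planes. */
    double plane = nx * ny * elem;
    long bz = (long)(budget / (2.0 * plane)) - 2;   /* block depth in planes */
    if (bz > nz) bz = nz;
    if (bz < 1)  bz = 1;

    long nblocks = (nz + bz - 1) / bz;
    /* Traffic per sweep: every plane read and written once, plus two halo
       planes re-read at each interior block boundary (simplified model). */
    double traffic = (2.0 * nz + 2.0 * (nblocks - 1)) * plane;

    printf("block depth bz = %ld planes, %ld blocks/sweep, %.1f GiB flash traffic/sweep\n",
           bz, nblocks, traffic / (1024.0 * 1024 * 1024));
    return 0;
}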


international conference on parallel and distributed systems | 2016

Evaluation of Flash-Based Out-of-Core Stencil Computation Algorithms for SSD-Equipped Clusters

Hiroko Midorikawa; Hideyuki Tan

This paper proposes a new scheme for large-scale stencil computations whose data size exceeds the total main memory of the nodes in a cluster. It uses flash SSDs distributed over the cluster nodes as an extension to main memory, with a locality-aware algorithm. Three algorithms with different hierarchical blocking schemes for the three memory tiers, namely flash SSD, DRAM, and cache, are proposed and evaluated on different platforms and flash devices. They exploit not only highly parallel asynchronous input/output to the flash SSDs, but also appropriate blocking parameters obtained with an auto-tuning system named Blk-Tune. They also overcome the performance degradation caused by non-uniform memory access (NUMA). The optimized single-node algorithms are extended to multiple nodes and evaluated in a cluster with traditional SATA SSDs as well as with state-of-the-art flash devices, such as low-power, cost-effective M.2 NVMe flash SSDs. With this scheme and distributed flash devices in a cluster, large-scale stencil problems can be solved with a limited number of nodes and a moderate amount of main memory.
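
On the NUMA point, one common mitigation is to place the DRAM staging buffer and the I/O threads on the NUMA node closest to the flash device; the libnuma sketch below shows this placement. Whether this matches the paper's exact remedy, and which node hosts the SSD, are assumptions.

/* Sketch of NUMA-aware buffer placement with libnuma (compile with -lnuma).
 * Allocating the staging buffer on the node nearest the NVMe device and
 * running the I/O thread there avoids cross-socket traffic; this is one
 * common remedy, not necessarily the paper's. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int io_node = 0;                  /* assumed: the node hosting the SSD */
    size_t bytes = 1UL << 30;         /* 1 GiB staging buffer              */

    /* Place the buffer's pages on the chosen node and run there. */
    void *buf = numa_alloc_onnode(bytes, io_node);
    numa_run_on_node(io_node);

    /* ... issue flash I/O and block updates using 'buf' here ... */

    numa_free(buf, bytes);
    return 0;
}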


ieee/acm international symposium cluster, cloud and grid computing | 2013

User-Level Remote Memory Paging for Multithreaded Applications

Hiroko Midorikawa; Yuichiro Suzuki; Masatoshi Iwaida

A new page-swap mechanism is introduced to resolve the inconsistent-page problem for multithreaded applications in user-level remote paging systems. Our evaluations show that its overhead is limited and that it is practical for real multithreaded applications.
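
The abstract does not detail the mechanism, but the race it targets can be illustrated with one standard fix: a per-page lock and state flag so that only one faulting thread fetches a page while the others wait. The sketch below shows that general technique; it is an assumption, not the paper's implementation.

/* Per-page synchronization sketch for a multithreaded user-level pager:
 * the first thread to fault fetches the page; concurrent faulting threads
 * wait until it is resident, so no thread sees a half-filled page. */
#include <pthread.h>

enum page_state { REMOTE, FETCHING, RESIDENT };

struct page_entry {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    enum page_state state;
};

/* Called from the fault path; fetch_from_remote() is a placeholder. */
void ensure_resident(struct page_entry *p, void (*fetch_from_remote)(void))
{
    pthread_mutex_lock(&p->lock);
    if (p->state == REMOTE) {
        p->state = FETCHING;
        pthread_mutex_unlock(&p->lock);
        fetch_from_remote();            /* network transfer, no lock held */
        pthread_mutex_lock(&p->lock);
        p->state = RESIDENT;
        pthread_cond_broadcast(&p->ready);
    }
    while (p->state != RESIDENT)        /* another thread is fetching */
        pthread_cond_wait(&p->ready, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

static void dummy_fetch(void) { /* stands in for the network transfer */ }

int main(void)
{
    struct page_entry p = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, REMOTE };
    ensure_resident(&p, dummy_fetch);
    return 0;
}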


cluster computing and the grid | 2012

Automatic Adaptive Page-Size Control for Remote Memory Paging

Hiroko Midorikawa; Joe Uchiyama

An automatic, adaptive page-size control methodology is proposed for remote memory paging. It estimates the working data set and dynamically adapts the page size to each processing phase of an application while it is running. This is highly effective at preventing memory-server thrashing when local memory is limited.
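
The control loop can be sketched with invented thresholds: treat the recent swap-in rate as a rough working-set signal and grow or shrink the transfer page size between limits. The paper's estimator is more sophisticated; everything below is illustrative.

/* Illustrative adaptive page-size control: a high swap-in rate suggests
 * thrashing (shrink the transfer unit for finer reuse), a low rate suggests
 * streaming access (grow it to amortize round trips). */
#include <stdio.h>

#define MIN_PAGE (64 * 1024)        /* 64 KiB */
#define MAX_PAGE (8 * 1024 * 1024)  /* 8 MiB  */

static long page_size = 1024 * 1024;   /* current transfer unit */

/* Called periodically with the swap-ins observed in the last interval. */
void adapt_page_size(long swapins_per_sec)
{
    if (swapins_per_sec > 2000 && page_size > MIN_PAGE)
        page_size /= 2;      /* thrashing: smaller pages           */
    else if (swapins_per_sec < 100 && page_size < MAX_PAGE)
        page_size *= 2;      /* sequential phase: larger pages     */
}

int main(void)
{
    long trace[] = { 50, 80, 3000, 2500, 120, 40 };   /* synthetic samples */
    for (int i = 0; i < 6; i++) {
        adapt_page_size(trace[i]);
        printf("interval %d: page size = %ld KiB\n", i, page_size / 1024);
    }
    return 0;
}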


pacific rim conference on communications, computers and signal processing | 2009

Page replacement algorithm using swap-in history for remote memory paging

Kazuhiro Saito; Hiroko Midorikawa; Munenori Kai

The Distributed Large Memory system (DLM) was designed to provide memory larger than local physical memory by using remote memory distributed over cluster nodes. The original DLM adopted a low-cost page replacement algorithm that selects the evicted page in address order. In the DLM, remote page swapping is the most performance-critical operation. For more efficient swap-out page selection, we propose a new page replacement algorithm that pays attention to swap-in history. LRU and other algorithms that rely on memory-access history impose considerable overhead on user-level software, which must record every memory access, whereas using swap-in history costs very little. According to our performance evaluation, the new algorithm reduces the number of remote swaps by up to 32% and achieves 2.7 times higher performance in a real application, Cluster3.0. In this paper, we describe the design of the new page replacement algorithm and evaluate its performance on several applications, including NPB and the Himeno benchmark.
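
The contrast between the two policies can be sketched directly: the original address-order victim selection versus evicting the resident page whose last swap-in is oldest, which needs only a per-page swap-in timestamp. The data layout and details below are illustrative, not the paper's exact algorithm.

/* Victim selection using swap-in history: each resident page records only
 * the (logical) time it was last swapped in, which is cheap for user-level
 * software to maintain, and the page with the oldest swap-in is evicted. */
#include <stdio.h>

#define NPAGES 8

struct frame {
    int  page_no;      /* which virtual page occupies this frame */
    long swapin_time;  /* logical time of its last swap-in        */
    int  resident;
};

/* Address-order selection (the original DLM policy): lowest page number. */
int victim_by_address(struct frame *f, int n)
{
    int v = -1;
    for (int i = 0; i < n; i++)
        if (f[i].resident && (v < 0 || f[i].page_no < f[v].page_no))
            v = i;
    return v;
}

/* Swap-in-history selection: the oldest swap-in is evicted first. */
int victim_by_swapin(struct frame *f, int n)
{
    int v = -1;
    for (int i = 0; i < n; i++)
        if (f[i].resident && (v < 0 || f[i].swapin_time < f[v].swapin_time))
            v = i;
    return v;
}

int main(void)
{
    struct frame f[NPAGES] = {
        { 3, 40, 1 }, { 7, 10, 1 }, { 1, 95, 1 }, { 5, 60, 1 },
    };
    printf("address-order victim: page %d\n", f[victim_by_address(f, 4)].page_no);
    printf("swap-in-history victim: page %d\n", f[victim_by_swapin(f, 4)].page_no);
    return 0;
}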

Collaboration


Dive into Hiroko Midorikawa's collaborations.
