Satoshi Matsushita | Researchain

Publication

Featured research published by Satoshi Matsushita.

International Symposium on Microarchitecture | 2005

Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities

Taku Ohsawa; Masamichi Takagi; Shoji Kawahara; Satoshi Matsushita

We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limits of parallelism and from the overhead incurred by parallelization, it is better to extract coarse-grain parallelism. Coarse-grain parallelism is biased in some programs (mainly numerical ones) and some program portions. Therefore, exploiting both coarse- and fine-grain parallelism is key to the performance of speculative multithreading. The features of Pinot are as follows: (1) A parallelizing tool extracts parallelism at any level of granularity (e.g. even ten thousand instructions) from any program sub-structures (e.g. loops, calls, or basic blocks). The tool uses a formulation in which the parallelization process is reduced to a combinatorial optimization problem. (2) A parallel execution model with an extension of thread control instructions is designed to minimize the increase in the dynamic instruction count. The model employs implicit thread termination and cancellation, as well as register value transfer without synchronization. (3) A versioning cache called the version resolution cache (VRC) accomplishes both coarse- and fine-grained speculative multithreading. The VRC operates as a large buffer for coarse-grained multi-threading. In addition, it provides low-latency inter-thread communication with an update-based protocol for fine-grained multi-threading. We performed cycle-accurate simulations with 38 programs from the SPEC and MiBench benchmarks. The speedup with a 4-processor-element Pinot is up to 3.7 times, and 2.2 times in geometric mean, against a conventional processor. The speedup in one program (susan) drops from 3.7 to 1.6 when the speculative buffer size is limited to 256 bytes. This confirms that exploiting coarse-grain parallelism is essential to the improved performance. An FPGA implementation shows a 32% area overhead and a 12% increase in critical path delay compared to a conventional processor.
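
The abstract above summarizes its results as a geometric mean of per-program speedups. As a minimal sketch of how such a figure is computed, the following function takes a list of speedup ratios and returns their geometric mean. The function name and any input values are illustrative assumptions; the paper's per-program data is not reproduced here.

```python
import math
from typing import Sequence


def geometric_mean(speedups: Sequence[float]) -> float:
    """Return the geometric mean of positive speedup ratios.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not speedups:
        raise ValueError("need at least one speedup value")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Averaging in log space avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean is the usual choice for averaging ratios because a program that runs twice as fast and one that runs half as fast cancel out to 1.0, which an arithmetic mean would not show.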

Symposium on Operating Systems Principles | 2015

Implementing linearizability at large scale and low latency

Collin Lee; Seo Jin Park; Ankita Kejriwal; Satoshi Matsushita; John K. Ousterhout

Linearizability is the strongest form of consistency for concurrent systems, but most large-scale storage systems settle for weaker forms of consistency. RIFL provides a general-purpose mechanism for converting at-least-once RPC semantics to exactly-once semantics, thereby making it easy to turn non-linearizable operations into linearizable ones. RIFL is designed for large-scale systems and is lightweight enough to be used in low-latency environments. RIFL handles data migration by associating linearizability metadata with objects in the underlying store and migrating metadata with the corresponding objects. It uses a lease mechanism to implement garbage collection for metadata. We have implemented RIFL in the RAMCloud storage system and used it to make basic operations such as writes and atomic increments linearizable; RIFL adds only 530 ns to the 13.5 μs base latency for durable writes. We also used RIFL to construct a new multi-object transaction mechanism in RAMCloud; RIFL's facilities significantly simplified the transaction implementation. The transaction mechanism can commit simple distributed transactions in about 20 μs and it outperforms the H-Store main-memory database system for the TPC-C benchmark.
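
The core idea in the abstract above, turning at-least-once RPCs into exactly-once ones, can be illustrated with a toy in-memory server that records the result of each completed RPC under a unique identifier and replays that result when a retry arrives. This is only a sketch under stated assumptions: the `Server` class, the `(client_id, seq)` identifier, and the `increment` helper are hypothetical names, not RAMCloud's or RIFL's API, and durability, data migration, and lease-based garbage collection are omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

RpcId = Tuple[int, int]  # (client_id, sequence_number); assumed identifier shape


@dataclass
class Server:
    """Toy server that remembers the result of every completed RPC."""
    store: Dict[str, int] = field(default_factory=dict)
    completions: Dict[RpcId, Any] = field(default_factory=dict)

    def handle(self, client_id: int, seq: int,
               op: Callable[[Dict[str, int]], Any]) -> Any:
        rpc_id = (client_id, seq)
        # A retry of an already-completed RPC returns the saved result
        # instead of re-executing the operation.
        if rpc_id in self.completions:
            return self.completions[rpc_id]
        result = op(self.store)
        # A real system would need to persist the result atomically with
        # the update; here both simply live in memory.
        self.completions[rpc_id] = result
        return result


def increment(key: str) -> Callable[[Dict[str, int]], int]:
    """Build a non-idempotent operation: add 1 to store[key], return new value."""
    def op(store: Dict[str, int]) -> int:
        store[key] = store.get(key, 0) + 1
        return store[key]
    return op


if __name__ == "__main__":
    server = Server()
    first = server.handle(7, 1, increment("x"))
    retry = server.handle(7, 1, increment("x"))  # client resends after a timeout
    assert first == retry
    assert server.store["x"] == 1  # the increment was applied exactly once
```

Without the completion table, the retried increment would be applied twice. Keeping the table small over time is the part the abstract assigns to leases, which this sketch does not model.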

International Symposium on Microarchitecture | 2000

A single-chip multiprocessor for smart terminals

Masato Edahiro; Satoshi Matsushita; Masakazu Yamashina; Naoki Nishi

Merlot, the first MP98 architecture prototype, promises 1-GIPS performance at 1 watt for 1.3-V operations in support of smart 21st-century information terminals.

Parallel Computing | 1992

Plasma simulator METIS for tokamak confinement and heating studies

Tatsuoki Takeda; Keiji Tani; Toshihide Tsunematsu; Yasuaki Kishimoto; G. Kurita; Satoshi Matsushita; Toshiyuki Nakata

To fill out the theoretical database necessary for the fusion reactor development program, a plasma simulator, METIS, was designed and a prototype plasma simulator, ProtoMETIS, was constructed. METIS is planned on the basis of a MIMD-type parallel computer composed of 250 processor elements with distributed memories and is optimized for analyses of the nonlinear MHD behavior of a plasma and the loss of alpha particles due to magnetic field ripples in a tokamak. Using ProtoMETIS, the performance of the METIS architecture was investigated for the above problems and satisfactory results were attained. It was also confirmed that a simulation of a free electron laser used for plasma heating and an MHD equilibrium computation of a tokamak plasma were carried out efficiently on the plasma simulator.

International Symposium on Systems Synthesis | 2002

Design experience of a chip multiprocessor Merlot and expectation to functional verification

Satoshi Matsushita

We have fabricated a chip multiprocessor prototype code-named Merlot to prove our novel speculative multithreading architecture. On Merlot, multiple threads provide a wider issue window than ordinary instruction-level parallel (ILP) processors such as superscalar or VLIW. With the architecture, we estimate a 3.0 times speedup with four processing elements (PEs) against a single PE on speech recognition code and IDCT code. Merlot integrates on-chip devices, a PCI interface, and SDRAM interfaces. We encountered design issues in both chip multiprocessor and SoC design. We have successfully run a parallelized mpeg3 decoder on the first silicon with several software workarounds, thanks to a functional verification environment that includes system modeling at RTL. However, bugs found in later stages of the design required more manpower or delayed the project. In this paper, we also discuss a methodology to improve functional verification coverage, and look to formal approaches for a solution.

Archive | 2002