Publication


Featured research published by Shimpei Sato.


Parallel and Distributed Computing: Applications and Technologies | 2009

A Study of an Infrastructure for Research and Development of Many-Core Processors

Koh Uehara; Shimpei Sato; Takefumi Miyoshi; Kenji Kise

Many-core processors with thousands of cores on a chip will soon be realized. We developed an infrastructure that accelerates the research and development of such many-core processors. This paper describes the three main elements provided by our infrastructure. The first is the definition of a simple many-core processor architecture called M-Core. The second is SimMc, a software simulator of M-Core. The third is the software library MClib, which helps with the development of application programs for M-Core. The simulation speed of SimMc and the parallelization efficiency of M-Core are evaluated using several benchmark programs. We show that our infrastructure accelerates the research and development of many-core processors.
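The evaluation above measures parallelization efficiency with a cycle-driven simulator. A minimal sketch of such a simulator loop is below; the structure is hypothetical (the real SimMc interface is not shown in this abstract), and it only illustrates why total simulated time is bounded by the slowest core.

```python
# Minimal sketch of a cycle-driven many-core simulator loop, in the spirit
# of SimMc (hypothetical structure; not the actual SimMc implementation).

class Core:
    """One core that executes a fixed per-cycle workload counter."""
    def __init__(self, work):
        self.remaining = work   # instructions left to execute
        self.done_cycle = None  # cycle at which this core finished

    def step(self, cycle):
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.done_cycle = cycle

def simulate(workloads):
    """Advance all cores one cycle at a time until every core finishes."""
    cores = [Core(w) for w in workloads]
    cycle = 0
    while any(c.remaining > 0 for c in cores):
        cycle += 1
        for c in cores:
            c.step(cycle)
    return cycle  # total simulated cycles = longest workload

# Four cores with unbalanced work: runtime is bounded by the slowest core,
# which is what parallelization-efficiency measurements capture.
print(simulate([100, 80, 60, 40]))  # 100
```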


Symposium on VLSI Circuits | 2017

BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS

Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe; Tetsuya Asai; Shinya Takamaeda-Yamazaki; Tadahiro Kuroda; Masato Motomura

A versatile reconfigurable accelerator for binary/ternary deep neural networks (DNNs) is presented. It features a massively parallel in-memory processing architecture and stores a variety of binary/ternary DNNs with a maximum of 13 layers, 4.2 K neurons, and 0.8 M synapses on chip. The 0.6 W, 1.4 TOPS chip achieves performance and energy efficiency that are 10–10<sup>2</sup> and 10<sup>2</sup>–10<sup>4</sup> times better than a CPU/GPU/FPGA, respectively.
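The efficiency of binary-weight accelerators comes from replacing multiply-accumulate with bitwise operations. The sketch below illustrates that arithmetic in Python; it is a conceptual illustration only, not the chip's in-memory circuit.

```python
# Sketch of the binary arithmetic that binary/ternary DNN accelerators
# exploit: with activations and weights in {-1, +1} encoded as bits
# (1 -> +1, 0 -> -1), a dot product reduces to XNOR plus popcount.

def pack(vec):
    """Pack a {-1, +1} vector into an integer, bit i = element i."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed into ints."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # 1 where signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n  # (+1 per match) + (-1 per mismatch)

def reference_dot(a, w):
    return sum(x * y for x, y in zip(a, w))

a = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, -1, -1]
assert binary_dot(pack(a), pack(w), len(a)) == reference_dot(a, w)
```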


Field Programmable Logic and Applications | 2015

Ultra-fast NoC emulation on a single FPGA

Thiem Van Chu; Shimpei Sato; Kenji Kise

Network-on-Chip (NoC) has become the de facto on-chip communication architecture for many-core systems. This paper proposes novel methods for emulating large-scale NoC designs on a single FPGA. Since FPGAs offer a highly parallel platform, FPGA-based emulation can be much faster than the software-based approach. However, emulating NoC designs with up to thousands of nodes is challenging due to FPGA capacity constraints. We first describe how to accurately model synthetic workloads on an FPGA by separating the time of the emulated network from the times of the traffic generation units. We next present a novel use of time-multiplexing that emulates the entire network using several physical nodes. Finally, we show the basic steps for applying the proposed methods to different NoC architectures. The proposed methods enable ultra-fast emulation of large-scale NoC designs with up to thousands of nodes using only the on-chip resources of a single FPGA. In particular, a simulation speedup of more than 5,000× over BookSim, a widely used software-based NoC simulator, is achieved.
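The time-multiplexing idea above can be sketched in software: one shared evaluation unit is reused to advance every logical node each emulated cycle, so a thousand-node network fits in resources sized for a handful of nodes. The state and node behavior below are illustrative placeholders, not the paper's RTL design.

```python
# Sketch of time-multiplexed NoC emulation: a single "physical" evaluation
# unit advances all logical nodes in turn, keeping per-node state in a
# table (block RAM on a real FPGA). Illustrative only.

def emulate(num_nodes, cycles, inject):
    """Advance `num_nodes` logical nodes for `cycles` emulated cycles
    using one shared step function; `inject(node, cycle)` -> packets in."""
    state = [0] * num_nodes  # packets queued at each logical node
    delivered = 0
    for cycle in range(cycles):
        for node in range(num_nodes):      # one physical unit, reused
            state[node] += inject(node, cycle)
            if state[node] > 0:            # forward one packet per cycle
                state[node] -= 1
                delivered += 1
    return delivered

# Every node injects one packet per emulated cycle; each node also forwards
# one per cycle, so all injected packets are delivered.
print(emulate(num_nodes=16, cycles=10, inject=lambda n, c: 1))  # 160
```

The trade-off is emulation speed: each emulated cycle now takes `num_nodes` physical steps, which is exactly the capacity-versus-throughput balance the paper's methods manage.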


Field-Programmable Custom Computing Machines | 2015

Enabling Fast and Accurate Emulation of Large-Scale Network on Chip Architectures on a Single FPGA

Thiem Van Chu; Shimpei Sato; Kenji Kise

Network on Chip (NoC) has become the de facto on-chip communication architecture of many-core systems. This paper proposes an FPGA-based NoC emulator that achieves ultra-fast simulation speed. We improve the scalability of the NoC emulator without simplifying the emulated architectures or using off-chip resources. We introduce a novel method that accurately emulates NoC designs under synthetic workloads without using a large amount of memory, by decoupling the time of the emulated NoC from the time of the traffic generators. Additionally, we propose a method based on the time-division multiplexing technique to emulate the behavior of the entire network using several physical nodes while using FPGA resources effectively. We show that an implementation of the proposed NoC emulator on a Virtex-7 FPGA achieves a 2,745× simulation speedup over BookSim, one of the most widely used software-based NoC simulators, while maintaining simulation accuracy.
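One way to picture the memory saving from decoupling generator time from network time: instead of storing a full precomputed traffic trace, each generator produces its next packet on demand, stamped in network time. The sketch below shows that lazy-generation idea in Python; the names and structure are illustrative, as the paper's actual mechanism is an RTL design.

```python
# Sketch of decoupled traffic generation: lazily produce timestamped
# packets per node instead of materializing a trace, so memory stays
# O(nodes) rather than O(packets). Illustrative, not the paper's design.
import heapq
import random

def make_generator(node, rate, seed):
    """Yield (injection_time, source) pairs at a Bernoulli rate."""
    rng = random.Random(seed)
    t = 0
    while True:
        t += 1
        if rng.random() < rate:
            yield (t, node)

def run(num_nodes, rate, horizon):
    """Merge lazily generated traffic in network-time order."""
    gens = [make_generator(n, rate, seed=n) for n in range(num_nodes)]
    heap = [(next(g), i) for i, g in enumerate(gens)]  # one pending each
    heapq.heapify(heap)
    injected = 0
    while heap and heap[0][0][0] <= horizon:
        (t, node), i = heapq.heappop(heap)
        injected += 1                  # network consumes packet at time t
        heapq.heappush(heap, (next(gens[i]), i))
    return injected

print(run(num_nodes=4, rate=0.5, horizon=100))
```

Because the generators are seeded, the same traffic is reproduced on every run, which is what makes decoupled emulation deterministic and hence accurate.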


Applied Reconfigurable Computing | 2015

ArchHDL: A Novel Hardware RTL Design Environment in C++

Shimpei Sato; Kenji Kise

LSIs are designed in four stages: architectural design, logic design, circuit design, and physical design. In both the architectural design and the logic design, designers describe hardware at RTL, but they generally use different languages: typically a general-purpose programming language such as C or C++ for the architectural design and a hardware description language such as Verilog HDL or VHDL for the logic design. In this paper, we propose a new hardware description environment for architectural and logic design that aims to describe and verify hardware in one language. The environment consists of (1) a new hardware description language called ArchHDL, which enables hardware simulation faster than Verilog HDL simulation, and (2) a source code translation tool from ArchHDL to Verilog HDL. ArchHDL is a new language for hardware RTL modeling based on C++. The key features of this language are that (1) designers describe a combinational circuit as a function and (2) the ArchHDL library implements non-blocking assignment in C++. Using these features, designers can write hardware in a Verilog HDL-like style. ArchHDL source code can be converted to Verilog HDL by the translation tool and synthesized for an FPGA or an ASIC. We implemented a many-core processor in ArchHDL. Its simulation in ArchHDL is about 4.5 times faster than simulation with Synopsys VCS. We also converted the code to Verilog HDL and estimated the hardware resources on an FPGA: the 48-node many-core processor requires 71% of the entire resources of a Virtex-7.
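ArchHDL itself is a C++ library; the Python sketch below only illustrates the non-blocking-assignment semantics it implements: every register reads its old value during a cycle, and all staged updates commit together at the clock edge.

```python
# Illustration of non-blocking assignment semantics (Verilog's `<=`),
# the key feature the ArchHDL library provides in C++. Hypothetical
# class names; not the ArchHDL API.

class Reg:
    def __init__(self, value=0):
        self.q = value       # current (visible) value
        self._d = value      # next value, staged by assign()

    def assign(self, value):  # Verilog: q <= value
        self._d = value

    def clock(self):          # commit at the rising clock edge
        self.q = self._d

# Two registers swapping values: the classic case where blocking
# assignment would go wrong but non-blocking assignment works, because
# both right-hand sides read the pre-edge values.
a, b = Reg(1), Reg(2)
a.assign(b.q)
b.assign(a.q)
for r in (a, b):
    r.clock()
print(a.q, b.q)  # 2 1
```

Because updates are staged, evaluation order within a cycle does not matter, which is what lets a C++ (or here, Python) program model synchronous RTL faithfully.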


ACM SIGARCH Computer Architecture News | 2013

The Ultrasmall soft processor

Yuichiroh Tanaka; Shimpei Sato; Kenji Kise

A soft processor is a processor implemented using logic synthesis, mainly targeting programmable logic devices such as FPGAs, and it has become a common component of FPGA designs. The Supersmall soft processor (small-core) developed at the University of Toronto is a unique soft processor because its main concern is very low hardware cost while supporting a 32-bit ISA. With the same concept as small-core, we are developing the Ultrasmall soft processor (UltraSmall) based on small-core. The goal of this project is to implement the smallest 32-bit ISA soft processor while aiming to achieve high performance. We propose UltraSmall and describe its key ideas and implementations. The evaluation results indicate that the hardware cost of UltraSmall on the latest FPGA is smaller than that of small-core while achieving 1.8× the performance of small-core.


International Conference on Networking and Computing | 2010

Pattern-Based Systematic Task Mapping for Many-Core Processors

Shintaro Sano; Masahiro Sano; Shimpei Sato; Takefumi Miyoshi; Kenji Kise

The Network-on-Chip (NoC) is a promising interconnect for many-core processors. On NoC-based many-core processors, the network performance of multi-threaded programs depends on the task mapping method. In this paper, we propose a pattern-based task mapping method to improve the performance of many-core processors. Evaluation of the proposed method with a detailed software simulator and the NAS Parallel Benchmarks reveals an average performance improvement of at least 4.4% compared with standard task mapping.
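A small sketch of why task mapping matters on a mesh NoC: total hop count for a communication pattern depends on where threads are placed. The mappings and traffic below are made-up illustrations, not the paper's pattern set.

```python
# Total hop count of a traffic pattern under a task mapping on a 2D mesh
# with XY-distance routing. Illustrative mappings, not the paper's patterns.

def hops(mapping, traffic, width):
    """Sum Manhattan distances for (src_task, dst_task, packets) traffic,
    where mapping[task] = node index on a `width`-wide 2D mesh."""
    total = 0
    for src, dst, pkts in traffic:
        sx, sy = mapping[src] % width, mapping[src] // width
        dx, dy = mapping[dst] % width, mapping[dst] // width
        total += (abs(sx - dx) + abs(sy - dy)) * pkts
    return total

# Four tasks on a 2x2 mesh; task 0 talks heavily to task 1.
traffic = [(0, 1, 10), (2, 3, 1)]
adjacent = {0: 0, 1: 1, 2: 2, 3: 3}   # tasks 0 and 1 are neighbors
diagonal = {0: 0, 1: 3, 2: 1, 3: 2}   # tasks 0 and 1 sit diagonally
print(hops(adjacent, traffic, width=2))  # 11
print(hops(diagonal, traffic, width=2))  # 22
```

Placing the heavily communicating pair on adjacent nodes halves the hop count here, which is the kind of gain a systematic mapping pattern aims to capture.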


ACM Conference on Systems, Programming, Languages and Applications: Software for Humanity | 2015

Investigating potential performance benefits of memory layout optimization based on roofline model

Shimpei Sato; Yukinori Sato; Toshio Endo

Performance tuning on a single CPU remains an essential base for massively parallelized applications in the upcoming exascale era to achieve their potential performance relative to the peak. In this paper, we investigate the room for performance improvement by searching possible memory layout optimizations. The target application is a stencil computation, and we use the roofline model as its performance model. Analysis with the roofline model and with a performance analysis tool we have been developing suggests that performance can be improved by applying padding to the source code. We therefore explore an appropriate memory layout by brute-force search over 1,000 patterns randomly generated from the possible padding parameters. Evaluation of the application performance on a single node shows that the application with the resulting memory layout runs 4.3 times faster than the original.
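The roofline model used above bounds attainable performance by the minimum of peak compute and memory bandwidth times arithmetic intensity. A one-function sketch, with made-up machine numbers rather than the paper's platform:

```python
# Roofline model: attainable performance is capped either by the compute
# peak or by memory bandwidth times arithmetic intensity (flops per byte
# moved). The peak/bandwidth figures below are placeholders.

def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s for a kernel of the given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

peak, bw = 100.0, 20.0
# A stencil kernel is typically memory-bound: few flops per byte moved.
print(roofline(peak, bw, 0.25))  # 5.0   (memory-bound: 20 * 0.25)
print(roofline(peak, bw, 8.0))   # 100.0 (compute-bound: hits the peak)
```

For a memory-bound stencil, layout changes such as padding do not move the roofline itself; they recover bandwidth lost to cache conflicts, pushing measured performance closer to the memory-bound ceiling.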


ACM Conference on Systems, Programming, Languages and Applications: Software for Humanity | 2015

Exana: an execution-driven application analysis tool for assisting productive performance tuning

Yukinori Sato; Shimpei Sato; Toshio Endo

As modern memory subsystems have become complex, tuning application code for their deep memory hierarchy is critical to realizing their potential performance. However, this tuning has depended on time-consuming, empirical work done by hand by domain experts. To assist such a performance tuning process, we have been developing an application analysis tool called Exana and have attempted to automate parts of the process. Using already compiled executable binary code as input, Exana can transparently analyze program structures, data dependences, memory access characteristics, and cache hit/miss statistics across a program execution. In this paper, we demonstrate the usefulness and productivity of these analyses and evaluate their overheads. After demonstrating that our analysis is feasible and useful for actual HPC application programs, we show that the overheads of Exana's analyses are much lower than those of existing architectural simulators.


2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs | 2014

KNoCEmu: High Speed FPGA Emulator for Kilo-node Scale NoCs

Thiem Van Chu; Shimpei Sato; Kenji Kise

Many-core architectures are becoming mainstream in both processor designs and System-on-Chip (SoC) designs. With the growing number of cores on a chip, Network-on-Chip (NoC) has become the de facto on-chip communication infrastructure. Since near-future many-core architectures are expected to have thousands of cores integrated on a single chip, it is essential to have both full-system simulators and stand-alone NoC simulators for supporting architectural design exploration and performance evaluation of such kilo-node scale many-core systems. This paper proposes KNoCEmu, an FPGA emulator that achieves fast and cycle-accurate kilo-node scale NoC simulations. To overcome the limitation of FPGA resources, we propose a method that reduces the amount of required FPGA resources while maintaining simulation accuracy. The time-division multiplexing technique is adopted to emulate the behavior of the entire network using one or several physical nodes. Our design is optimized to efficiently use FPGA resources such as block RAM and distributed RAM. We have implemented KNoCEmu for emulating a 32×32 mesh NoC with a conventional input-buffered router on a Virtex-7 XC7VX485T FPGA and evaluated the amount of occupied FPGA resources and the simulation speedup over a software-based simulator. The evaluation results show that KNoCEmu achieves a 134× simulation speedup over the software-based simulator while using less than 8% of the available slices of the Virtex-7 XC7VX485T FPGA.

Collaboration


Dive into Shimpei Sato's collaborations.

Top Co-Authors

Kenji Kise (Tokyo Institute of Technology)
Hiroki Nakahara (Tokyo Institute of Technology)
Haruyoshi Yonekawa (Tokyo Institute of Technology)
Thiem Van Chu (Tokyo Institute of Technology)