
Publications


Featured research published by Ryota Shioya.


International Symposium on Microarchitecture | 2014

A Front-end Execution Architecture for High Energy Efficiency

Ryota Shioya; Masahiro Goshima; Hideki Ando

Smartphones and tablets have recently become widespread and dominant in the computer market. Users require that these mobile devices provide a high-quality experience and ever higher performance. Hence, major developers adopt out-of-order superscalar processors as application processors. However, these processors consume much more energy than in-order superscalar processors, because a large amount of energy is consumed by the hardware for dynamic instruction scheduling. We propose a Front-end Execution Architecture (FXA). FXA has two execution units: an out-of-order execution unit (OXU) and an in-order execution unit (IXU). The OXU is the execution core of a common out-of-order superscalar processor. In contrast, the IXU comprises only functional units and a bypass network. The IXU is placed at the processor front end and executes instructions without scheduling. Fetched instructions are first fed to the IXU, and instructions that are already ready, or that become ready through operand bypassing in the IXU, are executed in order. Instructions that are not ready pass through the IXU as NOPs, so the pipeline is not stalled and instructions keep flowing. These not-ready instructions are then dispatched to the OXU and executed out of order. The IXU contains no dynamic scheduling logic, so its energy consumption is small. Evaluation results show that FXA can execute more than 50% of instructions in the IXU, making it possible to shrink the energy-consuming OXU without incurring performance degradation. As a result, FXA achieves both high performance and low energy consumption. We evaluated FXA against conventional out-of-order/in-order superscalar processors modeled after the ARM big.LITTLE architecture. The results show that FXA achieves performance improvements of 67% at the maximum and 7.4% on the geometric mean over the SPEC CPU INT 2006 benchmark suite relative to a conventional superscalar processor (big), while reducing energy consumption by 86% in the issue queue and 17% in the whole processor. The performance/energy ratio (the inverse of the energy-delay product) of FXA is 25% higher than that of a conventional out-of-order superscalar processor (big) and 27% higher than that of a conventional in-order superscalar processor (LITTLE).
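
As a rough illustration of the dispatch policy sketched in this abstract, the toy Python snippet below filters a fetched instruction group: instructions whose sources are ready (or become ready through bypassing from older instructions in the same group) execute in order in the IXU, while the rest pass through as NOPs and are handed to the OXU. The instruction model, names, and register set are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): a front-end in-order
# unit (IXU) executes instructions whose source operands are ready, bypassing
# results within the group; not-ready instructions pass through as NOPs and
# are dispatched to the out-of-order unit (OXU).
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dst: str | None           # destination register (None for stores/branches)
    srcs: tuple[str, ...]     # source registers

def front_end_execute(group, ready_regs):
    """Split a fetched group into IXU-executed and OXU-dispatched instructions."""
    ixu_executed, oxu_dispatched = [], []
    bypass = set(ready_regs)          # values visible via the IXU bypass network
    for ins in group:                 # program order (in-order execution)
        if all(src in bypass for src in ins.srcs):
            ixu_executed.append(ins)
            if ins.dst:
                bypass.add(ins.dst)   # forward the result to younger instructions
        else:
            # Not ready: flows through the IXU as a NOP, then goes to the OXU.
            oxu_dispatched.append(ins)
    return ixu_executed, oxu_dispatched

if __name__ == "__main__":
    group = [
        Instr("i0", "r1", ("r2",)),   # ready: r2 is already available
        Instr("i1", "r3", ("r1",)),   # becomes ready via bypass of r1
        Instr("i2", "r4", ("r9",)),   # not ready: r9 is still pending
    ]
    ixu, oxu = front_end_execute(group, ready_regs={"r2"})
    print([i.name for i in ixu], [i.name for i in oxu])  # ['i0', 'i1'] ['i2']
```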


Pacific Rim International Symposium on Dependable Computing | 2009

Low-Overhead Architecture for Security Tag

Ryota Shioya; Daewung Kim; Kazuo Horio; Masahiro Goshima; Shuichi Sakai

A security-tagged architecture applies tags to data to detect attacks or information leakage by tracking data flow. Previous studies of security-tagged architectures mostly focused on how to utilize tags, not on how the tags are implemented. A naive implementation simply adds a tag field to every byte of the cache and the memory; such a technique, however, results in a huge hardware overhead. This paper proposes a low-overhead tagged architecture. We achieve our goal by exploiting properties of tags, namely their non-uniformity and locality of reference. Our design includes a uniquely designed multi-level table and various cache-like structures, all of which contribute to exploiting these properties. In simulation, our method limited the memory overhead to 1.8%, whereas a naive implementation suffered a 12.5% overhead.
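
The sketch below is a hypothetical illustration of how a multi-level table can exploit tag non-uniformity: per-byte tag storage is allocated lazily, only for pages that actually hold a non-default tag. The page granularity, default tag value, and class layout are assumptions, not the paper's design.

```python
# Toy two-level tag table (illustrative only, not the paper's design):
# most data carries the default tag, so second-level tag arrays are
# allocated lazily, only for pages that contain a non-default tag.

PAGE_SHIFT = 12                      # 4 KiB pages (assumed granularity)
PAGE_SIZE = 1 << PAGE_SHIFT
DEFAULT_TAG = 0

class TagTable:
    def __init__(self):
        self.pages = {}              # page number -> bytearray of per-byte tags

    def set_tag(self, addr, tag):
        page, offset = addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)
        if tag == DEFAULT_TAG and page not in self.pages:
            return                   # nothing to store for default-tagged pages
        leaf = self.pages.setdefault(page, bytearray(PAGE_SIZE))
        leaf[offset] = tag

    def get_tag(self, addr):
        page, offset = addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)
        leaf = self.pages.get(page)
        return DEFAULT_TAG if leaf is None else leaf[offset]

if __name__ == "__main__":
    t = TagTable()
    t.set_tag(0x400123, 1)                              # mark one byte as tainted
    print(t.get_tag(0x400123), t.get_tag(0x7fff0000))   # 1 0
    print(len(t.pages), "leaf page(s) allocated")       # only touched pages cost space
```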


International Conference on Computer Design | 2014

Energy efficiency improvement of renamed trace cache through the reduction of dependent path length

Ryota Shioya; Hideki Ando

Renaming logic is a high-cost module in a superscalar processor and consumes significant energy. To mitigate this, the renamed trace cache (RTC), which caches renamed operands, was proposed. However, conventional RTCs have several problems, such as low capacity efficiency, large hardware overhead, and insufficient caching of renamed operands. We propose a semi-global renamed trace cache (SGRTC) that caches only renamed operands whose distances from producers outside the trace are short, which solves the problems of conventional RTCs. Evaluation results show that SGRTC achieves 64% lower energy consumption for renaming with a 0.2% performance overhead compared to a conventional processor.
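
The following toy filter illustrates the kind of distance-based caching rule the abstract mentions: a renamed source operand is cached only if its producer lies outside the trace and within a short distance. The threshold and the dynamic-position model are invented for the example and are not taken from the paper.

```python
# Illustrative filter (assumptions only): cache a renamed source operand
# only when its producer lies outside the trace and is within a short
# distance, measured in dynamic instructions before the trace start.

MAX_PRODUCER_DISTANCE = 8            # assumed threshold, not from the paper

def should_cache_renamed(producer_pos, trace_start_pos):
    """producer_pos: dynamic position of the producing instruction.
    trace_start_pos: dynamic position where the cached trace begins.
    Producers inside the trace are renamed within the trace itself, so only
    short-distance external producers are worth caching in renamed form."""
    if producer_pos >= trace_start_pos:
        return False                 # producer is inside the trace
    return (trace_start_pos - producer_pos) <= MAX_PRODUCER_DISTANCE

if __name__ == "__main__":
    print(should_cache_renamed(95, 100))   # True: external producer, distance 5
    print(should_cache_renamed(50, 100))   # False: distance 50 is too long
    print(should_cache_renamed(103, 100))  # False: producer inside the trace
```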


Automation, Robotics and Control Systems | 2018

A Tightly Coupled Heterogeneous Core with Highly Efficient Low-Power Mode

Yasumasa Chidai; Kojiro Izuoka; Ryota Shioya; Masahiro Goshima; Hideki Ando

A tightly coupled heterogeneous core (TCHC) has heterogeneous execution units with different characteristics inside the core. The composite core (CC) and the front-end execution architecture (FXA) are examples of state-of-the-art TCHCs. These TCHCs have in-order and out-of-order execution units in the core, and they selectively execute instructions in order, which improves energy efficiency without significant performance degradation compared to out-of-order execution. However, these TCHCs cannot improve energy efficiency sufficiently: CC has a large penalty for switching between execution units and thus cannot execute instructions in order often enough, while FXA cannot suspend its energy-consuming out-of-order execution unit while it executes instructions in order. We propose a dual-mode front-end execution architecture (DM-FXA), which is based on FXA. DM-FXA has a low-power execution mode that completely suspends the out-of-order execution unit during in-order execution, and thus DM-FXA consumes less energy than FXA. In addition, DM-FXA has a smaller switching penalty than CC. In our evaluation, the proposed methods reduce energy consumption by 34.7% compared with a conventional out-of-order processor, while performance degradation is within 3.2%.
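
Below is a speculative sketch of a dual-mode policy in the spirit of the abstract: the out-of-order unit is suspended once the in-order unit has been covering nearly all recent instructions, and resumed when coverage drops. The window size and thresholds are invented; the paper's actual mode-switching mechanism is not reproduced here.

```python
# Speculative sketch of a dual-mode policy (thresholds and window size are
# invented): suspend the out-of-order unit (OXU) when the in-order unit (IXU)
# has been handling nearly all recent instructions, resume it otherwise.

from collections import deque

class DualModeController:
    def __init__(self, window=1024, enter=0.98, exit_=0.90):
        self.history = deque(maxlen=window)   # True if instruction ran in the IXU
        self.enter_threshold = enter          # coverage needed to power down the OXU
        self.exit_threshold = exit_           # coverage below which the OXU resumes
        self.oxu_suspended = False

    def record(self, executed_in_ixu: bool):
        self.history.append(executed_in_ixu)
        if len(self.history) < self.history.maxlen:
            return                            # wait until the window is full
        coverage = sum(self.history) / len(self.history)
        if not self.oxu_suspended and coverage >= self.enter_threshold:
            self.oxu_suspended = True         # enter the low-power, in-order mode
        elif self.oxu_suspended and coverage < self.exit_threshold:
            self.oxu_suspended = False        # resume out-of-order execution

if __name__ == "__main__":
    ctrl = DualModeController(window=8, enter=0.9, exit_=0.5)
    for in_ixu in [True] * 8 + [False] * 5:
        ctrl.record(in_ixu)
    print(ctrl.oxu_suspended)   # False: coverage dropped, so the OXU resumed
```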


Journal of Information Processing | 2018

Bank-Aware Instruction Scheduler for a Multibanked Register File

Junji Yamada; Ushio Jimbo; Ryota Shioya; Masahiro Goshima; Shuichi Sakai

The region that includes the register file is a hot spot in high-performance cores that limits the clock frequency. Although multibanking drastically reduces the area and energy consumption of the register files of superscalar processor cores, it suffers from low IPC due to bank conflicts. This paper proposes a bank-aware instruction scheduler that selects instructions so that no bank conflict occurs, except for a bank conflict within a single instruction. The evaluation results show that, compared with NORCS, the latest prior work on area- and energy-efficient register files, the proposed register file with 24 banks achieves reductions of 20.9% in circuit area and 56.0% in energy consumption, while maintaining a relative IPC of 97.0% and the latency of the instruction scheduler.
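
To make the selection rule concrete, here is a hypothetical greedy sketch (not the paper's scheduler circuit): ready instructions are picked oldest-first each cycle, skipping any instruction whose source banks collide with banks already claimed by other selected instructions, while a conflict between the sources of a single instruction is tolerated. The bank mapping and issue width are assumptions.

```python
# Hypothetical greedy selection sketch (not the paper's scheduler circuit):
# pick ready instructions oldest-first, but skip an instruction if one of its
# source registers maps to a bank already claimed this cycle by another
# selected instruction. A conflict between the sources of the *same*
# instruction is allowed, matching the rule stated in the abstract.

NUM_BANKS = 24
ISSUE_WIDTH = 4

def bank_of(reg: int) -> int:
    return reg % NUM_BANKS            # assumed bank interleaving of registers

def select(ready_instrs):
    """ready_instrs: list of (name, source_registers), oldest first."""
    selected, claimed_banks = [], set()
    for name, srcs in ready_instrs:
        banks = {bank_of(r) for r in srcs}
        if banks & claimed_banks:     # would conflict with another instruction
            continue
        selected.append(name)
        claimed_banks |= banks
        if len(selected) == ISSUE_WIDTH:
            break
    return selected

if __name__ == "__main__":
    ready = [("i0", [1, 2]), ("i1", [25, 3]), ("i2", [4, 28]), ("i3", [5, 29])]
    # bank_of(1) == bank_of(25), so i1 is skipped; i2's sources both map to
    # bank 4, a within-instruction conflict that is tolerated.
    print(select(ready))              # ['i0', 'i2', 'i3']
```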


Journal of Information Processing | 2018

Performance Improvement Techniques in Tightly Coupled Multicore Architectures for Single-Thread Applications

Keita Doi; Ryota Shioya; Hideki Ando

Current multicore processors achieve high throughput by executing multiple independent programs in parallel. However, it is difficult to utilize multiple cores effectively to reduce the execution time of a single program, due to a variety of problems including slow inter-thread communication and high-overhead thread creation. Dramatic improvements in single-core architecture have reached their limit; thus, it is necessary to use multiple cores effectively to reduce single-program execution time. Tightly coupled multicore architectures provide a potential solution because of their very low-latency inter-thread communication and very lightweight thread creation. One such multicore architecture, called SKY, has been proposed. SKY has shown its effectiveness in multithreaded execution of a single program, but several problems must be overcome before further performance improvements can be achieved. The problems this paper focuses on are as follows: 1) The SKY compiler partitions programs at the basic-block level but does not explore the inside of basic blocks, missing an opportunity to find better partitionings. 2) The SKY processor always sequentializes a new thread if the core on which it is supposed to be created is busy, which is not necessarily a good decision. 3) If the execution of register communication instructions among cores is delayed, other register communication instructions can also be delayed, causing the following thread's execution to stall; this situation occurs when the instruction window of a core becomes full. To address these problems, we propose the following three software and hardware techniques: 1) instruction-level thread partitioning, in which the compiler explores the inside of basic blocks to find a better program partition; 2) selective thread creation, in which the hardware selectively sequentializes a new thread or waits for its creation to achieve better performance; 3) automatic register communication, in which register communication is performed automatically by small dedicated hardware instead of consuming instruction window resources. We evaluated the performance of SKY using SPEC2000 benchmark programs. Results on four cores show that the proposed techniques improved performance by 4% and 26% on average (maximum of 11% and 206%) for SPECint2000 and SPECfp2000 programs, respectively, compared with the case where the proposed techniques are not applied. As a result, performance improvements of 1.21 and 1.93 times on average (maximum of 1.52 and 3.30 times) were achieved, respectively, compared with the performance of a single core.
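
As a hedged illustration of the second technique (selective thread creation), the sketch below chooses between waiting for a busy target core and sequentializing the would-be thread on the forking core, based on a simple cost comparison. The cost model, cycle estimates, and threshold are invented; SKY's actual hardware policy is not described here.

```python
# Hedged sketch of a "selective thread creation" style decision (the paper's
# actual policy and cost model are not given here; numbers are invented):
# when the target core is busy, either wait for it to free up or fall back to
# sequential execution on the forking core, whichever looks cheaper.

def decide_fork(target_core_busy_cycles, thread_body_cycles,
                fork_overhead_cycles=10):
    """Return 'fork' if the target core is free, 'wait' to fork once it frees,
    or 'sequentialize' to run the would-be thread inline on the forking core."""
    if target_core_busy_cycles == 0:
        return "fork"                            # target core is free: fork now
    parallel_cost = target_core_busy_cycles + fork_overhead_cycles
    sequential_cost = thread_body_cycles         # inline execution blocks the forker
    return "wait" if parallel_cost < sequential_cost else "sequentialize"

if __name__ == "__main__":
    print(decide_fork(0, 500))      # fork: target core is idle
    print(decide_fork(50, 500))     # wait: short wait, long thread body
    print(decide_fork(400, 120))    # sequentialize: waiting costs more than inlining
```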


Pacific Rim Conference on Communications, Computers and Signal Processing | 2017

Initial study of a phase-aware scheduling for hardware transactional memory

Tomoki Tajimi; Anju Hirota; Ryota Shioya; Masahiro Goshima; Tomoaki Tsumura

Transactional memory is a promising paradigm for shared-memory parallel programming. Effective transaction scheduling is very important for transactional memory systems, and a substantial body of work has been devoted to it. We previously proposed a transaction scheduling scheme that considers execution-path variation within transactions; it works well with many types of programs, but some programs still cannot gain performance. In this paper, we focus on such programs and investigate the reasons for their low performance by analyzing conflict-prediction accuracy and typical conflict patterns. We then propose a novel phase-aware transaction scheduling scheme that resolves one of the harmful conflict patterns. Evaluation results show that phase-aware scheduling can greatly improve the performance of one of the benchmark programs, indicating its potential.
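
The snippet below is a very rough, assumption-laden sketch of a phase-aware scheduling idea: transactions report a phase identifier, and two transactions predicted to conflict are serialized only when they are in the same phase. The conflict table, phase notion, and locking structure are illustrative inventions, not the scheduler proposed in the paper.

```python
# Very rough sketch (purely illustrative; the paper's conflict predictor and
# phase definition are not reproduced here): transactions report a phase id,
# and two transactions predicted to conflict are serialized only when they
# are in the same phase; otherwise they may run concurrently.

import threading

class PhaseAwareScheduler:
    def __init__(self, conflict_table):
        # conflict_table: set of frozenset({tx_a, tx_b}) predicted to conflict
        self.conflict_table = conflict_table
        self.lock = threading.Lock()
        self.running = {}            # transaction name -> phase id

    def try_start(self, tx, phase):
        """Return True if tx may start now, False if it should be deferred."""
        with self.lock:
            for other, other_phase in self.running.items():
                same_phase = (other_phase == phase)
                predicted = frozenset({tx, other}) in self.conflict_table
                if same_phase and predicted:
                    return False     # serialize: likely conflict in this phase
            self.running[tx] = phase
            return True

    def finish(self, tx):
        with self.lock:
            self.running.pop(tx, None)

if __name__ == "__main__":
    sched = PhaseAwareScheduler({frozenset({"T1", "T2"})})
    print(sched.try_start("T1", phase=0))   # True
    print(sched.try_start("T2", phase=0))   # False: same phase, predicted conflict
    print(sched.try_start("T3", phase=0))   # True: no predicted conflict with T1
    sched.finish("T1")
    print(sched.try_start("T2", phase=0))   # True: T1 has finished
```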


Archive | 2009

Information processing device, information processing method, and computer readable recording medium

Satoshi Katsunuma; Masahiro Goshima; Hidetsugu Irie; Ryota Shioya; Shuichi Sakai


IEICE Transactions on Electronics | 2015

Address Order Violation Detection with Parallel Counting Bloom Filters

Naruki Kurata; Ryota Shioya; Masahiro Goshima; Shuichi Sakai


IEICE Transactions on Information and Systems | 2013

Register Indirect Jump Target Forwarding

Ryota Shioya; Naruki Kurata; Takashi Toyoshima; Masahiro Goshima; Shuichi Sakai

Collaboration


An overview of Ryota Shioya's top co-authors.

Ushio Jimbo

Graduate University for Advanced Studies

Anju Hirota

Nagoya Institute of Technology
