Yuki Fukazawa
Mie University
Publications
Featured research published by Yuki Fukazawa.
Concurrency and Computation: Practice and Experience | 2018
BaoKang Wang; Yuki Fukazawa; Toshio Kondo; Takahiro Sasaki
Ineffective column-directional cache memory access has become a bottleneck for efficient two-dimensional (2-D) data processing utilizing extended single instruction multiple data (SIMD) instructions. To solve this problem, we propose a cache memory with tile (column and row directions) and line (row direction) accessibility for efficient 2-D data processing. 2-D data access to the proposed cache memory is enabled via a hardware-based multi-mode address translation unit that eliminates the overhead of software-based address calculation. To reduce the hardware overhead of the proposed cache, we propose a tag memory reduction method that replaces multiple tiles with an aligned tile set (RATS) in the cache. To verify the feasibility of the proposed cache, an LSI layout of a SIMD-based general-purpose-oriented datapath embedding the proposed cache was designed in a 2.5 × 5 mm² area using 0.18-μm CMOS technology. Under a 3.9-ns clock period (250 MHz), the read latency is limited to 3 clock cycles, the same as that of a conventional cache memory. With the RATS method, the hardware overhead of the proposed cache is reduced to only 7% of that required for a conventional cache. In addition, simulation results indicate that the proposed cache considerably reduces L1 and L2 cache conflict misses compared with a conventional cache for power-of-two matrix sizes, because the column-directional address stride is sufficiently smaller than the page size. The proposed cache therefore provides column-directional parallel access as efficiently as row-directional parallel access, enabling SIMD operation without transposition in matrix multiplication (MM). For LU decomposition (LUD), the proposed cache provides almost the same performance for a column-major-based LUD program as for a row-major-based one. These results show that the proposed cache does not restrict the freedom to choose either row- or column-major order coding.
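The column-stride argument can be made concrete with a small software model. The sketch below is an illustration only, not the paper's hardware translation unit; the matrix size, tile edge, and element size are assumptions. It compares the byte stride of column-directional access under a plain row-major layout with the stride inside a tiled layout.

```c
/* Illustrative software model (not the paper's hardware unit): maps a
 * (row, col) element index to a byte address under (a) plain row-major
 * layout and (b) a hypothetical T x T tiled layout, to show why the
 * column-directional stride shrinks with tiling. */
#include <stdio.h>
#include <stdint.h>

#define N    1024            /* power-of-two matrix dimension (assumed) */
#define T    8               /* tile edge in elements (assumed)         */
#define ELEM sizeof(float)   /* element size in bytes                   */

/* (a) conventional row-major address */
static uint64_t addr_row_major(uint64_t row, uint64_t col)
{
    return (row * N + col) * ELEM;
}

/* (b) tiled address: tiles stored contiguously in row-major order of
 * tiles, elements stored row-major inside each tile */
static uint64_t addr_tiled(uint64_t row, uint64_t col)
{
    uint64_t tile_row = row / T, tile_col = col / T;
    uint64_t in_row   = row % T, in_col   = col % T;
    uint64_t tile_id  = tile_row * (N / T) + tile_col;
    return (tile_id * T * T + in_row * T + in_col) * ELEM;
}

int main(void)
{
    /* Walking down a column: the row-major stride is N*ELEM = 4096 bytes,
     * so successive elements fall into the same cache set for power-of-two
     * sizes; within a tile the tiled stride is only T*ELEM = 32 bytes. */
    printf("row-major column stride: %llu bytes\n",
           (unsigned long long)(addr_row_major(1, 0) - addr_row_major(0, 0)));
    printf("tiled column stride (within a tile): %llu bytes\n",
           (unsigned long long)(addr_tiled(1, 0) - addr_tiled(0, 0)));
    return 0;
}
```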
International Conference on Acoustics, Speech, and Signal Processing | 2016
Yuki Fukazawa; Keita Watanabe; Yuki Minoura; Toshio Kondo; Takahiro Sasaki
Even high-performance general-purpose processors are not sufficient for fast motion estimation (ME), even though they support some SIMD instructions for ME. In this paper, we propose a SIMD-based general-purpose-oriented datapath with an operation structure suited to efficient ME. Several components are added to a conventional datapath base to accelerate ME. The results of logic design and chip layout show that the proposed datapath accelerates motion estimation by 3.99-5.06 times on average for the diamond, SUC, and non-dense search patterns compared with the previous method, with only a 2.6% increase in hardware area.
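For context, the kernel that such SIMD support targets is the sum of absolute differences (SAD) used in block matching. The scalar sketch below is a generic illustration, not the proposed datapath; the block size and frame layout are assumptions, and SIMD ISAs typically fuse the inner absolute-difference accumulation into a single instruction (PSADBW-like).

```c
/* Minimal scalar sketch of the SAD kernel at the heart of block-matching
 * motion estimation.  Block size and synthetic test data are assumptions
 * chosen only for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

enum { BLOCK = 16 };

/* Sum of absolute differences between a current block and a candidate
 * block in the reference frame (both row-major with the given stride). */
static uint32_t sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < BLOCK; y++)
        for (int x = 0; x < BLOCK; x++)
            sum += (uint32_t)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

int main(void)
{
    uint8_t cur[BLOCK * BLOCK], ref[BLOCK * BLOCK];
    for (int i = 0; i < BLOCK * BLOCK; i++) {
        cur[i] = (uint8_t)i;
        ref[i] = (uint8_t)(i + 1);
    }
    printf("SAD = %u\n", sad16x16(cur, ref, BLOCK));
    return 0;
}
```

A search strategy such as the diamond pattern mentioned in the abstract simply decides at which candidate positions this kernel is evaluated.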
International Symposium on Computing and Networking | 2016
Kouki Kayamuro; Takahiro Sasaki; Yuki Fukazawa; Toshio Kondo
A multi-core processor is widely used to achieve both high performance and low energy consumption. However, verifying a multi-core processor is more difficult than verifying a single-core processor, because a multi-core processor has large and complex circuits, and special mechanisms such as cache coherency further increase the complexity. In general, the design flow of a processor includes steps such as functional verification with a high-level language, cycle-accurate verification by RTL simulation, and timing analysis that considers gate delay, wiring delay, and other factors. In particular, a co-simulation framework for single-core processors has been proposed to reduce the time spent on cycle-accurate RTL verification. However, applying the conventional framework to a multi-core processor causes three problems: the co-simulation fails on mismatches between load and store operations, system call emulation fails because of the cache coherency mechanism, and executing multi-threaded programs requires task scheduling. These problems make verification of a multi-core processor seriously difficult and increase simulation time dramatically. Therefore, this paper proposes a rapid verification framework that supports the execution of multi-threaded programs on multi-core processors. The proposed method can verify both homogeneous and heterogeneous multi-core processors with a cache coherency mechanism and can execute multi-threaded programs without full-system simulation. The proposed framework extends the conventional co-simulation framework for a single-core processor with three components: bypassing loaded values from the verified processor to the virtual processor, a cache access mechanism for system call emulation, and an internal task scheduler. Evaluation results show that our framework correctly verifies a two-core processor. Furthermore, the proposed method reduces the number of execution cycles by up to 71%, and by 46% on average, compared with full-system simulation.
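The load-value bypass can be illustrated with a toy model. In the sketch below, all types and values are hypothetical and not the framework's API; it only shows the reference (virtual) processor adopting the value the verified core actually loaded, so that a benign difference in shared memory between the two simulators does not raise a false mismatch.

```c
/* Hypothetical sketch of the load-value bypass in lock-step co-simulation.
 * The retire record and the example values are assumptions made for
 * illustration, not the paper's data structures. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool     is_load;   /* retired instruction was a load */
    int      rd;        /* destination register number    */
    uint64_t value;     /* value written to rd            */
} retire_t;

/* Compare one retired instruction of the verified core (dut) against the
 * reference model (ref), bypassing the DUT's loaded value first. */
static bool check_retire(retire_t dut, retire_t *ref)
{
    if (dut.is_load)
        ref->value = dut.value;   /* bypass: adopt the DUT's loaded value */
    return dut.rd == ref->rd && dut.value == ref->value;
}

int main(void)
{
    /* A shared-memory load where another core changed memory between the
     * two simulators: without the bypass this would be a false mismatch. */
    retire_t dut = { true, 5, 0xCAFE };
    retire_t ref = { true, 5, 0xBEEF };
    printf("match after bypass: %d\n", check_retire(dut, &ref));
    return 0;
}
```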
Annual ACIS International Conference on Computer and Information Science | 2016
BaoKang Wang; Yuki Fukazawa; Toshio Kondo; Takahiro Sasaki
Ineffective data access to cache memories becomes a bottleneck for efficient 2-dimensional (2-D) data processing such as image processing and matrix multiplication. To solve this problem, we propose a cache memory with both unit-tile and unit-line accessibility based on a 4-level Z-order tiling layout. Conventional raster-scan-order access to this layout is enabled by hardware-based address translation, which eliminates the overhead of address calculation. The proposed cache can access data in parallel in the vertical (unit tile) or horizontal (unit line) direction thanks to the 4-level Z-order tiling layout and a multi-bank cache organization. Unit-tile access, corresponding to parallel data access in the vertical direction, can exploit 2-D locality. Simulation results show that the 4-level Z-order tiling layout yields fewer TLB and L1 data cache misses than the raster-scan-order and Morton-order layouts in matrix multiplication and LU decomposition, especially at larger matrix sizes. An LSI chip of the proposed cache combined with a SIMD-based datapath was designed in a 2.5 × 5 mm² area using 0.18-μm CMOS technology. Under a 3.8-ns clock period, read and write latency was kept to 3 clock cycles, the same as the conventional cache memory of an Intel or ARM high-performance processor.
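For readers unfamiliar with Z-order layouts, the sketch below computes a full Morton index by bit interleaving. It is only an illustration of the ordering idea; the paper's translation is 4-level and done in hardware, which this code does not model.

```c
/* Software sketch of Z-order (Morton) index computation for a 2-D array. */
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of x so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t x)
{
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton index: interleave row and column bits (row bits in odd positions),
 * so spatially adjacent elements get nearby linear indices. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    return (part1by1(row) << 1) | part1by1(col);
}

int main(void)
{
    /* The four elements of a 2x2 tile land on consecutive indices 0..3. */
    printf("%u %u %u %u\n",
           morton_index(0, 0), morton_index(0, 1),
           morton_index(1, 0), morton_index(1, 1));
    return 0;
}
```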
International Journal of Computer and Electrical Engineering | 2016
Takahiro Sasaki; Takaki Okamoto; Seiji Miyoshi; Yuki Fukazawa; Toshio Kondo
Single-ISA heterogeneous multi-core architecture, which consists of diverse superscalar cores, is becoming increasingly important in processor architecture. Using a superscalar core suited to the characteristics of a program helps reduce energy consumption and improve performance. However, designing a heterogeneous multi-core processor requires a large design and verification effort. Therefore, we have proposed FabHetero, which automatically generates diverse heterogeneous multi-core processors using FabScalar, FabCache, and FabBus, which generate various designs of superscalar cores, cache systems, and flexible shared bus systems, respectively. In our previous work, we estimated the physical design, delay, and power consumption only for the L1 instruction cache of a 32-bit processor. However, almost all modern processors have an L1 data cache, and many processors today are 64-bit to achieve high performance. This paper shows the availability and efficiency of FabCache by estimating the overheads in area, delay, and power consumption of both instruction and data cache systems for 32-bit and 64-bit processors. FabCache has good portability because the bus communication system of a generated cache system uses the AMBA4 protocol, which is widely used across architectures to communicate with other designs, and because its connection logic can be parameterized, for example by individual bus width. To demonstrate the effectiveness and portability of FabCache, this paper applies it to FabScalar-alpha, a 64-bit processor, and evaluates the availability and effectiveness of the generated cache system. According to the evaluation results, the cache systems generated by FabCache work correctly, with an area increase of about 1.7%, a delay increase of 0.1 ns, and a power increase of 0.1% compared with a hand-designed cache system. These results show that FabCache can generate reasonable cache systems and has good portability.
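As a rough illustration of what such a generator parameterizes, the sketch below derives the address-slicing widths of a cache from a small configuration record. The configuration values and the struct itself are assumptions for illustration only, not FabCache's implementation or interface.

```c
/* Illustrative derivation of cache address-slicing parameters from a
 * configuration (capacity, line size, associativity, address width). */
#include <stdio.h>

typedef struct {
    unsigned capacity_bytes;  /* total cache capacity */
    unsigned line_bytes;      /* cache line size      */
    unsigned ways;            /* associativity        */
    unsigned addr_bits;       /* 32 or 64             */
} cache_cfg_t;

/* Integer log2 for power-of-two inputs. */
static unsigned log2u(unsigned v)
{
    unsigned n = 0;
    while (v > 1) { v >>= 1; n++; }
    return n;
}

int main(void)
{
    cache_cfg_t cfg = { 32 * 1024, 64, 4, 64 };   /* assumed example config */
    unsigned sets        = cfg.capacity_bytes / (cfg.line_bytes * cfg.ways);
    unsigned offset_bits = log2u(cfg.line_bytes);
    unsigned index_bits  = log2u(sets);
    unsigned tag_bits    = cfg.addr_bits - index_bits - offset_bits;
    printf("sets=%u offset=%u index=%u tag=%u\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```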
International Symposium on Computing and Networking | 2015
Seiji Miyoshi; Takahiro Sasaki; Yuki Fukazawa; Toshio Kondo
Single-ISA heterogeneous multi-core architecture, which is composed of diverse cores, cache systems, and a shared bus system, is a promising technique for achieving higher energy efficiency. However, because each core, cache, and shared bus system of a heterogeneous multi-core processor (HMP) must be designed and verified, the implementation effort is multiplied by the number of kinds of each component. This limits the amount of microarchitectural diversity that can be practically implemented in commercial or research products. To reduce this effort, much research has focused on automatic generation of HMPs. However, generating a shared bus interconnection that supports a cache coherency mechanism is one of the major challenges in implementing an automatic HMP generation system, because a suitable implementation for each cache system is strongly tied to its specification and the number of possible combinations of cache systems is huge. Nevertheless, hand design is not a realistic way to implement an HMP. Therefore, a framework that supports the implementation of a bus interconnection with cache coherency is needed in both research and commercial fields. This paper proposes a framework of snoop-based interconnection for dealing with heterogeneous cache systems and bus interconnection. As the first step in developing and verifying this framework, we implemented an automatic generation system for snoop-based interconnection using FabHetero ported to the ARM AMBA4 and ACE protocols, and verified that the generated system works correctly.
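One piece every snoop-based interconnection has to provide is the per-cache response to snooped bus requests. The sketch below shows a textbook MESI-style snoop rule purely as an illustration of that responsibility; it is not FabHetero's generated logic, and the state and operation names are generic.

```c
/* Textbook MESI snoop response: how a remote cache updates the state of a
 * line when it observes another core's bus request. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { BUS_READ, BUS_READ_EXCLUSIVE } bus_op_t;

/* Returns the new state of the snooped line; *supply_data is set when this
 * cache must put its dirty copy on the bus. */
static mesi_t snoop(mesi_t state, bus_op_t op, int *supply_data)
{
    *supply_data = (state == MODIFIED);
    if (op == BUS_READ_EXCLUSIVE)
        return INVALID;                         /* another core wants to write */
    return (state == INVALID) ? INVALID : SHARED;   /* BusRead: downgrade */
}

int main(void)
{
    int supply;
    mesi_t next = snoop(MODIFIED, BUS_READ, &supply);
    printf("MODIFIED + BusRead -> state %d (supply=%d)\n", next, supply);
    return 0;
}
```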
International Symposium on Computing and Networking | 2015
Hiroaki Kawashima; Takahiro Sasaki; Yuki Fukazawa; Toshio Kondo
A large multi-port register file is an indispensable component for achieving high computing performance, especially in recent processors. However, the number of ports significantly affects circuit scale, access latency, and power consumption. Banked memory is one way to implement a multi-port memory efficiently, but its performance is lower than that of an ideal multi-port memory. To reduce the performance degradation caused by bank conflicts, this paper proposes a register write-back port prediction mechanism. We implement the proposed mechanism in a superscalar processor and evaluate its performance, access latency, circuit scale, and power consumption.
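The bank-conflict problem the predictor targets can be shown in a few lines. In the sketch below, the bank count and the register-to-bank mapping are assumptions: two results written back in the same cycle collide when their destination registers map to the same bank, which is exactly the case the prediction mechanism tries to avoid.

```c
/* Sketch of write-back bank-conflict detection in a banked register file. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_BANKS 4   /* assumed number of register-file banks */

/* Assumed mapping: low bits of the register number select the bank. */
static unsigned bank_of(unsigned reg) { return reg % NUM_BANKS; }

/* True if any two registers written back in this cycle share a bank. */
static bool has_bank_conflict(const unsigned *regs, int n)
{
    bool used[NUM_BANKS] = { false };
    for (int i = 0; i < n; i++) {
        unsigned b = bank_of(regs[i]);
        if (used[b]) return true;
        used[b] = true;
    }
    return false;
}

int main(void)
{
    unsigned wb[] = { 3, 7, 12 };   /* r3 and r7 both map to bank 3 */
    printf("conflict: %d\n", has_bank_conflict(wb, 3));
    return 0;
}
```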
European Test Symposium | 2015
Yuki Fukazawa; Hideyuki Ichihara; Tomoo Inoue
A reliable built-in self-test (reliable BIST) scheme is designed to tolerate faults that occur in the embedded BIST circuits themselves. To realize reliable BIST, the scheme must recover from transient errors in its embedded BIST circuits. In this paper, we propose a self-error-correctable response analyzer (SECRA) for a reliable BIST scheme. Experimental results show that the test reliability of the SECRA is superior to that of TMR MISRs under the assumption that transient faults occur in the response analyzer while the CUTs are being tested.
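For context, the sketch below models a plain multiple-input signature register (MISR), the kind of response compactor the proposed analyzer hardens. The 16-bit feedback polynomial and the sample responses are arbitrary examples, and the self-error-correcting structure itself is not modeled here.

```c
/* Behavioral sketch of a multiple-input signature register (MISR):
 * each clock it shifts, applies the feedback polynomial, and folds in
 * the parallel test response of that cycle. */
#include <stdint.h>
#include <stdio.h>

#define POLY 0x1021u   /* example feedback taps (CRC-16-CCITT polynomial) */

static uint16_t misr_step(uint16_t sig, uint16_t response)
{
    uint16_t fb = (sig & 0x8000u) ? POLY : 0u;
    return (uint16_t)((sig << 1) ^ fb ^ response);
}

int main(void)
{
    uint16_t sig = 0;
    const uint16_t responses[] = { 0x00FF, 0x1234, 0xABCD, 0x0F0F };
    for (unsigned i = 0; i < sizeof responses / sizeof responses[0]; i++)
        sig = misr_step(sig, responses[i]);
    printf("signature = 0x%04X\n", sig);   /* compared against a golden value */
    return 0;
}
```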
Technical Report of IEICE, VLD | 2014
Keita Watanabe; Yuki Minoura; Yuki Fukazawa; Toshio Kondo; Takahiro Sasaki