Masahiko Iwane
Kyushu Institute of Technology
Publications
Featured research published by Masahiko Iwane.
International Symposium on Parallel Architectures, Algorithms and Networks | 2005
Akira Yamawaki; Masahiko Iwane
A chip multiprocessor is a promising architecture for overcoming the ILP limitations, high power consumption, and heat dissipation problems that current processors face. On a shared-memory multiprocessor, performance improvement relies on efficient communication and synchronization through shared variables. The TSVM cache combines communication and synchronization with coherence maintenance on a chip multiprocessor: communication and synchronization via shared variables are realized by a single coherence transaction over a high-speed on-chip interconnect. The TSVM cache provides several instructions, each with its own coherence maintenance scheme. Combinations of these instructions realize producer-consumer synchronization, mutual exclusion, and barrier synchronization with communication easily and systematically. This paper describes how these instructions construct three primitives and shows their effect using a clock-cycle-accurate simulator written in VHDL. The results show that the TSVM cache improves performance by a factor of 9.8 over a traditional cache memory, and by a factor of 2 over a conventional cache memory with a synchronization mechanism.
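As a rough software analogue of the producer-consumer primitive that the TSVM instructions realize through a single coherence transaction, the sketch below uses C11 atomics and POSIX threads; the variable names and the spin-wait structure are editorial assumptions, not the paper's instruction set.

/* Software analogue (assumption, not the TSVM instructions): the consumer
 * spins until the shared variable becomes "valid", reads it, then marks it
 * "invalid" again; the producer does the reverse. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int valid = 0;   /* hardware analogue: tag state in the TSVM cache */
static int shared_value;       /* hardware analogue: the tagged shared variable  */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++) {
        while (atomic_load(&valid))        /* wait until consumer has read     */
            ;
        shared_value = i;                  /* write the shared variable        */
        atomic_store(&valid, 1);           /* publish: one "store-and-signal"  */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++) {
        while (!atomic_load(&valid))       /* wait until producer has written  */
            ;
        printf("consumed %d\n", shared_value);
        atomic_store(&valid, 0);           /* consume: one "load-and-clear"    */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

On the TSVM cache, the wait, the data transfer, and the state update above would be folded into the cache coherence protocol rather than being spelled out in software.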
International Conference on Parallel and Distributed Systems | 2007
Akira Yamawaki; Masahiko Iwane
An FPGA-based multiprocessor SoC (MPSoC) is a fully programmable on-chip multiprocessor that can reduce development cost while meeting performance requirements. To provide an MPSoC with low-overhead communication and synchronization methods, this paper introduces the TSVM (tagged shared variable memory) cache into a snooping cache on the MPSoC. The TSVM cache improves performance by combining communication and synchronization with coherence maintenance. Using an FPGA, we evaluate how extending a conventional snooping cache affects circuit area and clock speed. The increase in hardware and the degradation of clock speed are only 5% and 2%, respectively. It is also confirmed that the TSVM cache significantly improves performance and energy efficiency by stalling during synchronization.
International Symposium on Communications and Information Technologies | 2004
Kosei Shimoo; Akira Yamawaki; Masahiko Iwane
This work proposes a new parallel processing paradigm using a reconfigurable computing system. In the proposed system, an application program is statically parallelized into threads. These are translated into hardware threads, which are then executed directly on a reconfigurable computing system. The hardware threads cooperate through high-speed synchronization using tags on the reconfigurable device. To evaluate the effect of parallel processing and the performance impact of synchronization, we performed preliminary experiments with several application programs on a real machine. The results show that extracting parallelism improves performance by a factor of 1.37 on average. The low-overhead communication and synchronization on the reconfigurable device can achieve good performance improvement in parallel processing.
Systems and Computers in Japan | 2001
Masahiko Iwane; Akira Yamawaki; Makoto Tanaka
The multiprocessor-on-a-chip is becoming feasible thanks to progress in semiconductor technology. In such a multiprocessor, threaded parallel processing requires reducing the overhead of interthread communication and synchronization. We propose the tagged communication and synchronization memory with an access counter using CAM (TCSM), which supports high-speed mechanisms for mutual exclusion, condition synchronization, barrier synchronization, and multicasting between threads. The TCSM allocates its entries dynamically and ensures producer-consumer synchronization through the valid/invalid state of the access count. The execution environment of the threads is protected because the TCSM tag identifies both the task to which the threads belong and the storage used by the threads. The MTA/TCSM multichip multiprocessor system has been developed to evaluate a multiprocessor-on-a-chip including the TCSM. The evaluation on MTA/TCSM shows that the overhead of interthread synchronization and communication using the TCSM is lower than with conventional shared memory.
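As an illustration of the kind of counter-based barrier the TCSM access counter supports in hardware, the following is a minimal sense-reversing barrier in C11 atomics. This is a software sketch under editorial assumptions, not the TCSM circuit; the names count and sense are illustrative.

/* Counter-based barrier sketch: N threads atomically decrement a shared
 * count and spin on a "sense" flag that flips when the last thread arrives. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static atomic_int count = NTHREADS;  /* analogue of the per-entry access count */
static atomic_int sense = 0;         /* flips each time the barrier completes  */

static void barrier_wait(void)
{
    int my_sense = !atomic_load(&sense);
    if (atomic_fetch_sub(&count, 1) == 1) {      /* last thread to arrive */
        atomic_store(&count, NTHREADS);          /* reset for reuse       */
        atomic_store(&sense, my_sense);          /* release the others    */
    } else {
        while (atomic_load(&sense) != my_sense)  /* spin until released   */
            ;
    }
}

static void *worker(void *arg)
{
    long id = (long)arg;
    for (int phase = 0; phase < 3; phase++) {
        printf("thread %ld finished phase %d\n", id, phase);
        barrier_wait();                          /* no thread starts the  */
    }                                            /* next phase early      */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}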
International Conference on Parallel and Distributed Systems | 2002
Akira Yamawaki; Masahiko Iwane
The TSVM is a logically structured memory with synchronization support that improves performance in multithreaded parallel processing. The physical TSVM is realized by the TSVM cache (TC) and a conventional memory in a multiprocessor-on-a-chip (MOC) system. The L1 cache of a CPU consists of the TC, the general variable cache (GVC), and the instruction cache. The IYA (IY architecture) is a new architecture that divides a conventional data cache into the TC and the GVC: the TC caches shared variables with synchronization, and the GVC caches other general variables. Independent of the CPU core, a MOC with the IYA can systematically exploit parallelism from the instruction and statement levels up to the thread level. To estimate the effect of the TC, preliminary experiments were performed on a multichip multiprocessor including a stand-alone TSVM. The results show that the TSVM cache improves performance.
International Conference on Computational Science and Its Applications | 2010
Akira Yamawaki; Seiichi Serikawa; Masahiko Iwane
To improve the performance and power consumption of a system-on-chip (SoC), software processes are often converted to hardware. However, to extract as much performance from the hardware as possible, memory access must be improved. In addition, the hardware development period has to be reduced because SoC life cycles are typically short. This paper proposes a design-level hardware architecture (semi-programmable hardware: SPHW) that is inserted into the path from C to hardware. On the SPHW, memory accesses and buffers are realized by software programming and by parameters, respectively. Using the SPHW, designers can easily develop data processing hardware containing an efficient memory access controller at the C level of abstraction. Compared with conventional approaches, the SPHW can reduce the development time significantly. The experimental results also show that the SPHW can be employed in the final product if the memory access latency is sufficiently hidden.
Asia Pacific Conference on Circuits and Systems | 2008
Akira Yamawaki; Kazuharu Morita; Masahiko Iwane
The discrete wavelet transform (DWT) with the 5/3 filter of JPEG 2000 exhibits different memory access patterns depending on the decomposition level. To speed up the computation through hardware implementation, hiding memory access latency is important for performance. The semi-programmable hardware (SPHW), as an intermediate hardware layer, provides an efficient method for designing hardware with data prefetching that hides memory access latency across these different access patterns. This paper describes the SPHW and demonstrates a mapping method for the DWT. The experimental results show that the SPHW can significantly reduce the design burden while achieving good performance.
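For reference, the reversible 5/3 lifting transform itself is a public part of JPEG 2000; a plain one-dimensional C sketch is shown below to make the access pattern concrete. The paper's hardware mapping, two-dimensional handling, and multi-level decomposition are not reproduced here, and the boundary handling simply mirrors samples.

/* One 1-D level of the reversible 5/3 lifting DWT (predict + update). */
#include <stdio.h>

/* Symmetric (mirror) extension for out-of-range indices. */
static int ext(const int *x, int n, int i)
{
    if (i < 0)  return x[-i];
    if (i >= n) return x[2 * n - 2 - i];
    return x[i];
}

/* Forward 5/3 DWT: x[0..n-1] -> low[] (approximation), high[] (detail). */
static void dwt53(const int *x, int n, int *low, int *high)
{
    int nh = n / 2;      /* number of high-pass coefficients */
    int nl = n - nh;     /* number of low-pass coefficients  */

    /* Predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2) */
    for (int i = 0; i < nh; i++)
        high[i] = ext(x, n, 2 * i + 1)
                - ((ext(x, n, 2 * i) + ext(x, n, 2 * i + 2)) >> 1);

    /* Update: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4) */
    for (int i = 0; i < nl; i++) {
        int dm = (i > 0)  ? high[i - 1] : high[0];       /* mirrored */
        int dp = (i < nh) ? high[i]     : high[nh - 1];  /* mirrored */
        low[i] = ext(x, n, 2 * i) + ((dm + dp + 2) >> 2);
    }
}

int main(void)
{
    int x[8] = { 10, 12, 14, 13, 9, 8, 11, 15 };
    int low[4], high[4];
    dwt53(x, 8, low, high);
    for (int i = 0; i < 4; i++)
        printf("low[%d]=%d high[%d]=%d\n", i, low[i], i, high[i]);
    return 0;
}

The predict step reads odd samples and their even neighbors, while the update step reads even samples and two detail coefficients, which is what gives each decomposition level its own memory access pattern.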
Asia Pacific Conference on Circuits and Systems | 2004
K. Shimoo; Akira Yamawaki; Masahiko Iwane
This paper describes a hardware design method that extracts parallelism from a C program. The effect of extracting parallelism on hardware design is clarified through experiments with several programs. The results show that extracting parallelism improves performance by a factor of 2.08 on average compared with hardware that does not employ parallelization.
Field Programmable Gate Arrays | 2009
Akira Yamawaki; Masahiko Iwane
We propose the semi-programmable hardware (SPHW) as an intermediate hardware layer that can be used by designers and by high-level synthesis tools converting C programs to FPGAs. The SPHW consists of a load/store unit (LSU), a reconfigurable register file (RRF), and an execution unit (EXU). The LSU executes load/store instructions and transfers data between the memory and the RRF. The hardware designed to accelerate the computation is implemented on the EXU, a reconfigurable hardware unit that processes the data on the RRF. The LSU flexibly performs complex memory accesses and buffering under program control, so the EXU can uniformly process sequential data on the RRF. Since the EXU runs in parallel with the LSU, memory access can be overlapped with data processing. In addition, the SPHW has a synchronization mechanism that supports execution of multiple hardware threads on the EXU. Using the SPHW, C programs can easily be converted to hardware modules with data prefetching mechanisms. Experiments are performed using several application programs that exhibit different memory access patterns. Compared with attaching a custom data prefetching circuit instead of the LSU, the SPHW significantly reduces design cost while achieving comparable performance.
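The overlap of memory access and computation that the LSU/EXU split provides can be pictured as a double-buffered loop. The sequential C sketch below is an editorial analogy with made-up names (memory, rrf, process), not SPHW code; on the SPHW the two commented steps run concurrently rather than back to back.

/* Double-buffering sketch: while one RRF bank is being processed,
 * the next block is staged into the other bank. */
#include <stdio.h>
#include <string.h>

#define BLOCK   4
#define NBLOCKS 6

/* Stand-in for the EXU: a trivial computation over one staged block. */
static long process(const int *buf, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

int main(void)
{
    int memory[BLOCK * NBLOCKS];          /* stand-in for external memory   */
    int rrf[2][BLOCK];                    /* stand-in for the two RRF banks */
    long total = 0;

    for (int i = 0; i < BLOCK * NBLOCKS; i++)
        memory[i] = i;

    /* "LSU": stage the first block before computation starts. */
    memcpy(rrf[0], &memory[0], sizeof(rrf[0]));

    for (int b = 0; b < NBLOCKS; b++) {
        int cur = b & 1, nxt = cur ^ 1;

        /* "LSU": prefetch block b+1 into the other bank (concurrent on SPHW). */
        if (b + 1 < NBLOCKS)
            memcpy(rrf[nxt], &memory[(b + 1) * BLOCK], sizeof(rrf[nxt]));

        /* "EXU": process block b from the bank staged previously. */
        total += process(rrf[cur], BLOCK);
    }

    printf("total = %ld\n", total);
    return 0;
}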
Field-Programmable Technology | 2007
Akira Yamawaki; Masahiko Iwane
This paper proposes introducing a programmable load/store unit (LSU) into C-based hardware design for FPGAs. The LSU provides flexible memory access methods that can hide memory access latency for hardware modules generated by a high-level synthesis tool. A hardware module with the LSU can efficiently handle not only simple streaming accesses but also sophisticated accesses such as those in image processing. The LSU is evaluated using two case studies. The results show that the LSU can significantly reduce the burden of designing dedicated memory access circuits, and that hardware modules with the LSU achieve speedups of 16.5 and 27.3 times compared with an embedded processor.
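To make the contrast between simple streaming accesses and sophisticated image-processing accesses concrete, the following C sketch places a row-order streaming read next to a two-dimensional block gather; the function names and data layout are illustrative assumptions, not the paper's LSU program.

/* Streaming vs. 2-D block access over a row-major image. */
#include <stdio.h>

#define W 8
#define H 6

/* Streaming access: consume the image row by row, in address order. */
static long stream_sum(unsigned char img[H][W])
{
    long s = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            s += img[y][x];
    return s;
}

/* Block access: gather a wy-by-wx window at (y0, x0) into a contiguous
 * buffer, the staging step a latency-hiding LSU would perform for the
 * processing hardware. */
static void gather_block(unsigned char img[H][W], int y0, int x0,
                         int wy, int wx, unsigned char *out)
{
    for (int y = 0; y < wy; y++)
        for (int x = 0; x < wx; x++)
            *out++ = img[y0 + y][x0 + x];
}

int main(void)
{
    unsigned char img[H][W];
    unsigned char block[3 * 3];

    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            img[y][x] = (unsigned char)(y * W + x);

    printf("stream sum = %ld\n", stream_sum(img));
    gather_block(img, 2, 3, 3, 3, block);
    for (int i = 0; i < 9; i++)
        printf("%d ", block[i]);
    printf("\n");
    return 0;
}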