Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Masahiro Goshima is active.

Publication


Featured research published by Masahiro Goshima.


international symposium on microarchitecture | 2001

A high-speed dynamic instruction scheduling scheme for superscalar processors

Masahiro Goshima; Kengo Nishino; Toshiaki Kitamura; Yasuhiko Nakashima; Shinji Tomita; Shin-ichiro Mori

The wakeup logic is part of the issue window and is responsible for managing the ready flags of the operands for dynamic instruction scheduling. The conventional wakeup logic is based on association and is composed of a RAM and a CAM. Since this logic is not pipelinable and the delays of these memories are dominated by wire delays, the logic will become more critical with deeper pipelines and smaller feature sizes. This paper describes a new scheduling scheme based not on association but on matrices which represent the dependences between instructions. Since the update logic of the matrices detects the dependences between instructions just as the register renaming logic does, the wakeup operation is realized by simply reading the matrices. This paper also describes a technique to reduce the effective size of the matrices at the cost of small IPC penalties. We designed the layouts of the logic under a 0.18µm CMOS design rule provided by Fujitsu Limited and calculated the delays. We also evaluated the penalties by cycle-level simulation. The results show that our scheme achieves a 2.7GHz clock speed with an IPC degradation of about 1%.
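The matrix idea can be illustrated in a few lines of Python. This is a hypothetical behavioral sketch, not the paper's circuit: row i of the dependence matrix marks the producers of instruction i, so wakeup becomes a row read plus an AND-reduction over the issued bits instead of an associative CAM search.

```python
WINDOW = 4  # illustrative issue-window size

def build_matrix(producers):
    """producers[i]: set of window slots instruction i depends on."""
    return [[1 if j in producers[i] else 0 for j in range(WINDOW)]
            for i in range(WINDOW)]

def wakeup(matrix, issued):
    """Slots that are not yet issued but whose producers have all issued."""
    return {i for i in range(WINDOW)
            if not issued[i]
            and all(issued[j] for j in range(WINDOW) if matrix[i][j])}

# i2 depends on i0 and i1; i3 depends on i2.
m = build_matrix({0: set(), 1: set(), 2: {0, 1}, 3: {2}})
print(wakeup(m, [True, True, False, False]))  # {2}: i3 must still wait for i2
```

Note that no tag comparison happens at wakeup time; the dependence detection is done once, at rename time, when the matrix row is written.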


conference on high performance computing (supercomputing) | 1993

A distributed shared memory multiprocessor: ASURA - Memory and cache architectures

Shin-ichiro Mori; Hideki Saito; Masahiro Goshima; Mamoru Yanagihara; Takashi Tanaka; David Fraser; Kazuki Joe; Hiroyuki Nitta; Shinji Tomita

ASURA is a large-scale, cluster-based, distributed-shared-memory multiprocessor being developed at Kyoto University and Kubota Corporation. Up to 128 clusters are interconnected to form an ASURA system of up to 1024 processors. The basic concept of the ASURA design is to take advantage of the hierarchical structure of the system. To implement this concept, a large shared cache is placed between each cluster and the inter-cluster network. The shared cache and the shared memories distributed among the clusters form part of ASURA's hierarchical memory architecture, providing various unique features to ASURA. In this paper, the hierarchical memory architecture of ASURA and its unique cache coherence scheme, including a proposal of a new hierarchical directory scheme, are described with some simulation results.
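The benefit of a hierarchical directory can be sketched abstractly. In this illustrative toy model (not ASURA's actual protocol), the home node tracks which *clusters* share a line and each cluster's shared cache tracks which *processors* within it do, so an invalidation fans out cluster by cluster instead of addressing all 1024 processors individually.

```python
class HierDirectory:
    """Toy two-level sharing directory: clusters at the top, processors below."""

    def __init__(self, procs_per_cluster):
        self.ppc = procs_per_cluster
        self.clusters = {}  # line -> set of sharing cluster ids
        self.procs = {}     # (line, cluster) -> set of sharing processor ids

    def read(self, line, proc):
        """Record that `proc` now shares `line`."""
        c = proc // self.ppc
        self.clusters.setdefault(line, set()).add(c)
        self.procs.setdefault((line, c), set()).add(proc)

    def invalidate(self, line):
        """Return the invalidation messages: one per cluster, then one
        per sharing processor inside each of those clusters."""
        msgs = []
        for c in self.clusters.pop(line, set()):
            msgs.append(("cluster", c))
            for p in self.procs.pop((line, c)):
                msgs.append(("proc", p))
        return msgs

d = HierDirectory(procs_per_cluster=8)
for p in (0, 1, 9):          # sharers in cluster 0 and cluster 1
    d.read("line42", p)
print(len(d.invalidate("line42")))  # 5 messages: 2 clusters + 3 processors
```

Only clusters that actually hold the line receive traffic; the per-processor state never leaves its own cluster.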


Proceedings Innovative Architecture for Future Generation High-Performance Processors and Systems | 1997

The intelligent cache controller of a massively parallel processor JUMP-I

Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita

This paper describes the intelligent cache controller of JUMP-I, a distributed-shared-memory MPP. JUMP-I adopts an off-the-shelf superscalar processor as the element processor to meet the requirement of peak performance, but such a processor lacks the ability to hide inter-processor communication latency, which can easily become very long on MPPs. JUMP-I therefore provides an intelligent memory system to remedy this weakness. The cache controller is one of the main components of the memory system and provides cache-level support for inter-processor communication: explicit cache control, high-bandwidth cache prefetching, and several types of synchronization structures for fine-grained message communication.


international conference on parallel architectures and compilation techniques | 1998

Optimized code generation for heterogeneous computing environment using parallelizing compiler TINPAR

Shin-ya Goto; Atsushi Kubota; Toshihiko Tanaka; Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita

This paper presents a compiling technique to generate optimized code for a heterogeneous computing environment. It also proposes a new dynamic load redistribution mechanism which can adaptively and dynamically distribute tasks among computers according to their available computing power, which may vary during the computation. The performance evaluation confirms that the generated code is executed effectively in a heterogeneous computing environment with dynamic load change.
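The redistribution idea can be sketched as proportional allocation. This is a hypothetical illustration, not TINPAR's actual mechanism: each worker receives a task count proportional to its currently observed computing power, with leftover tasks assigned by largest remainder.

```python
def redistribute(tasks, powers):
    """Split `tasks` among workers in proportion to their current
    available computing power, using largest-remainder rounding."""
    total = sum(powers.values())
    shares = {w: tasks * p / total for w, p in powers.items()}
    alloc = {w: int(s) for w, s in shares.items()}
    # hand the leftover tasks to the workers with the largest remainders
    leftover = tasks - sum(alloc.values())
    ranked = sorted(shares, key=lambda w: shares[w] - alloc[w], reverse=True)
    for w in ranked[:leftover]:
        alloc[w] += 1
    return alloc

# Worker "a" is currently four times as fast as "b".
print(redistribute(100, {"a": 4, "b": 1}))  # {'a': 80, 'b': 20}
```

Re-running this with fresh power measurements as the computation proceeds gives the adaptive, dynamic behavior the abstract describes.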


parallel symbolic computation | 1997

Improvement of message communication in concurrent logic language

Kazuhiko Ohno; Masahiko Ikawa; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita; Masahiro Goshima

In the execution of the concurrent logic language KL1 on message-passing multiprocessors, frequent fine-grained communications cause a drastic inefficiency. We propose an optimization scheme which achieves high granularity of messages by packing data transfers. Using static analysis, we derive the data types which are required by the receiver process. With this information, data of these types are packed into large messages. As a result of evaluation, the number of communications was considerably reduced. This reduces the execution time of programs which have large communication overhead.
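The packing step can be sketched as follows. This is a hypothetical illustration, not the KL1 implementation: given the statically derived set of types the receiver needs, the sender walks the data structure once and packs every matching datum into a single large message, rather than sending each small datum on its own (the sketch assumes an acyclic structure).

```python
def pack(root, needed_types):
    """Depth-first walk that packs every reachable datum whose type name
    is in needed_types; the result is shipped as one large message."""
    packed, stack = [], [root]
    while stack:
        node = stack.pop()
        if type(node).__name__ in needed_types:
            packed.append(node)
        if isinstance(node, (list, tuple)):
            stack.extend(node)
    return packed

# The receiver is statically known to need ints, so all three are packed
# into one message instead of three.
print(sorted(pack([1, ["x", 2], (3,)], {"int"})))  # [1, 2, 3]
```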


Lecture Notes in Computer Science | 1997

Efficient Goal Scheduling in Concurrent Logic Language using Type-Based Dependency Analysis

Kazuhiko Ohno; Masahiko Ikawa; Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita

In the execution model of concurrent logic languages like KL1, each goal is regarded as a unit of concurrent execution. Although this fine-grained concurrency control enables flexible concurrent/parallel programming, its overhead also causes inefficiency in its implementation. We propose an efficient goal scheduling scheme using the result of static analysis. In this scheme, we obtain precise dependency relations among goals using type-based dependency analysis. Then each set of goals without concurrency is compiled into one thread, a sequence of statically ordered goals, to reduce the overhead of goal scheduling. Since stacks are used to hold goal environments for each thread, the number of garbage collections is also reduced. The result of preliminary evaluation shows that our scheme considerably reduces goal scheduling overhead and thus achieves a 1.3–3 times speedup.
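The grouping step can be illustrated on an abstract dependency graph. This is a hypothetical sketch, not the KL1 compiler itself: goals that the dependency analysis links are fused into one thread and statically ordered, while mutually independent goal sets stay separate threads.

```python
from collections import defaultdict

def compile_threads(goals, deps):
    """deps: set of (producer, consumer) goal pairs. Returns a list of
    threads, each a dependency-ordered list from one connected component."""
    # Union-find over the dependency graph's connected components.
    parent = {g: g for g in goals}
    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g
    for a, b in deps:
        parent[find(a)] = find(b)

    # Static order within each component (Kahn's topological sort).
    indeg = {g: 0 for g in goals}
    succ = defaultdict(list)
    for a, b in deps:
        indeg[b] += 1
        succ[a].append(b)
    order, ready = [], [g for g in goals if indeg[g] == 0]
    while ready:
        g = ready.pop()
        order.append(g)
        for s in succ[g]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    threads = defaultdict(list)
    for g in order:
        threads[find(g)].append(g)
    return list(threads.values())

# a -> b -> c form one thread; d is independent and stays its own thread.
print(compile_threads(["a", "b", "c", "d"], {("a", "b"), ("b", "c")}))
```

The scheduler then dispatches whole threads, so the per-goal scheduling cost is paid once per thread rather than once per goal.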


ieee international conference on high performance computing data and analytics | 1997

A Technique to Eliminate Redundant Inter-Processor Communication on Parallelizing Compiler TINPAR

Atsushi Kubota; Shogo Tatsumi; Toshihiko Tanaka; Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita

Optimizing inter-processor (PE) communication is crucial for parallelizing compilers for message-passing parallel machines to achieve high performance. In this paper, we propose a technique to eliminate redundant inter-PE messages. This technique uses a data-flow analysis to find a definition point that corresponds to a use point where the definition and the use occur in different PEs. If several read accesses in the same PE use the data defined at the same definition point in another PE, the redundant inter-PE messages are eliminated as follows: only one inter-PE communication is performed for the earliest read access, and the previously received data are used for the following read accesses. To guarantee the consistency of the data, a valid flag and a sent flag are provided for each chunk of received data. The control of these flags is equivalent to the coherence control by self-invalidation in a compiler-aided cache coherence scheme.
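The receiver side of the flag scheme can be sketched as follows. This is a hypothetical illustration of only the valid-flag half (the sender-side sent flag is omitted): a read that finds the valid bit set reuses the buffered chunk, and a redefinition by the producing PE clears the bit (self-invalidation), so only the first of several reads costs a real inter-PE message.

```python
class ChunkCache:
    """Toy receiver-side buffer with a per-chunk valid flag."""

    def __init__(self):
        self.data, self.valid = {}, {}
        self.messages = 0  # inter-PE transfers actually performed

    def read(self, chunk, fetch):
        if not self.valid.get(chunk):
            self.data[chunk] = fetch(chunk)  # one real inter-PE message
            self.valid[chunk] = True
            self.messages += 1
        return self.data[chunk]

    def invalidate(self, chunk):
        """Called when the producing PE redefines the chunk."""
        self.valid[chunk] = False

cache = ChunkCache()
fetch = lambda chunk: f"data:{chunk}"
for _ in range(3):
    cache.read("A", fetch)      # only the first read sends a message
cache.invalidate("A")           # producer redefined the chunk
cache.read("A", fetch)          # must re-fetch
print(cache.messages)  # 2
```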


Genome Informatics | 1993

Application of Parallelized DP and A* Algorithm to Multiple Sequence Alignment

Shiho Araki; Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita; Yutaka Akiyama; Minoru Kanehisa


parallel and distributed processing techniques and applications | 2004

Implementation of Cell-Projection Parallel Volume Rendering with Dynamic Load Balancing.

Motohiro Takayama; Yuki Shinomoto; Masahiro Goshima; Shin-ichiro Mori; Yasuhiko Nakashima; Shinji Tomita


IPSJ Transactions on High Performance Computing Systems (情報処理学会論文誌. ハイパフォーマンスコンピューティングシステム) | 2002

Priority Enhanced Stride Scheduling

Damien Le Moal; Masahiro Ikumo; Tomoaki Tsumura; Masahiro Goshima; Shin-ichiro Mori; Yasuhiko Nakashima; Toshiaki Kitamura; Shinji Tomita

Collaboration


Dive into Masahiro Goshima's collaboration.
