Publication


Featured research published by Kazuhiko Ohno.


International Parallel and Distributed Processing Symposium | 2000

A mechanism for speculative memory accesses following synchronizing operations

Takayuki Sato; Kazuhiko Ohno; Hiroshi Nakashima

In order to reduce the overhead of synchronizing operations on shared memory multiprocessors, this paper proposes a mechanism, named specMEM, that executes memory accesses following a synchronizing operation speculatively, before the completion of the synchronization is confirmed. A unique feature of our mechanism is that the detection of speculation failure and the restoration of the computational state on failure are implemented by a small extension of the coherent cache. It is also remarkable that the operations performed on speculation success and failure each take constant time, independent of the number of speculative accesses. This is realized by implementing the part of the cache tag holding the cache line state with a simple functional memory. This paper also describes an evaluation of specMEM applied to barrier synchronization. Performance data was obtained by simulating benchmark programs from SPLASH-2. We found that the execution time of LU decomposition, in which the length of the period between a pair of barriers varies significantly because of fluctuations in computational load, is improved by 13%.
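
The paper describes a hardware mechanism, so no code appears in the abstract; the C sketch below only illustrates the idea behind specMEM's constant-time commit and abort: speculatively written lines are marked, and a single flash operation over the mark bits (modeled here by a loop) commits or rolls back all of them at once. The names and the one-backup-value-per-line simplification are assumptions for illustration, not the authors' design.

#include <stdbool.h>
#include <stdio.h>

#define LINES 8

/* Hypothetical cache-line model: each line carries a "speculative"
 * flag; the abstract says the corresponding state bits live in a
 * functional memory so they can all be cleared at once. */
typedef struct {
    int  data;
    int  backup;       /* old value, restored on speculation failure */
    bool speculative;  /* set by speculative writes past a barrier   */
} Line;

static Line cache[LINES];

/* Speculative store: save the old value once, then write. */
static void spec_store(int idx, int value) {
    if (!cache[idx].speculative) {
        cache[idx].backup = cache[idx].data;
        cache[idx].speculative = true;
    }
    cache[idx].data = value;
}

/* Commit: clear all speculative flags; the loop stands in for a
 * single hardware flash-clear whose cost does not depend on how
 * many lines were touched. */
static void spec_commit(void) {
    for (int i = 0; i < LINES; i++)
        cache[i].speculative = false;
}

/* Abort: restore saved values and clear flags, again conceptually
 * one flash operation over the speculative-state bits. */
static void spec_abort(void) {
    for (int i = 0; i < LINES; i++)
        if (cache[i].speculative) {
            cache[i].data = cache[i].backup;
            cache[i].speculative = false;
        }
}

int main(void) {
    cache[0].data = 10;
    spec_store(0, 42);             /* speculate past a barrier */
    spec_abort();                  /* synchronization failed   */
    printf("%d\n", cache[0].data); /* prints 10 */
    spec_store(0, 42);
    spec_commit();                 /* synchronization succeeded */
    printf("%d\n", cache[0].data); /* prints 42 */
    return 0;
}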


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2002

Shaman: a distributed simulator for shared memory multiprocessors

Haruyuki Matsuo; Shigeru Imafuku; Kazuhiko Ohno; Hiroshi Nakashima

The paper describes Shaman, our distributed architectural simulator for shared memory multiprocessors (SMPs). The simulator runs on a PC cluster consisting of multiple front-end nodes, which simulate the instruction-level behavior of a target multiprocessor in parallel, and a back-end node, which simulates the target memory system. The front-end also simulates the logical behavior of the shared memory using a software DSM (distributed shared memory) technique and generates memory references to drive the back-end. A remarkable feature of our simulator is reference filtering, which reduces the number of references transferred from the front-end to the back-end by exploiting the DSM mechanism and a coherent cache simulation on the front-end. This technique and our sophisticated DSM implementation give Shaman extraordinary performance: we achieved 335 million and 392 million simulation clocks per second for LU decomposition and FFT from the SPLASH-2 kernel benchmarks, respectively, using 16 front-end nodes to simulate a 16-way target SMP.


Annual Simulation Symposium | 2001

Reference filtering for distributed simulation of shared memory multiprocessors

Shigeru Imafuku; Kazuhiko Ohno; Hiroshi Nakashima

This paper proposes a method to reduce the number of memory references generated by the front-end of Shaman, our distributed execution-driven simulator for shared memory multiprocessors. The simulator consists of a front-end that executes programs in parallel and a back-end, driven by the memory references from the front-end, that simulates the behavior of the target multiprocessor's memory system for high-performance simulation. The front-end runs on a PC cluster using a software DSM technique and partially simulates the coherent cache of the target system. The key idea of the reference reduction is to use the caches in the front-end as a filter on the references. We prove that filtering for a memory block is safe if the block as a whole is accessed in a data-race-free manner. We also show a method to detect racing blocks and deactivate the filtering for them. A preliminary experiment with SPLASH-2 kernels shows that up to 99.6% of references are filtered out and that redundant references amount to less than 1.4%.
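
As a rough illustration of the cache-as-filter idea, here is a minimal C sketch assuming a direct-mapped front-end cache: loads that hit are dropped, and only misses are forwarded to the back-end. The names are hypothetical, and the race detection that the paper proves necessary for safety is omitted.

#include <stdbool.h>
#include <stdio.h>

#define SETS       256
#define BLOCK_BITS 6     /* 64-byte blocks */

typedef struct { bool valid; unsigned long tag; } CacheLine;
static CacheLine fe_cache[SETS];   /* front-end filter cache */

static unsigned forwarded = 0, filtered = 0;

/* Hypothetical hook: forward a reference to the back-end simulator. */
static void send_to_backend(unsigned long addr) {
    forwarded++;
    (void)addr;  /* a real simulator would enqueue this reference */
}

/* Front-end load: loosely following the paper's safety argument, a
 * hit on a data-race-free block cannot change the target cache
 * state, so it may be dropped; only misses drive the back-end. */
static void fe_load(unsigned long addr) {
    unsigned long blk = addr >> BLOCK_BITS;
    unsigned long set = blk % SETS, tag = blk / SETS;
    if (fe_cache[set].valid && fe_cache[set].tag == tag) {
        filtered++;                 /* hit: filtered out          */
    } else {
        fe_cache[set].valid = true; /* miss: update and forward   */
        fe_cache[set].tag = tag;
        send_to_backend(addr);
    }
}

int main(void) {
    for (int pass = 0; pass < 100; pass++)
        for (unsigned long a = 0; a < 4096; a += 8)
            fe_load(a);
    printf("forwarded=%u filtered=%u\n", forwarded, filtered);
    return 0;
}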


Parallel and Distributed Computing and Systems | 2012

Supporting Dynamic Data Structures in a Shared-Memory Based GPGPU Programming Framework

Kazuhiko Ohno; Masaki Matsumoto; Tomoharu Kamiya; Takanori Maruyama

Although the Graphics Processing Unit (GPU) is expected to be a practical high performance computing platform, current programming frameworks such as CUDA and OpenCL impose a large programming cost. Therefore, we are developing a new framework, MESI-CUDA, which provides shared variables to hide CUDA's low-level data management. However, handling dynamic data structures is difficult in the current MESI-CUDA because shared variables cannot be created dynamically and pointer fields are not allowed in them. Thus, we extended MESI-CUDA to remove these restrictions. By introducing dynamic management of shared variables and automatic pointer conversion on data transfer, any pointer-based dynamic data structure can be shared between the CPU and GPU with only small changes to the C code. In our evaluation, pointer conversion increased the transfer time of data structures by a factor of approximately 3.3 in the worst case and 1.3–2 in practical cases. Considering that alternatives without conversion incur overhead on every pointer dereference, we regard this overhead as acceptable in most cases.
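
MESI-CUDA's actual conversion machinery is not shown in the abstract; the plain-C sketch below (runnable without CUDA) illustrates the general pointer-conversion idea: when a linked structure is copied to another address space, every pointer field is rewritten to the corresponding address in the copy. The function names and the second host buffer standing in for device memory are assumptions.

#include <stdio.h>
#include <string.h>

/* A pointer-based dynamic structure shared between CPU and GPU. */
typedef struct Node { int value; struct Node *next; } Node;

/* Hypothetical translation of one host pointer to its device copy.
 * Here "device memory" is just a second host buffer, so the sketch
 * runs anywhere; a real framework would map to device addresses. */
static Node *translate(Node *host, Node *host_base,
                       Node *dev_base, size_t n) {
    if (host == NULL) return NULL;
    size_t idx = (size_t)(host - host_base);
    return idx < n ? dev_base + idx : NULL;
}

/* Copy a pool-allocated list and convert every pointer field; the
 * per-pointer conversion cost measured in the paper's evaluation
 * corresponds to this loop. */
static void transfer_with_conversion(Node *host_base, Node *dev_base,
                                     size_t n) {
    memcpy(dev_base, host_base, n * sizeof(Node));
    for (size_t i = 0; i < n; i++)
        dev_base[i].next =
            translate(host_base[i].next, host_base, dev_base, n);
}

int main(void) {
    enum { N = 4 };
    Node host[N], dev[N];
    for (int i = 0; i < N; i++) {
        host[i].value = i;
        host[i].next = (i + 1 < N) ? &host[i + 1] : NULL;
    }
    transfer_with_conversion(host, dev, N);
    for (Node *p = dev; p; p = p->next)  /* walks the converted copy */
        printf("%d ", p->value);
    printf("\n");
    return 0;
}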


International Journal of Computer and Electrical Engineering | 2013

Reducing Dynamic Energy of Variable Level Cache

Ko Watanabe; Takahiro Sasaki; Tomoyuki Nakabayashi; Kazuhiko Ohno; Toshio Kondo

Today, processors are required to be both high-performance and low-power. Even embedded processors implement large caches to achieve high performance, and the leakage energy of such large caches keeps increasing as semiconductor devices scale down. Therefore, reducing cache leakage energy is very important. To this end, we propose a variable level cache (VLC), which reduces leakage energy with little performance degradation. Generally, the required cache size depends on program behavior. VLC dynamically estimates whether the current cache size is suitable for the running program and, if not, modifies its cache structure and hierarchy to change the cache capacity. Since changing the cache configuration and hierarchy incurs a large overhead, VLC adopts a low-overhead technique that reduces the hierarchy-changing cost to prevent performance degradation. However, the previous VLC increases dynamic energy consumption because this technique requires many futile accesses. To solve this problem, this paper proposes a novel technique that reduces the number of futile accesses and thus the dynamic energy consumption. According to our simulation results, the proposed VLC technique reduces dynamic energy by 18% compared with the previous VLC, without performance degradation.


International Symposium on Information and Automation | 2010

Evaluation of Variable Level Cache

Nobuyuki Matsubara; Takahiro Sasaki; Kazuhiko Ohno; Toshio Kondo

This paper proposes a variable level cache (VLC) to reduce the leakage power of cache memory. Generally, the required cache size depends on the program and its data set. VLC therefore estimates the required cache size dynamically and, if the required size is small, divides the cache memory in half: the upper half functions as a normal cache, while the lower half shifts into stand-by mode to reduce leakage current and serves as a lower-level cache. According to our simulation results, VLC improves the energy-delay product by approximately 36% over the conventional approach.
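
The exact resizing policy is not given in the abstract; the C sketch below shows one plausible shape of the VLC decision logic: interval hit/miss counters drive a switch between full capacity and a halved cache whose lower half goes into stand-by and serves as a lower level. The thresholds and names are assumptions, not the authors' parameters.

#include <stdbool.h>
#include <stdio.h>

#define FULL_SETS 1024

/* Hypothetical VLC controller state: the cache runs either at full
 * size, or halved with the lower half in low-leakage stand-by mode
 * acting as a lower-level cache. */
typedef struct {
    bool halved;
    unsigned hits, misses;       /* counters for the current interval */
} VLC;

/* Periodic decision, invoked every N accesses: if the miss ratio is
 * low, half the capacity is presumably enough, so shrink; if it is
 * high, restore the full size. Thresholds are assumptions. */
static void vlc_adjust(VLC *c) {
    unsigned total = c->hits + c->misses;
    if (total == 0) return;
    double miss_ratio = (double)c->misses / total;
    if (!c->halved && miss_ratio < 0.01) {
        c->halved = true;        /* lower half -> stand-by lower level */
    } else if (c->halved && miss_ratio > 0.05) {
        c->halved = false;       /* wake lower half, full size again   */
    }
    c->hits = c->misses = 0;     /* start a new interval */
}

int main(void) {
    VLC c = { false, 0, 0 };
    c.hits = 990; c.misses = 5;  /* small working set, few misses */
    vlc_adjust(&c);
    printf("sets in use: %d\n", c.halved ? FULL_SETS / 2 : FULL_SETS);
    return 0;
}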


IEEE International Conference on High Performance Computing, Data, and Analytics | 2000

Orgel: A Parallel Programming Language with Declarative Communication Streams

Kazuhiko Ohno; Shigehiro Yamamoto; Takanori Okano; Hiroshi Nakashima

Because of irregular and dynamic data structures, parallel programs in non-numerical fields often require asynchronous communication with an unpredictable number of messages. Such programs are hard to write using MPI or Pthreads, and many new parallel languages designed to hide messages under the runtime system suffer from execution overhead. Thus, we propose a parallel programming language, Orgel, that enables concise and efficient programming. An Orgel program is a set of agents connected by abstract channels called streams. The stream connections and messages are specified declaratively, which prevents parallelization bugs and also enables effective optimization. The computation in each agent is described in an ordinary sequential language, so efficient execution is possible. Our evaluation shows that the overheads of concurrent switching and communication in Orgel are only 1.2 and 4.3 times those of Pthreads, respectively. In parallel execution, we obtained a 6.5–10x speedup with 11–13 processors.
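
Orgel's syntax is not shown in the abstract; as a rough analogue in C, the sketch below models agents as threads and a stream as a message queue whose single connection is wired up once in main(), echoing the declarative wiring the abstract describes. All names are hypothetical.

#include <pthread.h>
#include <stdio.h>

/* A stream in the Orgel sense: a declared, fixed connection carrying
 * asynchronous messages between two agents; here, a bounded queue. */
typedef struct {
    int buf[64];
    int head, tail;
    pthread_mutex_t mu;
    pthread_cond_t cv;
} Stream;

static void stream_send(Stream *s, int msg) {
    pthread_mutex_lock(&s->mu);
    s->buf[s->tail++ % 64] = msg;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

static int stream_recv(Stream *s) {
    pthread_mutex_lock(&s->mu);
    while (s->head == s->tail)          /* wait while empty */
        pthread_cond_wait(&s->cv, &s->mu);
    int msg = s->buf[s->head++ % 64];
    pthread_mutex_unlock(&s->mu);
    return msg;
}

/* Agents: ordinary sequential bodies communicating only via streams. */
static void *producer(void *arg) {
    Stream *out = arg;
    for (int i = 0; i < 5; i++) stream_send(out, i * i);
    stream_send(out, -1);               /* end-of-stream marker */
    return NULL;
}

static void *consumer(void *arg) {
    Stream *in = arg;
    for (int m; (m = stream_recv(in)) != -1; )
        printf("got %d\n", m);
    return NULL;
}

int main(void) {
    Stream s = { .head = 0, .tail = 0,
                 .mu = PTHREAD_MUTEX_INITIALIZER,
                 .cv = PTHREAD_COND_INITIALIZER };
    pthread_t p, c;                     /* the one declared connection */
    pthread_create(&p, NULL, producer, &s);
    pthread_create(&c, NULL, consumer, &s);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}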


Atmospheric Environment | 1999

Improved implementations of the speculative memory access mechanism specMEM

Hiroshi Nakashima; Takayuki Sato; Haruyuki Matsuo; Kazuhiko Ohno

In order to reduce the overhead of synchronizing operations on shared memory multiprocessors, we previously proposed a mechanism, named specMEM, to execute memory accesses following a synchronizing operation speculatively, before the completion of the synchronization is confirmed. A unique feature of our mechanism is that the detection of speculation failure and the restoration of the computational state on failure are implemented by a small extension of the coherent cache. It is also remarkable that the operations performed on speculation success and failure each take constant time, independent of the number of speculative accesses. Although we previously reported that specMEM achieves a significant reduction in execution time, for example 13% for LU decomposition, we also observed that it could be implemented more efficiently. This paper discusses more efficient implementations of specMEM using an extra cache state and/or a non-speculative secondary cache.


Parallel Symbolic Computation | 1997

Improvement of message communication in concurrent logic language

Kazuhiko Ohno; Masahiko Ikawa; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita; Masahiro Goshima

In the execution of the concurrent logic language KL1 on message-passing multiprocessors, frequent fine-grained communications cause drastic inefficiency. We propose an optimization scheme that achieves high message granularity by packing data transfers. Using static analysis, we derive the data types required by the receiver process; with this information, data of these types are packed into large messages. Our evaluation shows that the number of communications is considerably reduced, which shortens the execution time of programs with large communication overhead.
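
The static type analysis at the heart of the scheme cannot be reproduced from the abstract alone; the C sketch below only contrasts the runtime effect: one message per list cell versus one packed message for the whole list that the receiver is known to need. The names and the fixed-size buffer are assumptions.

#include <stdio.h>

typedef struct Cell { int car; struct Cell *cdr; } Cell;

static unsigned messages_sent = 0;

/* Naive scheme: one fine-grained message per list cell. */
static void send_naive(const Cell *l) {
    for (; l; l = l->cdr)
        messages_sent++;        /* each cell is a separate message */
}

/* Packed scheme: because static analysis tells us the receiver
 * needs the whole list, flatten it into one large message. */
static void send_packed(const Cell *l) {
    int buf[256], n = 0;
    for (; l && n < 256; l = l->cdr)
        buf[n++] = l->car;      /* pack cells into one buffer */
    messages_sent++;            /* a single coarse-grained message */
    (void)buf;
}

int main(void) {
    Cell cells[100];
    for (int i = 0; i < 100; i++) {
        cells[i].car = i;
        cells[i].cdr = (i + 1 < 100) ? &cells[i + 1] : NULL;
    }
    messages_sent = 0; send_naive(cells);
    printf("naive:  %u messages\n", messages_sent);   /* 100 */
    messages_sent = 0; send_packed(cells);
    printf("packed: %u messages\n", messages_sent);   /* 1   */
    return 0;
}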


Lecture Notes in Computer Science | 1997

Efficient Goal Scheduling in Concurrent Logic Language using Type-Based Dependency Analysis

Kazuhiko Ohno; Masahiko Ikawa; Masahiro Goshima; Shin-ichiro Mori; Hiroshi Nakashima; Shinji Tomita

In the execution model of concurrent logic languages such as KL1, each goal is regarded as a unit of concurrent execution. Although this fine-grained concurrency control enables flexible concurrent/parallel programming, its overhead also makes implementations inefficient. We propose an efficient goal scheduling scheme using the results of static analysis. In this scheme, we obtain precise dependency relations among goals using type-based dependency analysis. Each set of goals without mutual concurrency is then compiled into one thread, a sequence of statically ordered goals, to reduce the overhead of goal scheduling. Since stacks are used to hold the goal environments of each thread, the number of garbage collections is also reduced. A preliminary evaluation shows that our scheme considerably reduces goal scheduling overhead and thus achieves a 1.3–3x speedup.
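
As a loose illustration in C, the sketch below contrasts the two execution styles described above: dispatching each goal through a scheduler versus running a set of goals proven free of mutual concurrency as one compiled thread on the ordinary call stack. The goal functions and the overhead counter are hypothetical.

#include <stdio.h>

static unsigned scheduler_ops = 0;

/* Three goals with a static dependency chain a -> b -> c. */
static int goal_a(int x) { return x + 1; }
static int goal_b(int x) { return x * 2; }
static int goal_c(int x) { return x - 3; }

/* Fine-grained scheme: every goal goes through the scheduler,
 * paying enqueue/dispatch overhead per goal. */
static int run_scheduled(int x) {
    int (*queue[])(int) = { goal_a, goal_b, goal_c };
    for (int i = 0; i < 3; i++) {
        scheduler_ops++;          /* per-goal scheduling overhead */
        x = queue[i](x);
    }
    return x;
}

/* Compiled thread: the analysis showed the goals have no mutual
 * concurrency, so they become one statically ordered sequence and
 * their environments live on the ordinary call stack. */
static int run_thread(int x) {
    scheduler_ops++;              /* one dispatch for the whole thread */
    return goal_c(goal_b(goal_a(x)));
}

int main(void) {
    scheduler_ops = 0;
    printf("scheduled: %d (%u ops)\n", run_scheduled(5), scheduler_ops);
    scheduler_ops = 0;
    printf("thread:    %d (%u ops)\n", run_thread(5), scheduler_ops);
    return 0;
}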

Collaboration


Dive into Kazuhiko Ohno's collaborations.

Top Co-Authors

Haruyuki Matsuo

Toyohashi University of Technology
