Kazuaki Murakami
Kyushu University
Publication
Featured research published by Kazuaki Murakami.
International Symposium on Low Power Electronics and Design | 1999
Koji Inoue; Tohru Ishihara; Kazuaki Murakami
This paper proposes a new approach that uses way prediction to achieve high performance and low energy consumption in set-associative caches. By accessing only the single predicted cache way, instead of all the ways in a set, energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
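The trade-off behind way prediction can be shown with a small back-of-the-envelope model. The sketch below is not the paper's simulator; the per-way energy, access latency, and prediction hit rate are invented parameters chosen only to illustrate how the energy-delay product is formed and why it improves when most accesses probe a single way.

```python
# Illustrative model of a way-predicting set-associative cache.
# All numbers are assumed, not taken from the paper.

def ed_product(n_ways, e_way, t_access, hit_rate_pred=None):
    """Return (energy, delay, energy*delay) per access.

    Conventional cache: all n_ways are probed in parallel.
    Way-predicting cache: probe one way first; on a misprediction,
    probe all ways in a second cycle.
    """
    if hit_rate_pred is None:                     # conventional cache
        energy = n_ways * e_way
        delay = t_access
    else:                                         # way-predicting cache
        energy = (hit_rate_pred * e_way
                  + (1 - hit_rate_pred) * n_ways * e_way)
        delay = (hit_rate_pred * t_access
                 + (1 - hit_rate_pred) * 2 * t_access)
    return energy, delay, energy * delay

conv = ed_product(n_ways=4, e_way=1.0, t_access=1.0)
pred = ed_product(n_ways=4, e_way=1.0, t_access=1.0, hit_rate_pred=0.9)
print("conventional   E, D, ED:", conv)
print("way-predicting E, D, ED:", pred)
print("ED reduction: %.0f%%" % (100 * (1 - pred[2] / conv[2])))
```

With these assumed numbers (4 ways, 90% prediction hit rate), the ED product drops by roughly 64%, which is in the same range as the 60-70% reported above.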
International Symposium on Low Power Electronics and Design | 1998
Taku Ohsawa; Koji Kai; Kazuaki Murakami
In merged DRAM/logic LSIs, the DRAM portion may suffer from a shorter data retention time because of heat and noise caused by the logic portion. Frequent refreshes increase power consumption and also disturb normal DRAM accesses, leading to performance degradation. To overcome this problem, we propose several DRAM refresh architectures whose basic idea is to eliminate unnecessary DRAM refreshes. We estimated the DRAM refresh count while executing benchmark programs under several architecture models. With the most effective combination of the architectures, we obtained a reduction of more than 80% against a conventional DRAM refresh architecture for most benchmark programs. Furthermore, even when normal DRAM accesses are taken into account, we obtained a reduction of more than 50% for several benchmarks.
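The core idea, skipping refreshes that preserve no useful data, can be illustrated with a toy counter. The row count, retention time, and liveness pattern below are invented; the paper instead drives its estimates from real benchmark traces and several architecture models.

```python
# Toy estimate of refresh-count reduction when rows holding no live data
# are not refreshed. All parameters are assumptions for illustration.

RETENTION = 64          # refresh deadline per row (arbitrary time units)
ROWS = 8
SIM_TIME = 1024

# live[r] = True while row r holds data that will still be read later.
live = [r < 3 for r in range(ROWS)]   # pretend only 3 of 8 rows are live

conventional = 0
selective = 0
for t in range(0, SIM_TIME, RETENTION):
    for r in range(ROWS):
        conventional += 1              # conventional: refresh every row
        if live[r]:
            selective += 1             # selective: refresh only live rows

print("conventional refreshes:", conventional)
print("selective refreshes   :", selective)
print("reduction: %.1f%%" % (100 * (1 - selective / conventional)))
```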
IEEE International Conference on High Performance Computing, Data and Analytics | 2008
Ryutaro Susukita; Hisashige Ando; Mutsumi Aoyagi; Hiroaki Honda; Yuichi Inadomi; Koji Inoue; Shigeru Ishizuki; Yasunori Kimura; Hidemi Komatsu; Motoyoshi Kurokawa; Kazuaki Murakami; Hidetomo Shibamura; Shuji Yamamura; Yunqing Yu
To predict application performance on an HPC system is an important technology for designing the computing system and developing applications. However, accurate prediction is a challenge, particularly, in the case of a future coming system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. This method combines modeling of sequential performance on a single processor and macro-level simulations of applications for parallel performance on the entire system. In the simulation, the execution flow is traced but kernel computations are omitted for reducing the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20 %. We employed the method in designing a hypothetical petascale system of 32768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.
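A minimal sketch of what "macro-level simulation" means in practice: the event stream of the program is replayed, but each kernel contributes a modeled time rather than being executed. The kernel costs, network parameters, event trace, and the log(P) allreduce model below are assumptions for illustration, not the paper's performance model.

```python
import math

KERNEL_TIME = {"dgemm": 2.0e-3, "fft": 5.0e-4}   # assumed single-core kernel times (s)
LINK_LATENCY, LINK_BW = 2.0e-6, 5.0e9            # assumed network latency (s), bandwidth (B/s)

def simulate(trace, n_procs):
    """trace: list of ('compute', kernel_name) or ('allreduce', message_bytes)."""
    t = 0.0
    for kind, arg in trace:
        if kind == "compute":
            t += KERNEL_TIME[arg]                  # modeled time; the kernel is not executed
        elif kind == "allreduce":
            steps = math.ceil(math.log2(n_procs))  # assumed log(P) communication steps
            t += steps * (LINK_LATENCY + arg / LINK_BW)
    return t

trace = [("compute", "dgemm"), ("allreduce", 8 * 1024),
         ("compute", "fft"), ("allreduce", 8 * 1024)] * 100
print("predicted time on 32,768 cores: %.4f s" % simulate(trace, 32768))
```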
IEEE Transactions on Applied Superconductivity | 2009
Irina Kataeva; Hiroyuki Akaike; Akira Fujimaki; Nobuyuki Yoshikawa; Naofumi Takagi; Koji Inoue; Hiroaki Honda; Kazuaki Murakami
We report progress in the development of an operand routing network (ORN) for an SFQ reconfigurable data-path processor (SFQ-RDP). The SFQ-RDP is implemented as a two-dimensional array of floating-point units (FPUs), whose outputs can be connected to the inputs of one or more FPUs in the next row via the ORN. We have considered two ORN architectures: one based on NDRO switches and the other on crossbar switches. The comparison shows that the crossbar-based ORN has better performance owing to its regular pipelined structure. We have designed a crossbar switch with a multicasting function and a 1-to-2 ORN prototype for a 2.5 kA/cm2 process. The circuits have been experimentally tested at frequencies up to 36 GHz.
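Functionally, the routing task is small even though the SFQ circuit design is not. The sketch below only models the behaviour of a multicasting crossbar between two FPU rows; the port counts and configuration format are assumptions, not the circuit described in the paper.

```python
# Behavioural model of a multicasting crossbar between FPU rows.
# Sizes and configuration encoding are assumed for illustration.

def crossbar_route(outputs, config):
    """outputs: values produced by the FPU output ports of row i.
    config[j]: index of the row-i output driven onto row-(i+1) input j,
    or None if that input is unused. The same output index may appear in
    several entries, which models multicasting."""
    return [None if src is None else outputs[src] for src in config]

row_i_outputs = [3.5, -1.0, 7.25, 0.0]
# Inputs 0 and 1 both take output 2 (multicast); input 3 is idle.
config = [2, 2, 0, None]
print(crossbar_route(row_i_outputs, config))   # [7.25, 7.25, 3.5, None]
```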
Proceedings of SPIE, the International Society for Optical Engineering | 2006
Makoto Sugihara; Taiga Takata; Kenta Nakamura; Ryoichi Inanami; Hiroaki Hayashi; Katsumi Kishimoto; Tetsuya Hasebe; Yukihiro Kawano; Yusuke Matsunaga; Kazuaki Murakami; Katsuya Okumura
Character projection is used for maskless lithography and is a potential technology for future photomask manufacture. Its drawback is low throughput, which leads to higher IC prices. This paper discusses a technology mapping technique for enhancing the throughput of character projection. The number of EB (electron beam) shots needed to draw an entire chip determines the fabrication time for the chip; reducing the number of EB shots therefore increases the throughput of character projection equipment and reduces the cost of producing ICs. Our technology mapping technique treats the number of EB shots needed to draw an entire chip as an objective to minimize. Compared with a conventional technology mapping, our technique achieved a 19.6% reduction in the number of EB shots without any performance degradation of the ICs, and a 48.8% reduction under no performance constraints. Our technique is easy for both IC designers and equipment developers to adopt because it is a software approach that requires no modification of character projection equipment.
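A highly simplified view of shot-count-aware mapping: a cell registered as a character on the CP stencil takes one shot, while an unregistered cell is written rectangle by rectangle. The cell names, rectangle counts, candidate lists, and the greedy selection below are all invented; the paper's mapper additionally respects performance constraints.

```python
# Toy shot-count-aware technology mapping. All costs are assumptions.

STENCIL = {"NAND2", "INV", "DFF"}                 # characters on the stencil
RECTS = {"NAND2": 6, "INV": 3, "DFF": 22, "AOI22": 10, "NAND4": 9}

def shots(cell):
    # 1 shot if the cell is a stencil character, otherwise 1 shot per rectangle
    return 1 if cell in STENCIL else RECTS[cell]

def map_netlist(choices):
    """choices: one list of functionally equivalent candidate cells per
    netlist node. Pick the candidate with the fewest shots per node
    (a real mapper would also check timing)."""
    total, mapping = 0, []
    for cands in choices:
        best = min(cands, key=shots)
        mapping.append(best)
        total += shots(best)
    return mapping, total

netlist = [["AOI22", "NAND2"], ["NAND4", "NAND2"], ["DFF"]]
print(map_netlist(netlist))   # prefers stencil characters: 1 + 1 + 1 = 3 shots
```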
Design, Automation, and Test in Europe | 2007
Makoto Sugihara; Tohru Ishihara; Kazuaki Murakami
This paper presents a task scheduling method for reliable cache architectures (RCAs) in multiprocessor systems. RCAs dynamically switch their operation modes to reduce the use of vulnerable SRAMs under real-time constraints. A mixed integer programming model is built for minimizing vulnerability under real-time constraints. Experimental results show that our task scheduling method achieved 47.7-99.9% less vulnerability than a conventional approach.
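The shape of the optimization problem can be shown with a tiny brute-force version rather than the paper's mixed integer programming model: each task runs either in a fast but vulnerable cache mode or a slower reliable mode, and the schedule must meet a deadline while minimizing total vulnerable time. Task times, the deadline, and the single-sequence (rather than multiprocessor) setting are assumptions for illustration only.

```python
# Toy mode-selection problem solved by brute force (the paper uses MIP).

from itertools import product

# (execution time in vulnerable mode, execution time in reliable mode)
tasks = [(2, 3), (4, 6), (1, 2), (3, 5)]
DEADLINE = 14

best = None
for modes in product((0, 1), repeat=len(tasks)):        # 0 = vulnerable, 1 = reliable
    makespan = sum(t[m] for t, m in zip(tasks, modes))
    if makespan > DEADLINE:
        continue                                         # violates the real-time constraint
    vulnerable_time = sum(t[0] for t, m in zip(tasks, modes) if m == 0)
    if best is None or vulnerable_time < best[0]:
        best = (vulnerable_time, modes, makespan)

print("vulnerable time %d, modes %s, makespan %d" % best)
```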
Field-Programmable Logic and Applications | 2006
Hamid Noori; F. Mehdipour; Kazuaki Murakami; Koji Inoue; Morteza SahebZamani
This paper presents a reconfigurable functional unit (RFU) for an adaptive dynamic extensible processor. The processor can tune its extended instructions to the target applications after chip fabrication. Custom instructions (CIs) are generated from the hot basic blocks identified during the training mode, and in the normal mode the CIs are executed on the RFU. A quantitative approach was used to design the RFU, which is a matrix of functional units with 8 inputs and 6 outputs. Using the proposed RFU, performance is enhanced by a factor of up to 1.25 for 22 applications of MiBench. The processor needs no extra opcodes for CIs, no new compiler, and no source code modification or recompilation.
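The training-mode selection step can be sketched as a simple filter over profiling data: a basic block becomes a CI only if it is hot enough and fits the RFU's 8-input/6-output limit. The profile counts, operand counts, and threshold below are invented; the paper's CI generation is more involved.

```python
# Sketch of selecting custom instructions from hot basic blocks.
# Profile data and the hotness threshold are assumptions.

RFU_INPUTS, RFU_OUTPUTS = 8, 6

# block id -> (execution count from profiling, live-in operands, live-out operands)
profile = {
    "bb12": (91000, 5, 2),
    "bb07": (64000, 9, 3),    # too many inputs, cannot become a CI
    "bb31": (22000, 6, 4),
    "bb02": (1500, 3, 1),     # cold, not worth a CI
}
HOT_THRESHOLD = 10000

custom_instructions = [
    bb for bb, (count, ins, outs) in profile.items()
    if count >= HOT_THRESHOLD and ins <= RFU_INPUTS and outs <= RFU_OUTPUTS
]
print("selected CIs:", sorted(custom_instructions))   # ['bb12', 'bb31']
```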
International Parallel and Distributed Processing Symposium | 2006
Farhad Mehdipour; Morteza Saheb Zamani; Hamid R. Ahmadifar; Mehdi Sedighi; Kazuaki Murakami
In reconfigurable systems, reconfiguration latency is an important factor that impacts system performance. In this paper, a framework is proposed that integrates the temporal partitioning and physical design phases to perform a static compilation process for reconfigurable computing systems. A temporal partitioning algorithm is proposed that attempts to decrease the reconfiguration time on a partially reconfigurable hardware. The algorithm searches for similar single operations or pairs of operations between subsequent partitions; considering similar pairs instead of single nodes reduces the complexity of the routing process. Using this technique, a smaller reconfiguration bit-stream is obtained, which directly decreases the reconfiguration overhead at run time. A complementary algorithm attempts to increase the similarity of subsequent partitions by searching for similar pairs and using a technique called dummy node insertion. An incremental physical design process based on the similar configurations produced in the partitioning stage improves the metrics over iterations.
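A simplified illustration of the similarity idea: operations that appear in both of two consecutive partitions can keep their configuration, so only the differing part of the bit-stream must be reloaded. The partition contents and the multiset-overlap metric below are invented stand-ins for the paper's pair-based similarity measure.

```python
# Toy similarity measure between consecutive temporal partitions.
# Partition contents are assumptions for illustration.

from collections import Counter

partition_a = ["add", "mul", "add", "sub", "mul"]
partition_b = ["add", "mul", "mul", "xor", "mul"]

def reused_fraction(prev, nxt):
    """Fraction of the next partition's operations that match an operation
    of the same type in the previous partition, used here as a proxy for
    how much of the reconfiguration bit-stream can be skipped."""
    common = Counter(prev) & Counter(nxt)          # multiset intersection
    return sum(common.values()) / len(nxt)

print("reusable configuration: %.0f%%" % (100 * reused_fraction(partition_a, partition_b)))
# A dummy node could be inserted into partition_a for the unmatched 'xor'
# to raise the similarity further, at the cost of one idle functional unit.
```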
ACM SIGARCH Computer Architecture News | 1991
Morihiro Kuga; Kazuaki Murakami; Shinji Tomita
A new superscalar processor architecture, called DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar), is proposed and discussed. DSNS has the following major architectural features.
1. Dynamically-hazard-resolved superscalar: DSNS is object-code compatible with respect to the degree of superscalar. Pipeline interlock hardware should be provided for detecting and resolving hazards at run time.
2. Statically-code-scheduled superscalar: The performance of DSNS is not necessarily scalable with respect to the degree of superscalar. Compilers must be responsible for scheduling instructions to reduce pipeline stalls for a particular degree of superscalar.
3. Nonuniform superscalar: Although a nonuniform superscalar potentially suffers from instruction-class conflicts, it can be more cost-effective than a uniform superscalar. Again, compilers must take care that the class conflicts do not increase structural hazards.
4. Static memory disambiguation: The DSNS architecture provides three types of LOAD/STORE instructions: strongly ordered, weakly ordered, and unordered. Memory disambiguation at compile time is responsible for marking each LOAD/STORE instruction. At run time, processors need not detect or resolve data hazards for every type; they simply perform memory accesses in order for strongly or weakly ordered instructions, and in arbitrary order for unordered ones.
5. Static branch prediction with a branch-target buffer: Branch instructions predicted as taken by compilers are stored in the branch-target buffer. Hardware never guesses the outcomes of branch instructions.
6. Early branch resolution with advanced conditioning: Advanced conditioning allows branch decisions to precede the corresponding branches further. It reduces the branch delay and results in resolving branches early in the pipeline.
7. Conditional-mode execution with dual register files: The dual register file facilitates maintaining the precise machine state that otherwise might be violated by speculative execution such as conditional mode.
8. Weakly precise interrupts: The DSNS architecture defines interrupts as being somewhat imprecise but restartable with the help of interrupt handlers. This definition alleviates the hardware constraints of strictly ensuring precise interrupts.
This paper also presents an implementation of the DSNS architecture. The DSNS processor prototype under development is a four-stage pipelined processor of superscalar degree four. The instruction pipelines, especially the branch pipeline, are discussed in detail.
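Feature 4 (static memory disambiguation) can be made concrete with a small issue-order sketch: operations the compiler marks as strongly or weakly ordered must issue in program order, while unordered ones may be reordered freely. The instruction stream and the particular reordering policy below are invented for illustration and are not the DSNS hardware.

```python
# Illustrative issue-order rule for compiler-marked memory operations.
# The program and reordering policy are assumptions.

def issue_order(mem_ops):
    """mem_ops: list of (name, ordering) in program order, where ordering
    is 'strong', 'weak', or 'unordered'. Returns one legal issue order:
    ordered operations stay in program order; unordered ones may move
    freely (here they are simply issued first)."""
    unordered = [op for op in mem_ops if op[1] == "unordered"]
    ordered = [op for op in mem_ops if op[1] != "unordered"]
    return unordered + ordered

program = [("LD a", "strong"), ("ST b", "unordered"),
           ("LD c", "weak"), ("ST d", "unordered")]
print(issue_order(program))
```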
Journal of Systems Architecture | 2011
Farhad Mehdipour; Hiroaki Honda; Koji Inoue; Hiroshi Kataoka; Kazuaki Murakami
A large-scale reconfigurable data-path processor (LSRDP) implemented with single-flux quantum (SFQ) circuits is introduced; it is integrated with a general-purpose processor to accelerate data flow graphs (DFGs) extracted from scientific applications. A number of applications were identified and analyzed throughout the LSRDP design procedure. The various design steps, particularly the DFG mapping process, are discussed, and our techniques for optimizing the area of the accelerator are presented as well. Different design alternatives are examined by exploring the LSRDP design space, and an appropriate architecture is determined for the accelerator. Preliminary experiments demonstrate the capability of the designed architecture to achieve performance of up to 210 Gflops for the attempted applications.
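The DFG mapping constraint can be sketched simply: because operand routing only carries data from one row of FPUs to the next, every operation must be placed in a later row than all of its producers. The DFG, the row-width limit, and the ASAP-style levelling below are invented; the paper's mapper also optimizes area and routing.

```python
# Sketch of row assignment for a DFG under downward-only data flow.
# The graph and width limit are assumptions for illustration.

def map_to_rows(dfg, width):
    """dfg: {node: [predecessor nodes]}, listed in topological order.
    Returns {node: row index} using ASAP levelling capped by the
    number of FPUs available per row."""
    placed = {}
    for node in dfg:
        level = 1 + max((placed[p] for p in dfg[node]), default=-1)
        # push the node down while its level is already full
        while len([n for n, r in placed.items() if r == level]) >= width:
            level += 1
        placed[node] = level
    return placed

dfg = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["c", "d"]}
print(map_to_rows(dfg, width=2))   # {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}
```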