Ikuo Nakata
University of Tsukuba
Publications
Featured research published by Ikuo Nakata.
The Journal of Supercomputing | 2001
Minyi Guo; Ikuo Nakata
Array redistribution is often required in programs on distributed-memory parallel computers. It is essential to use efficient redistribution algorithms; otherwise the performance of the programs may degrade considerably. The redistribution overhead consists of two parts: index computation and inter-processor communication. In this paper, using a notation for the local data description called an LDD, we propose a framework that optimizes array redistribution in both index computation and inter-processor communication; that is, our work optimizes not only the computation cost but also the communication cost of array redistribution algorithms. We present an efficient index computation method and generate a schedule that minimizes the number of communication steps and eliminates node contention in each communication step. Experiments show the efficiency and flexibility of our techniques.
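To make the index-computation side of redistribution concrete, here is a minimal sketch, not the paper's LDD formalism: for a 1-D array redistributed from a CYCLIC(b1) to a CYCLIC(b2) distribution over P processors, it computes each element's owner under both distributions and groups elements into per-(sender, receiver) message sets. All function names and the parameter choices are illustrative assumptions.

```python
def owner(i, block, P):
    """Processor that owns global index i under a CYCLIC(block) distribution."""
    return (i // block) % P

def redistribution_sets(n, b1, b2, P):
    """Map (sender, receiver) -> global indices that must move when
    redistributing an n-element array from CYCLIC(b1) to CYCLIC(b2)."""
    msgs = {}
    for i in range(n):
        s, r = owner(i, b1, P), owner(i, b2, P)
        if s != r:  # elements already in place need no communication
            msgs.setdefault((s, r), []).append(i)
    return msgs

# Example: 16 elements, CYCLIC(2) -> CYCLIC(4) on 2 processors.
sets = redistribution_sets(n=16, b1=2, b2=4, P=2)
```

An efficient implementation would of course enumerate these sets in closed form per processor rather than scanning every global index, which is the kind of index-computation cost the paper optimizes.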
international conference on parallel and distributed systems | 1998
Minyi Guo; Ikuo Nakata; Yoshiyuki Yamashita
Array redistribution is often required in programs on distributed-memory parallel computers. It is essential to use efficient redistribution algorithms; otherwise the performance of the programs may degrade considerably. The redistribution overhead consists of two parts: index computation and inter-processor communication. Without communication scheduling, a redistribution algorithm may suffer communication contention, which increases communication waiting time. To solve this problem, we propose a technique that schedules the communication so that it becomes contention-free. Our approach first generates a communication table representing the communication relations between sending and receiving nodes. From the communication table we then derive a communication scheduling table, each column of which is a permutation of the receiving node numbers in one communication step. The communications in our redistribution algorithm are therefore contention-free. Our approach can also handle multi-dimensional shape-changing redistribution.
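The permutation property of the scheduling table can be illustrated with a simple round-robin schedule for an all-to-all exchange; this is a generic sketch of the contention-free idea, not the paper's table-construction algorithm. In step k, node p sends to node (p + k) mod P, so every step is a permutation of receiver ids and no receiver is hit twice in the same step.

```python
def contention_free_schedule(P):
    """schedule[k][p] = receiver of node p in communication step k.
    Each row (step) is a permutation of {0, ..., P-1}, so no two
    senders target the same receiver in the same step."""
    return [[(p + k) % P for p in range(P)] for k in range(P)]

sched = contention_free_schedule(4)
```

With P nodes this completes an all-to-all exchange in P steps; the paper's contribution is deriving such contention-free permutations for the irregular communication patterns that arise in shape-changing redistribution.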
international conference on supercomputing | 1993
Hiroshi Nakamura; Taisuke Boku; Hideo Wada; Hiromitsu Imori; Ikuo Nakata; Yasuhiro Inagami; Kisaburo Nakazawa; Yoshiyuki Yamashita
In this paper, we present a new scalar architecture for high-speed vector processing. Without using cache memory, the proposed architecture tolerates main memory access latency by introducing slide-windowed floating-point registers with a data-preloading feature and a pipelined memory. The architecture retains upward compatibility with existing scalar architectures. In the new architecture, software can control the window structure, an advantage over our previous register-window work: registers are utilized more flexibly, computational efficiency is greatly enhanced, and the added flexibility helps the compiler generate efficient object code easily. We have evaluated its performance on the Livermore Fortran Kernels. The evaluation results show that the proposed architecture reduces the penalty of main memory access better than either an ordinary scalar processor or a processor with cache prefetching. With 64 registers, the proposed architecture tolerates a memory access latency of 30 CPU cycles. Compared with our previous work, it hides longer memory access latency with fewer registers.
parallel computing | 1999
Kisaburo Nakazawa; Hiroshi Nakamura; Taisuke Boku; Ikuo Nakata; Yoshiyuki Yamashita
Computational Physics by Parallel Array Computer System (CP-PACS) is a massively parallel processor developed and in full operation at the Center for Computational Physics at the University of Tsukuba. It is an MIMD machine with distributed memory, equipped with 2048 processing units and 128 GB of main memory. The theoretical peak performance of CP-PACS is 614.4 Gflops; it achieved 368.2 Gflops on the Linpack benchmark in 1996, which at that time was the fastest rating in the world. CP-PACS has two remarkable features: a Pseudo Vector Processing feature (PVP-SW) on each node processor, which performs high-speed vector processing on a single-chip superscalar microprocessor, and a 3-dimensional Hyper-Crossbar (3-D HXB) interconnection network, which provides high-speed, flexible communication among node processors. In this article, we present an overview of CP-PACS, architectural topics, some details of the hardware and support software, and several performance results.
Acta Informatica | 1986
Ikuo Nakata; Masataka Sassa
A method for building small, fast LALR parsers for regular right part grammars is given. No grammar transformation is required, and no extra LALR parser states are needed for the recognition of strings generated by a right part. At some reduce states the parser may refer to lookback states (states in which the parser may be restarted after the reduction); an optimizing algorithm to reduce these references is also given.
hawaii international conference on system sciences | 1994
Hiroshi Nakamura; Hiromitsu Imori; Yoshiyuki Yamashita; Kisaburo Nakazawa; Taisuke Boku; H. Li; Ikuo Nakata
We present a new scalar processor for high-speed vector processing and its evaluation. The proposed processor hides long main-memory access latency by introducing slide-windowed floating-point registers with a data-preloading feature and a pipelined memory. Owing to the slide-window structure, the proposed processor can utilize more floating-point registers while keeping upward compatibility with existing scalar architectures. We have evaluated its performance on the Livermore Fortran Kernels. The evaluation results show that the proposed processor drastically reduces the penalty of main memory access compared with an ordinary scalar processor; for example, with 96 registers it hides a memory access latency of 70 CPU cycles when the main memory throughput is 8 bytes/cycle. From these results, we conclude that the proposed architecture is well suited to high-speed vector processing.
IEEE Transactions on Software Engineering | 1991
Ikuo Nakata; Masataka Sassa
A description is given of features added to a conventional programming language for manipulating streams of values. A stream is a sequence of values of a fixed type; the number of elements of a stream may be determined at execution time, and evaluation of each element can be postponed until its value is actually needed. Many programs can be expressed naturally and clearly as networks of processes communicating by means of streams. Such a network, called a composite function, consists of several component functions. Since component functions are connected solely by streams, they greatly increase the flexibility of combination and the reusability of programs, and loop statements can be viewed as iterative statements over streams. One general problem in these networks is the mechanism for terminating each process of the network; a practical solution to this problem is presented. Comparisons with other programming styles, such as coroutines, Lisp, functional programming, and dataflow languages, are described. Three modes of execution are considered for the implementation of composite functions: parallel mode, coroutine mode, and inline mode. In the inline mode, a composite function is expanded and transformed into a single function, realizing maximum run-time efficiency; algorithms for this expansion are given.
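As an illustrative analogy (not the paper's language extension), Python generators exhibit the same behavior: elements are produced lazily on demand, and a composite function is a pipeline of component functions connected solely by streams, with the consumer's demand implicitly terminating the producers.

```python
def naturals():
    """Producer: an unbounded stream of integers, evaluated lazily."""
    i = 0
    while True:
        yield i
        i += 1

def squares(xs):
    """Component function: consumes one stream, produces another."""
    for x in xs:
        yield x * x

def take(n, xs):
    """Consumer: pulls n elements; nothing beyond them is ever computed."""
    return [next(xs) for _ in range(n)]

# A "composite function" built by connecting components with streams.
result = take(5, squares(naturals()))  # [0, 1, 4, 9, 16]
```

This corresponds roughly to the coroutine execution mode described above; the paper's inline mode instead fuses such a pipeline into a single loop at compile time.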
Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05) | 2005
Mitsugu Suzuki; Nobuhisa Fujinami; Takeaki Fukuoka; Tan Watanabe; Ikuo Nakata
COINS is a compiler infrastructure that makes it easy to construct a new compiler by adding to or modifying only part of COINS's compilation and optimization features. SIMD optimization is a major advantage. We present an overview of COINS and some topics on its SIMD optimization.
Software - Practice and Experience | 1995
Masataka Sassa; Harushi Ishizuka; Ikuo Nakata
We herein describe a compiler generator, Rie, which is based on a one‐pass‐type attribute grammar. LR‐attributed grammars are one class of attribute grammars in which attribute evaluation can be performed in one pass during LR parsing without creating a parse tree. Rie was developed based on a variant of an LR‐attributed grammar called ECLR‐attributed grammar (equivalence class LR‐attributed grammar), in which equivalence relations are introduced into the LR‐attributed grammar. Rie generates a one‐pass compiler from a compiler description given in attribute grammar form. Many language processors have been developed using Rie. The generated compiler is only about 1.8 times slower than a handwritten compiler, which is fairly efficient for a compiler generated from formal descriptions.
Advances in Software Science and Technology | 1993
Ikuo Nakata
It is difficult to express the definition of C comments as an ordinary regular expression. However, the definition can be expressed as a simple regular expression by introducing a special symbol, called the any-symbol, that represents any single character, or by introducing a kind of negation symbol into regular expressions. In general, the problem of string pattern matching can be expressed as such an extended regular expression, and the finite state automaton generated from the expression is equivalent to the Knuth-Morris-Pratt pattern-matching algorithm [4]. In particular, if we use any-symbols, the pattern is not restricted to a string of characters; it can be any regular expression. Our method can also be applied to the problem of repeated pattern matching, and the Aho-Corasick algorithm [3] can be derived mechanically from an extended regular expression that contains any-symbols.
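The matcher that the abstract says falls out of the extended regular expression "any* followed by the pattern" is, for a string pattern, exactly the Knuth-Morris-Pratt algorithm. A standard textbook KMP implementation, included here only to show the automaton behavior being referred to:

```python
def kmp_failure(pat):
    """fail[i] = length of the longest proper prefix of pat[:i+1]
    that is also a suffix of it (the KMP failure function)."""
    fail = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k > 0 and pat[i] != pat[k]:
            k = fail[k - 1]
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pat):
    """Return the start index of every occurrence of pat in text,
    scanning text once and never backing up in it."""
    fail, k, hits = kmp_failure(pat), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pat[k]:
            k = fail[k - 1]
        if c == pat[k]:
            k += 1
        if k == len(pat):
            hits.append(i - k + 1)
            k = fail[k - 1]  # allow overlapping occurrences
    return hits
```

The states k of this matcher correspond to the states of the finite automaton derived from the extended regular expression; the while-loop over fail is the automaton's transition on a mismatching character.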