Publication


Featured research published by Tokuzo Kiyohara.


international symposium on computer architecture | 1993

Register connection: a new approach to adding registers into instruction set architectures

Tokuzo Kiyohara; Scott A. Mahlke; William Y. Chen; Roger A. Bringmann; Richard E. Hank; Sadun Anik; Wen-mei W. Hwu

Code optimization and scheduling for superscalar and superpipelined processors often increase the register requirement of programs. For existing instruction sets with a small to moderate number of registers, this increased register requirement can be a factor that limits the effectiveness of the compiler. In this paper, we introduce a new architectural method for adding a set of extended registers into an architecture. Using a novel concept of connection, this method allows the data stored in the extended registers to be accessed by instructions that apparently reference core registers. Furthermore, we address the technical issues involved in applying the new method to an architecture: instruction set extension, procedure call convention, context switching considerations, upward compatibility, efficient implementation, compiler support, and performance. Experimental results based on a prototype compiler and execution-driven simulation show that the proposed method can significantly improve the performance of superscalar processors with a small or moderate number of registers.
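
The idea of a connection can be illustrated with a toy model in which instructions only ever name core registers, and a small table decides which physical register each core index currently refers to. This is a minimal sketch under stated assumptions: the class name, the `connect` operation, and the single shared read/write mapping are all hypothetical, since the abstract does not specify how the paper defines or resets connections.

```python
class RegisterConnectionFile:
    """Toy register file: core indices are redirected through a connection table.

    Hypothetical model for illustration only; not the paper's exact semantics.
    """

    def __init__(self, num_core, num_extended):
        self.num_core = num_core
        # Physical storage: core registers first, then extended registers.
        self.physical = [0] * (num_core + num_extended)
        # Initially every core index is connected to its own physical register.
        self.connection = list(range(num_core))

    def connect(self, core_index, physical_index):
        """Redirect a core register index to any physical register."""
        if not 0 <= core_index < self.num_core:
            raise IndexError("core_index out of range")
        if not 0 <= physical_index < len(self.physical):
            raise IndexError("physical_index out of range")
        self.connection[core_index] = physical_index

    def read(self, core_index):
        """Read through the connection, as an ordinary instruction would."""
        return self.physical[self.connection[core_index]]

    def write(self, core_index, value):
        """Write through the connection, as an ordinary instruction would."""
        self.physical[self.connection[core_index]] = value


rf = RegisterConnectionFile(num_core=4, num_extended=4)
rf.write(1, 10)    # lands in physical register 1
rf.connect(1, 6)   # core index 1 now refers to extended register 6
rf.write(1, 99)    # same instruction encoding, different physical register
rf.connect(1, 1)   # restore the default connection
```

In this sketch the instruction encoding never grows: the extra registers become reachable only by changing the table, which is the property the abstract describes as instructions that "apparently reference core registers".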


conference on high performance computing (supercomputing) | 1992

Compiler code transformations for superscalar-based high performance systems

Scott A. Mahlke; William Y. Chen; John C. Gyllenhaal; Wen-mei W. Hwu; Pohua P. Chang; Tokuzo Kiyohara

A set of compiler transformations designed to increase instruction-level parallelism is described. The effectiveness of these transformations is evaluated using 40 loop nests extracted from a range of supercomputer applications. This evaluation shows that increasing execution resources in superscalar/VLIW node processors yields little performance improvement unless loop unrolling and register renaming are applied. It also reveals that these two transformations are sufficient for DOALL loops. However, more advanced transformations are required in order for serial and DOACROSS loops to fully benefit from the increased execution resources. The results show that the six additional transformations studied satisfy most of this need.


languages and compilers for parallel computing | 1992

Using Profile Information to Assist Advanced Compiler Optimization and Scheduling

William Y. Chen; Roger A. Bringmann; Scott A. Mahlke; Sadun Anik; Tokuzo Kiyohara; Nancy J. Warter; Daniel M. Lavery; Wen-mei W. Hwu; Richard E. Hank; John C. Gyllenhaal

Compilers for superscalar and VLIW processors must expose sufficient instruction-level parallelism in order to achieve high performance. Compile-time code transformations which expose instruction-level parallelism typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism along the frequent execution scenario at the expense of the less frequent execution sequences. Profile information identifies these important execution sequences in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-based transformations have been incorporated into the IMPACT compiler. These transformations include global optimization, acyclic global scheduling, and software pipelining. The effectiveness of these profile-based techniques is evaluated for a range of superscalar and VLIW processors.


international conference on supercomputing | 1992

Tolerating data access latency with register preloading

William Y. Chen; Scott A. Mahlke; Wen-mei W. Hwu; Tokuzo Kiyohara; Pohua P. Chang

By exploiting fine grain parallelism, superscalar processors can potentially increase the performance of future supercomputers. However, supercomputers typically have a long access delay to their first level memory which can severely restrict the performance of superscalar processors. Compilers attempt to move load instructions far enough ahead to hide this latency. However, conventional movement of load instructions is limited by data dependence analysis. This paper introduces a simple hardware scheme, referred to as preload register update, to allow the compiler to move load instructions even in the presence of inconclusive data dependence analysis results. Preload register update keeps the load destination registers coherent when load instructions are moved past store instructions that reference the same location. With this addition, superscalar processors can more effectively tolerate longer data access latencies.
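
The coherence behaviour described above can be sketched as a small simulation: a load that was hoisted above a store keeps its destination register in step with later stores to the same address. This is only an illustration under stated assumptions. The class, its method names, and the rule that the first `use` of a register ends the tracking are hypothetical; the abstract does not say how the hardware decides when to stop updating a register.

```python
class PreloadRegisterFile:
    """Toy model of registers that stay coherent with stores after an early load.

    Hypothetical model for illustration only; not the paper's hardware design.
    """

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value
        self.values = {}       # register name -> current value
        self.watched = {}      # register name -> address still being tracked

    def preload(self, reg, addr):
        """A load moved above possibly conflicting stores."""
        self.values[reg] = self.memory.get(addr, 0)
        self.watched[reg] = addr

    def store(self, addr, value):
        """A store also refreshes every tracked register loaded from addr."""
        self.memory[addr] = value
        for reg, watched_addr in self.watched.items():
            if watched_addr == addr:
                self.values[reg] = value

    def use(self, reg):
        """Consume the register; tracking ends here (an assumption of this sketch)."""
        self.watched.pop(reg, None)
        return self.values[reg]


regs = PreloadRegisterFile({100: 1})
regs.preload("r1", 100)   # load hoisted above the store below
regs.store(100, 7)        # store to the same location updates r1
first_use = regs.use("r1")
regs.store(100, 9)        # r1 is no longer tracked, so it keeps its value
```

The point of the sketch is that the compiler does not need to prove the load and store are independent: if they do alias, the register ends up holding the value the load would have seen in its original position.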


international symposium on microarchitecture | 1992

Code scheduling for VLIW/superscalar processors with limited register files

Tokuzo Kiyohara; John C. Gyllenhaal

Moderate size register files can limit the performance of loop unrolling on multiple issue processors. With current scheduling heuristics, a breadth-first scheduling of iterations occurs, increasing register pressure and generating excessive spill code. A heuristic is proposed that causes a more depth-first scheduling of unrolled iterations. This heuristic reduces the overlapping of the unrolled iterations and as a result, reduces register pressure. The experimental evaluation shows increased performance on processors with 9% or 64 registers. In addition, the performance of dependency removing optimizations is stabilized, so that applying additional optimizations is more likely to increase performance.
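
Why issue order affects register pressure can be seen by counting live values for two orderings of the same unrolled loop. The sketch below is not the heuristic proposed in the paper. It is a hypothetical illustration that ignores latencies and issue width, uses an invented loop body (two loads, an add, a store), and assumes a result may reuse the register of an operand that dies at the same operation.

```python
def iteration_ops(i):
    """One loop iteration as (destination, sources) pairs; names are illustrative."""
    return [
        (f"a{i}", []),                  # a_i = load
        (f"b{i}", []),                  # b_i = load
        (f"c{i}", [f"a{i}", f"b{i}"]),  # c_i = a_i + b_i
        (None, [f"c{i}"]),              # store c_i (defines no value)
    ]


def depth_first(n):
    """Issue each unrolled iteration to completion before starting the next."""
    return [op for i in range(n) for op in iteration_ops(i)]


def breadth_first(n):
    """Issue the same operation of every iteration before the next operation."""
    bodies = [iteration_ops(i) for i in range(n)]
    return [body[pos] for pos in range(len(bodies[0])) for body in bodies]


def peak_live_values(schedule):
    """Maximum number of simultaneously live values for a given issue order."""
    last_use = {}
    for t, (_, sources) in enumerate(schedule):
        for src in sources:
            last_use[src] = t
    live, peak = set(), 0
    for t, (dest, sources) in enumerate(schedule):
        for src in sources:
            if last_use[src] == t:   # operand dies at this operation
                live.discard(src)
        if dest is not None:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


unroll = 4
depth_peak = peak_live_values(depth_first(unroll))
breadth_peak = peak_live_values(breadth_first(unroll))
```

For this body the depth-first order needs two live values regardless of the unroll factor, while the breadth-first order needs two per unrolled iteration, which is the kind of growth that leads to spill code on a limited register file.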


international conference on consumer electronics | 2006

A pixel level parallel processing architecture for multi-standard video codec

T. Tanaka; T. Furuta; H. Nishida; K. Yoshioka; Tokuzo Kiyohara

This paper presents an integrated platform for digital AV products. By exploiting all of the parallelism in pixel-level processes, we developed a parallel processing architecture, which is a core component of the platform. This architecture can execute pixel-level processes for MPEG2, MPEG4, JPEG, and H.264 codecs.


international conference on consumer electronics | 1998

Software control of I/O subsystem on media core processor

Makoto Hirai; Tetsuji Mochida; Tsutomu Hashimoto; Eiko Fujii; Tokuzo Kiyohara

To reduce the cost of a DVD player, we have integrated peripheral functions into the audio and video decoder LSI. The peripheral functions are controlled by I/O processing tasks on a single I/O control processor. The I/O processing tasks are switched in a constant cycle without overhead to achieve real-time performance and the flexibility of software control. In the case of video output handling, software can control the parameters of each line. Therefore, software with line-level hardware support can perform many kinds of functions: image resizing, copy guard, and dynamic blending of the image and graphics.


international symposium on industrial electronics | 1994

Evaluation method of microarchitecture for multithreaded processor

Kozo Kimura; Hiroaki Hirata; Tokuzo Kiyohara; S. Asahara; Takayuki Sagishima; Takao Onoye; Isao Shirakawa

A multithreaded processor is a good approach to increasing performance by utilizing coarse-grain parallelism. The execution of multiple threads in parallel makes performance prediction difficult because of the complicated behavior. Thus, instruction-level simulation is necessary for performance evaluation. In practice, it is very difficult to select the optimum microarchitecture configuration by simulating a wide variety of candidates because of the long simulation time. The paper presents an evaluation method of microarchitecture for multithreaded processors. The method consists of three steps: first, the characteristics of the application are analysed; second, candidate microarchitectures are selected in consideration of those characteristics; last, the selected architectures are evaluated through instruction-level simulation using a practical application program. The experimental results using a computer graphics application show that the proposed evaluation method is very effective for increasing the performance of multithreaded processors.


Proceedings of the Fifth TRON Project Symposium on TRON Project 1988: open-architecture computer systems | 1989

Design considerations of the Matsushita 32-bit microprocessor for real-memory systems

Tokuzo Kiyohara; Takashi Sakao; Kozo Adachi; Osamu Nishijima

A high-speed, high-performance 32-bit microprocessor for real-memory systems is described that is now being developed based upon TRONCHIP specifications. The principal applications of this processor are as a high-performance controller for control devices, communications, networks, graphics, etc.; as a high-performance specific LSI core; and as a CPU for personal computers, word processors, etc. It has been designed to achieve a performance of 8 MIPS (million instructions per second) at a 20 MHz clock rate within a chip size that makes future ASIC development possible. As a result, design is now progressing toward one-clock execution of register-to-register instructions, faster load and store instructions, a reduced number of pipeline stages (4 stages and a store buffer), and incorporation of the instruction cache.
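
As a rough check on the stated target, the implied average number of clock cycles per instruction (CPI) can be worked out directly. This assumes that MIPS here counts native instructions averaged over a workload, which the abstract does not specify.

```latex
\[
\mathrm{CPI}_{\text{avg}}
  = \frac{f_{\text{clock}}}{\text{instruction rate}}
  = \frac{20 \times 10^{6}\ \text{cycles/s}}{8 \times 10^{6}\ \text{instructions/s}}
  = 2.5\ \text{cycles per instruction}
\]
```

Under that assumption, the target is consistent with one-clock register-to-register instructions being mixed with slower memory and control instructions in the average.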


international symposium on circuits and systems | 1994

Multithreaded processor for image generation

Takayuki Sagishima; Kozo Kimura; Hiroaki Hirata; Tokuzo Kiyohara; Shigeo Asahara; Takao Onoye; Isao Shirakawa

Multiple instruction execution is a major approach to designing high-performance processors. Superscalar and VLIW processors that utilize instruction-level parallelism are usually focused on. On the other hand, the multithreaded processor can be expected to achieve a high degree of multiple instruction execution by utilizing coarse-grain parallelism. Many computer graphics applications (such as the radiosity method and ray-tracing method) can be optimized by reorganizing the code to take advantage of coarse-grain parallelism, but the degree of instruction-level parallelism is not sufficient for a superscalar processor. Experimental results using the radiosity method show that the 4-thread multithreaded processor achieves 2.9 times speedup over a single thread, while the 4-issue superscalar processor manages around 1.5 times. By duplicating two kinds of function units, the performance of a multithreaded processor increases to 3.7 times, but the performance of a superscalar processor is saturated at around 1.5 times. Therefore, for computer graphics applications, the multithreaded processor is a better approach than the superscalar processor.
