Kazunori Saito | Researchain

Publication

Featured research published by Kazunori Saito.

International Solid-State Circuits Conference | 2007

The Design and Implementation of the Massively Parallel Processor Based on the Matrix Architecture

Hideyuki Noda; Masami Nakajima; Katsumi Dosaka; Kiyoshi Nakata; Motoki Higashida; Osamu Yamamoto; Katsuya Mizumoto; Tetsushi Tanizaki; Takayuki Gyohten; Yoshihiro Okuno; Hiroyuki Kondo; Yukihiko Shimazu; Kazutami Arimoto; Kazunori Saito; Toru Shimizu

This paper describes the design and implementation of a massively parallel processor based on the matrix architecture, which is suitable for portable multimedia applications. The proposed architecture achieves a high performance of 40 GOPS for consecutive fixed-point 16-bit additions at a 200 MHz clock frequency, with a low power dissipation of 250 mW. In addition, a 1 Mbit SRAM for data registers and 2,048 2-bit-grained processing elements connected by a flexible switching network are integrated in a small area of 3.1 mm² in 90 nm CMOS low-standby technology. The design techniques and architectures described in this paper are attractive for realizing area-efficient, energy-efficient, and high-performance multimedia processors.

International Symposium on Circuits and Systems | 2005

CAM-based VLSI architecture for Huffman coding with real-time optimization of the code word table [image coding example]

Takeshi Kumaki; Yasuto Kuroda; Tetsushi Koide; Hans Jürgen Mattausch; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

Huffman coding is probably the best known and most widely used data compression technique. Nevertheless, further reducing the compressed data size by updating the Huffman code in real time is still a largely unsolved problem. In this paper, a novel architecture for CAM (content addressable memory)-based Huffman coding with real-time optimization of the code word table, called CHRC, is proposed. A CAM is exploited to implement fast Huffman encoding, and simultaneously the code word table is reconstructed and updated in real time. The effectiveness of the proposed architecture is verified through its structure, encoding flow, and simulation results. The example of a JPEG application shows that the proposed CHRC method achieves up to 40% smaller encoded picture sizes and a 6 times smaller clock cycle count for the encoding hardware than conventional Huffman coding methods.

IEICE Transactions on Information and Systems | 2007

Acceleration of DCT Processing with Massive-Parallel Memory-Embedded SIMD Matrix Processor

Takeshi Kumaki; Masakatsu Ishizaki; Tetsushi Koide; Hans Jürgen Mattausch; Yasuto Kuroda; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper reports an efficient Discrete Cosine Transform (DCT) processing method for images using a massive-parallel memory-embedded SIMD matrix processor. The matrix-processing engine has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. For compatibility with this matrix-processing architecture, the arithmetic order of the conventional DCT algorithm has been improved and the vertical/horizontal-space one-dimensional (1D) DCT processing has been further developed. Evaluation results of the matrix-engine-based DCT processing show that the necessary clock cycles per image block can be reduced by 87% in comparison to a conventional DSP architecture. The determined performance figures in MOPS and MOPS/mm² are better than those of a conventional DSP by factors of 8 and 5.6, respectively.
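The "bit-serial and word-parallel" operation mode mentioned above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the processor's actual datapath: the function name `bit_serial_add`, the list-based lanes, and the carry handling are hypothetical, and the step count it returns belongs to this model, not to the chip's cycle count.

```python
def bit_serial_add(a_words, b_words, width=16, grain=2):
    """Add two equal-length lists of unsigned integers, `grain` bits per step.

    Each step applies the same operation to every lane (word-parallel),
    while each lane walks through its operands slice by slice (bit-serial).
    Results wrap around modulo 2**width.
    """
    assert len(a_words) == len(b_words)
    mask = (1 << grain) - 1
    lanes = len(a_words)
    carry = [0] * lanes
    result = [0] * lanes
    steps = 0
    for shift in range(0, width, grain):
        # One "command": all lanes process the same slice position.
        for i in range(lanes):
            a_slice = (a_words[i] >> shift) & mask
            b_slice = (b_words[i] >> shift) & mask
            s = a_slice + b_slice + carry[i]
            result[i] |= (s & mask) << shift
            carry[i] = s >> grain  # carry into the next slice
        steps += 1
    return result, steps


sums, steps = bit_serial_add([1, 65535, 1234], [2, 1, 4321])
print(sums, steps)
```

In this model a 16-bit addition with a 2-bit grain takes eight slice steps regardless of how many lanes there are, which is the property that makes a wide array of narrow processing elements attractive.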

IEICE Transactions on Electronics | 2008

Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor

Takeshi Kumaki; Masakatsu Ishizaki; Tetsushi Koide; Hans Jürgen Mattausch; Yasuto Kuroda; Takayuki Gyohten; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper presents an architecture that integrates content addressable memory (CAM) with a massive-parallel memory-embedded SIMD matrix to construct a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be well suited to the repeated arithmetic operations found in multimedia applications. The proposed architecture additionally exploits CAM technology and therefore enables fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed architecture can realize efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the frequently used JPEG image-compression application show that the necessary clock cycle count can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performance in Mpixel/mm² is better than that of a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP by factors of 3.3 and 4.4, respectively.

IEICE Transactions on Information and Systems | 2007

Real-Time Huffman Encoder with Pipelined CAM-Based Data Path and Code-Word-Table Optimizer

Takeshi Kumaki; Yasuto Kuroda; Masakatsu Ishizaki; Tetsushi Koide; Hans Jürgen Mattausch; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper presents a novel optimized real-time Huffman encoder using a pipelined data path based on CAM technology and a parallel code-word-table optimizer. The exploitation of CAM technology enables fast parallel search of the code word table. At the same time, the code word table is optimized according to the frequency of received input symbols and is updated in real time. Since these two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. Evaluation results for the JPEG application show that the proposed architecture can achieve up to 28% smaller encoded picture sizes than conventional architectures. The encoding time can be reduced by 95% in comparison to a conventional SRAM-based architecture, which makes the encoder suitable even for the latest end-user devices requiring high frame rates. Furthermore, the proposed architecture provides the only encoder that can simultaneously realize small compressed data size and fast processing speed.

Midwest Symposium on Circuits and Systems | 2005

Multi-port CAM based VLSI architecture for Huffman coding with real-time optimized code word table

Takeshi Kumaki; Yasuto Kuroda; Tetsushi Koide; H. Jurgen Mattausch; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper presents a multi-port CAM-based VLSI architecture for Huffman coding with a real-time optimized code word table as a novel architecture for high-speed parallel Huffman coding. The multi-port CAM technology exploited is the FMCAM (flexible multi-port content addressable memory) architecture (Kumaki et al., 2004), which enables fast parallel Huffman encoding. At the same time, the code word table is reconstructed according to the frequency of received input symbols and is updated in real time. Since the two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. The simulation results for the JPEG application show that the proposed architecture can achieve up to 20% smaller encoded picture sizes and a four times lower clock cycle count for the encoding hardware (8-port case) in comparison to conventional fast Huffman coding architectures.

International Symposium on Circuits and Systems | 2007

Efficient Vertical/Horizontal-Space 1D-DCT Processing Based on Massive-Parallel Matrix-Processing Engine

Takeshi Kumaki; Tetsushi Koide; Hans Jürgen Mattausch; Yasuto Kuroda; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper reports an efficient discrete cosine transform (DCT) processing method for the JPEG algorithm using a massive-parallel memory-embedded SIMD matrix processor. The matrix-processing engine has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. For compatibility with this matrix-processing architecture, the arithmetic order of the conventional DCT algorithm has been improved and the vertical/horizontal-space one-dimensional (1D) DCT processing has been further developed. Evaluation results of the matrix-engine-based DCT processing show that the necessary clock cycles per image block can be reduced by 87% in comparison to a conventional DSP architecture. The determined performance figures in MOPS and MOPS/mm² are better than those of a conventional DSP by factors of 8 and 5.6, respectively. Moreover, the matrix-processing engine can reduce the total number of clock cycles for the JPEG application by about 49% in comparison to a conventional DSP architecture.
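The idea of building a block DCT from horizontal and vertical 1D passes can be sketched in a few lines. The code below assumes the standard orthonormal DCT-II and the usual row-column decomposition; the function names are hypothetical, and the paper's reordered arithmetic and its mapping onto the processing elements are not reproduced here.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * acc)
    return out


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """2D DCT of a square block as a horizontal pass, then a vertical pass."""
    rows = [dct_1d(r) for r in block]             # horizontal 1D DCTs
    cols = [dct_1d(c) for c in transpose(rows)]   # vertical 1D DCTs
    return transpose(cols)


flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

Each pass consists of independent 1D transforms over rows or columns, so all of them can in principle run at the same time, which is why this decomposition suits a wide SIMD array.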

IEEE Region 10 Conference | 2006

Huffman Encoding Architecture with Self-Optimizing Performance and Multiple CAM-Match Utilization

Masakatsu Ishizaki; Takeshi Kumaki; Y. Kouno; Tetsushi Koide; Hans Jürgen Mattausch; Yasuto Kuroda; T. Gyoten; Hideyuki Noda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

This paper presents a method for achieving high speed and a high compression ratio in Huffman encoding by updating and optimizing the code word table. A shadow code word table is continuously reconstructed according to the frequency distribution of the currently encoded symbols and is used to replace the active code word table in real time if the compression ratio degrades. Multiple matches in a content addressable memory (CAM) are additionally exploited to raise the encoding speed to meet real-time requirements. A higher compression ratio can be obtained by optimizing the update timing. This paper also uses simulation to estimate the best method for updating the code word table. As a result, the compressed data size is up to 22.6% smaller than with the conventional Huffman encoding architecture, which uses a standard encoding table.
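The shadow-table scheme described above can be modelled in software as follows. This is a minimal sketch under several assumptions that are not taken from the paper: the window size, the swap criterion (the shadow table would have encoded the last window in fewer bits), and all function names are hypothetical, a Python dictionary stands in for the CAM lookup, and the signalling a decoder would need at each swap is omitted.

```python
import heapq
from collections import Counter


def build_code_table(freqs):
    """Return {symbol: bitstring} for a Huffman code over `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: partial_code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode_with_shadow_table(symbols, alphabet, window=64):
    """Encode `symbols`, swapping in the shadow table when it does better.

    Returns (bitstring, number_of_swaps).
    """
    active = build_code_table({s: 1 for s in alphabet})  # flat starting table
    counts = Counter({s: 1 for s in alphabet})  # keeps every symbol encodable
    out, swaps, block = [], 0, []
    for sym in symbols:
        out.append(active[sym])   # lookup; a CAM search in hardware
        counts[sym] += 1
        block.append(sym)
        if len(block) == window:
            shadow = build_code_table(counts)
            active_bits = sum(len(active[s]) for s in block)
            shadow_bits = sum(len(shadow[s]) for s in block)
            if shadow_bits < active_bits:  # active table has degraded
                active, swaps = shadow, swaps + 1
            block = []
    return "".join(out), swaps


data = "a" * 200 + "b" * 20 + "c" * 5 + "d" * 5
bits, swaps = encode_with_shadow_table(data, "abcd")
print(len(bits), swaps)
```

With a skewed input such as the one above, the flat starting table is replaced after the first window, and the total output is shorter than a fixed 2-bit code would produce.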

Midwest Symposium on Circuits and Systems | 2007

CAM enhanced super parallel SIMD processor with high-speed pattern matching capability

Takeshi Kumaki; Yutaka Kono; Masakatsu Ishizaki; Masaharu Tagami; Tetsushi Koide; Hans Jürgen Mattausch; Takayuki Gyohten; Hideyuki Noda; Yasuto Kuroda; Katsumi Dosaka; Kazutami Arimoto; Kazunori Saito

A super parallel SIMD processor has been proposed as a novel SIMD multimedia processor that is well suited to processing several types of multimedia data. This processor supports 2,048-way bit-serial and word-parallel operation. Moreover, its 2,048 processing elements can synchronize with a single command and process all stored data, thus achieving highly parallel processing with low power consumption. In this paper, to further improve the processing efficiency for multimedia data, a content addressable memory (CAM) is embedded in the super parallel SIMD processor. For a JPEG application, the original super parallel SIMD processor reduces the average number of clock cycles by about 55% compared with a conventional DSP. Furthermore, the clock cycle count with the CAM-enhanced super parallel SIMD processor is 90% smaller than with the original super parallel SIMD processor. Thus, the proposed architecture is very suitable for multimedia data processing.

Archive | 2005