Kouhei Nadehara
NEC
Publication
Featured research published by Kouhei Nadehara.
international symposium on microarchitecture | 1998
Kazumasa Suzuki; Tomohisa Arai; Kouhei Nadehara; Ichiro Kuroda
The V830R/AV's real-time decoding of MPEG-2 video and audio data enables practical embedded-processor-based multimedia systems.
international symposium on microarchitecture | 1995
Kouhei Nadehara; Ichiro Kuroda; Masayuki Daito; Takashi Nakayama
Battery-powered multimedia systems challenge designers to pack enormous signal-processing power into a low-power chip. The V830 chip achieves this by combining a special instruction set and fast, 32-bit parallel multiply-adder with internal RAM for quicker memory accesses. This gives the newest entry in the V800 series signal-processing capabilities as fast as the latest fixed-point DSPs.
signal processing systems | 2004
Kouhei Nadehara; Masao Ikekawa; Ichiro Kuroda
In this paper, extended instructions for the advanced encryption standard (AES) cryptography acceleration in embedded processors and efficient implementation of these instructions are presented. These AES instructions generate four elements in single-instruction, multiple-data format from each input of an AES state. The instruction count for 128-bit key AES encryption can be reduced from 688 to 340 per 128-bit block by using the proposed AES instructions. The execution unit for the AES instructions can be implemented efficiently with a single 2-Kbit table and four small multipliers. The capacity of the table has been reduced to 1/32, compared to that of a conventional fast software algorithm. The AES instructions enable embedded processors for low-cost network equipment to have cryptographic capability with minimal modification.
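The abstract does not spell out how the instructions are defined, so the following is only a sketch of one reading that is consistent with "a single 2-Kbit table and four small multipliers": a 256-entry, 8-bit S-box lookup (256 x 8 bits = 2 Kbit) followed by GF(2^8) multiplications by the MixColumns coefficients 02, 01, 01 and 03. The function names `gf_mul`, `build_sbox` and `four_elements` are hypothetical and are not taken from the paper.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def rotl8(value, count):
    """Rotate an 8-bit value left by count bits."""
    return ((value << count) | (value >> (8 - count))) & 0xFF


def build_sbox():
    """Build the AES S-box: GF(2^8) inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):          # x^254 is the inverse of x; 0 maps to 0
            inv = gf_mul(inv, x)
        s = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
        sbox.append(s)
    return sbox


SBOX = build_sbox()


def four_elements(state_byte):
    """Assumed behaviour: derive four MixColumns contributions from one state byte."""
    s = SBOX[state_byte]
    return (gf_mul(s, 2), s, s, gf_mul(s, 3))


print([hex(v) for v in four_elements(0x00)])
```

Under this reading, one call replaces a lookup into a 32-bit-wide precomputed table of the kind used by fast software AES, which is where the table-capacity saving would come from.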
international conference on acoustics speech and signal processing | 1999
Kouhei Nadehara; Takashi Miyazaki; Ichiro Kuroda
A fast radix-4 complex FFT implementation using 4-parallel SIMD instructions is presented. Four radix-4 butterflies are calculated in parallel at all stages by loading 4 consecutive elements into a register. At the last stage, every 4 elements are packed into a register and calculated in parallel. This regular data flow enables higher parallelism and a reduction in data format conversion overhead. The implementation on the V830R processor, which has a 4-parallel SIMD-type multimedia instruction set, achieves practical performance quite competitive with high-end parallel DSPs. Multiply-accumulate instructions with symmetrical rounding, introduced in the V830R processor, are effective in maintaining FFT accuracy.
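As a rough illustration of computing four radix-4 butterflies in parallel, the sketch below treats each Python list of four complex values as one SIMD register and applies the same butterfly to every lane. It is a floating-point model only: twiddle-factor multiplication, fixed-point scaling and the V830R instruction set are left out, and the names `radix4_butterflies` and `dft4` are hypothetical.

```python
import cmath


def radix4_butterflies(a, b, c, d):
    """Apply one radix-4 butterfly per lane; a, b, c, d model four registers."""
    out0, out1, out2, out3 = [], [], [], []
    for x0, x1, x2, x3 in zip(a, b, c, d):
        s02, d02 = x0 + x2, x0 - x2
        s13, d13 = x1 + x3, x1 - x3
        out0.append(s02 + s13)          # X[0]
        out1.append(d02 - 1j * d13)     # X[1]
        out2.append(s02 - s13)          # X[2]
        out3.append(d02 + 1j * d13)     # X[3]
    return out0, out1, out2, out3


def dft4(x):
    """Reference 4-point DFT computed directly from the definition."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]


# Four lanes, each lane holding one independent 4-point input (x0, x1, x2, x3).
a = [1 + 0j, 2 + 1j, 0 - 1j, 3 + 3j]
b = [0 + 1j, 1 + 0j, 2 + 2j, -1 + 0j]
c = [2 - 1j, 0 + 0j, 1 + 1j, 0 + 2j]
d = [-1 + 0j, 4 - 2j, 0 + 3j, 1 - 1j]
outputs = radix4_butterflies(a, b, c, d)
print(outputs[1])
```

Because every lane runs the same add and subtract sequence, the loop body maps naturally onto lane-wise SIMD operations with no data-dependent control flow.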
IEEE Journal of Solid-state Circuits | 1999
Kazumasa Suzuki; Masayuki Daito; Tomoo Inoue; Kouhei Nadehara; Masahiro Nomura; Masayuki Mizuno; Tomofumi Iima; Shoichiro Sato; Terumi Fukuda; Tomohisa Arai; Ichiro Kuroda; Masakazu Yamashina
We have developed a 0.25-µm, 200-MHz embedded RISC processor for multimedia applications. This processor has a dual-issue superscalar datapath that consists of a 32-bit integer unit and a 64-bit single-instruction multiple-data (SIMD) function unit that together have a total of five multiply-adders. An on-chip concurrent Rambus DRAM (C-RDRAM) controller uses interleaved transactions to increase the memory bandwidth of the Rambus channel to 533 Mb/s. The controller also reduces latency by using transaction interleaving and instruction prefetching. A 64-bit, 200-MHz internal bus transfers data among the CPU core, the C-RDRAM, and the peripherals. These high-data-rate channels improve CPU performance because they eliminate a bottleneck in the data supply. The datapath part of this chip was designed using a functional macrocell library that included placement information for leaf cells, which resulted in the SIMD function unit of this chip having 68000 transistors per square millimeter.
signal processing systems | 1998
Ichiro Kuroda; E. Murata; Kouhei Nadehara; Kazumasa Suzuki; T. Arai; A. Okamura
This paper presents a parallel MAC (multiply-accumulation) architecture designed for DSP applications on a 200-MHz, 1.6-GOPS multimedia RISC processor. The datapath architecture of the processor is designed to realize parallel execution of a data transfer and SIMD parallel arithmetic operations. SIMD parallel 16-bit MAC instructions are introduced with a symmetric rounding scheme which maximizes the accuracy of the 18-bit accumulation. This parallel 16-bit MAC instruction on a 64-bit datapath is shown to be efficiently utilized for DSP applications such as convolution in the multimedia RISC processor. By using the parallel MAC instruction with the symmetric rounding scheme, the two-dimensional inverse discrete cosine transform (2D-IDCT) which satisfies IEEE 1180 can be implemented in 202 cycles.
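The abstract does not define its symmetric rounding scheme. Assuming it means rounding that treats positive and negative values as mirror images (round half away from zero), the sketch below contrasts it with the common add-half-then-shift rounding, which is biased toward positive infinity. Both function names are hypothetical.

```python
def round_half_up_shift(value, shift):
    """Add half an LSB, then arithmetic shift right: ties go toward +infinity."""
    return (value + (1 << (shift - 1))) >> shift


def round_symmetric_shift(value, shift):
    """Assumed symmetric scheme: round the magnitude, then restore the sign."""
    half = 1 << (shift - 1)
    if value >= 0:
        return (value + half) >> shift
    return -((-value + half) >> shift)


# A tie case: -3 / 2 = -1.5 rounds to -1 with the biased scheme, -2 when symmetric.
print(round_half_up_shift(-3, 1), round_symmetric_shift(-3, 1))
```

With the symmetric variant, rounding errors on positive and negative products cancel on average, which matters when many rounded terms are accumulated, as in an IDCT.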
VLSI Signal Processing, IX | 1996
Kouhei Nadehara; H.J. Stolberg; Masao Ikekawa; E. Murata; I. Kuroda
This paper presents a real-time MPEG-1 video decoder implemented in software on a DSP-enhanced, 160-mW, 100-MHz, 32-bit microprocessor. The processor's DSP-oriented instructions improve the performance of generic DSP operations such as the inverse discrete cosine transform, while fast software algorithms that perform parallel operations on packed-pixel data are developed for processes unique to video decoding, such as motion compensation. Furthermore, to reduce the clock count as well as the instruction count, load/store scheduling and cache miss reduction are performed. In total, the processor can achieve 30 frames/sec MPEG-1 video decoding at a cost and power dissipation (160 mW) comparable to dedicated LSIs.
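One well-known way to operate on packed-pixel data with ordinary 32-bit integer instructions is to average four 8-bit pixels at once, as needed for the rounded averages in half-pixel motion compensation. The sketch below shows that general trick; it is not claimed to be the algorithm used in the paper, and the helper names are hypothetical.

```python
def pack(pixels):
    """Pack four 8-bit pixels into one 32-bit word (pixel 0 in the low byte)."""
    return sum((p & 0xFF) << (8 * i) for i, p in enumerate(pixels))


def unpack(word):
    """Split a 32-bit word back into four 8-bit pixels."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def avg4_round_up(a, b):
    """Per-byte (x + y + 1) >> 1 on two packed words, without unpacking.

    Uses x + y = 2 * (x | y) - (x ^ y). The 0xFEFEFEFE mask stops the
    shifted bits of one pixel from leaking into its neighbour.
    """
    return (a | b) - (((a ^ b) & 0xFEFEFEFE) >> 1)


p = [0, 255, 17, 128]
q = [1, 255, 200, 127]
print(unpack(avg4_round_up(pack(p), pack(q))))
```

The subtraction never borrows across byte boundaries, because each byte of the shifted XOR term is no larger than the matching byte of the OR term.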
signal processing systems | 1999
Yuichiro Takamizawa; Kouhei Nadehara; Max Boegli; Masao Ikekawa; Ichiro Kuroda
Presented here is MPEG-2 AAC low complexity profile decoder software for a low-power embedded RISC microprocessor, NEC V830 (300 mW @133 MHz). Fast processing methods for IMDCT reduce execution time by 41% and help achieve real-time decoding of a 5.1-channel audio signal, while using only 64.7% of processor capacity.
international conference on multimedia and expo | 2001
Masao Ikekawa; Masatsugu Hori; Kouhei Nadehara; Takahiro Kumura; Makoto Yoshida; Ichiro Kuroda; Takao Nishitani
This paper describes an efficient architecture enhancement for video codecs on a new-generation, general-purpose digital signal processor (DSP) core called SPXK5, developed for handheld devices. With the high-performance features of SPXK5's base architecture, an MPEG-4 video codec can be implemented efficiently. In addition, only a few SIMD-type instructions effectively accelerate the MPEG-4 video codec implementation by 20% with only a 2.5% hardware increase. By reducing cycle count, the DSP's power consumption can be reduced. Both video and speech codecs for 3G mobile service at 384 kbps can be realized with a power consumption of less than 50 mW.
Archive | 2004
Kouhei Nadehara