Publication


Featured research published by Hiroki Nakahara.


Field Programmable Logic and Applications | 2015

A deep convolutional neural network based on nested residue number system

Hiroki Nakahara; Tsutomu Sasao

A pre-trained deep convolutional neural network (DCNN) is a feed-forward computation that is widely used in embedded vision systems. In a DCNN, the 2D convolution operation occupies more than 90% of the computation time. Since the 2D convolution performs massive multiply-accumulate (MAC) operations, conventional realizations cannot implement a fully parallel DCNN. The residue number system (RNS) decomposes an integer into a tuple of L integers by taking its residues with respect to a moduli set. Since the moduli are pairwise coprime, the conventional RNS decomposes the MAC unit into circuits with different sizes, which means that the RNS cannot utilize FPGA resources of a uniform size. In this paper, we propose the nested RNS (NRNS), which recursively decomposes the RNS. It can decompose the MAC unit into circuits with small sizes. In the DCNN using the NRNS, a 48-bit MAC unit is decomposed into 4-bit ones realized by look-up tables of the FPGA. In the system, we also use binary-to-NRNS converters and NRNS-to-binary converters. The binary-to-NRNS converter is realized by on-chip BRAMs, while the NRNS-to-binary one is realized by DSP blocks and BRAMs. This balanced usage of FPGA resources leads to a high clock frequency with less hardware. The ImageNet DCNN using the NRNS is implemented on a Xilinx Virtex VC707 evaluation board. As for the performance per area in GOPS (giga operations per second) per slice, the proposed one is 5.86 times better than the existing best realization.
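
For intuition, the following is a minimal Python sketch of the basic RNS idea the NRNS builds on: a wide multiply-accumulate is split into independent narrow MACs over pairwise-coprime moduli and recombined with the Chinese Remainder Theorem. The moduli, operand widths, and the omission of the recursive (nested) decomposition are illustrative simplifications, not the paper's actual design.

```python
# Minimal sketch of RNS-based multiply-accumulate (illustrative moduli, not the
# paper's actual parameter set). Each wide MAC is split into independent,
# narrow MACs over pairwise-coprime moduli and recombined with the CRT.
from functools import reduce

MODULI = [7, 11, 13, 15]          # pairwise coprime; dynamic range = 7*11*13*15

def to_rns(x, moduli=MODULI):
    """Decompose an integer into its tuple of residues."""
    return [x % m for m in moduli]

def rns_mac(acc_rns, a, b, moduli=MODULI):
    """Accumulate a*b channel-wise: each channel only needs a small multiplier."""
    return [(acc + (a % m) * (b % m)) % m for acc, m in zip(acc_rns, moduli)]

def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction back to binary."""
    M = reduce(lambda x, y: x * y, moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return total % M

# Example: accumulate a few products and check against plain integer arithmetic.
acc = to_rns(0)
pairs = [(23, 45), (17, 9), (31, 12)]
for a, b in pairs:
    acc = rns_mac(acc, a, b)
assert from_rns(acc) == sum(a * b for a, b in pairs)
```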


Symposium on VLSI Circuits | 2017

BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS

Kota Ando; Kodai Ueyoshi; Kentaro Orimo; Haruyoshi Yonekawa; Shimpei Sato; Hiroki Nakahara; Masayuki Ikebe; Tetsuya Asai; Shinya Takamaeda-Yamazaki; Tadahiro Kuroda; Masato Motomura

A versatile reconfigurable accelerator for binary/ternary deep neural networks (DNNs) is presented. It features a massively parallel in-memory processing architecture and stores a variety of binary/ternary DNNs with a maximum of 13 layers, 4.2 K neurons, and 0.8 M synapses on chip. The 0.6 W, 1.4 TOPS chip achieves performance and energy efficiency that are 10–10^2 and 10^2–10^4 times better than those of a CPU/GPU/FPGA.
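
As a rough illustration of the arithmetic such a binary/ternary accelerator performs (not the chip's actual in-memory dataflow), the sketch below shows that a binary dot product reduces to an XNOR and a popcount, while a ternary one only adds, subtracts, or skips inputs, so neither needs a multiplier.

```python
# Illustrative binary/ternary dot products (not the chip's actual in-memory
# dataflow): binary weights are +1/-1, ternary weights add a 0 that simply
# skips the input, so no multiplier is needed in either case.
def binary_dot(x_bits, w_bits):
    """x_bits, w_bits: lists of 0/1 encoding -1/+1. XNOR + popcount."""
    matches = sum(1 for x, w in zip(x_bits, w_bits) if x == w)   # XNOR popcount
    return 2 * matches - len(x_bits)                             # map back to a +/-1 sum

def ternary_dot(x_vals, w_vals):
    """w_vals in {-1, 0, +1}: zero weights are skipped, others add or subtract."""
    return sum(x if w > 0 else -x for x, w in zip(x_vals, w_vals) if w != 0)

print(binary_dot([1, 0, 1, 1], [1, 1, 1, 0]))     # -> 0
print(ternary_dot([3, -2, 5, 1], [1, 0, -1, 1]))  # -> -1
```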


International Parallel and Distributed Processing Symposium | 2017

On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA

Haruyoshi Yonekawa; Hiroki Nakahara

A pre-trained convolutional deep neural network (CNN) is a feed-forward computation that is widely used in embedded systems, which require high power and area efficiency. This paper proposes a binarized CNN on an FPGA which treats only two binary values (+1/-1) for the inputs and the weights. In this case, the multiplier is replaced by an XNOR circuit instead of a dedicated DSP block, so using binarized inputs and weights is well suited to hardware implementation. However, the binarized CNN requires batch normalization to retain the classification accuracy. In that case, the additional multiplication and addition require extra hardware, and the memory accesses for the batch normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free binarized CNN which is mathematically equivalent to one using batch normalization. The proposed CNN treats the binarized inputs and weights with an integer bias. We implemented the VGG-16 benchmark CNN on the Xilinx Inc. Zynq UltraScale+ MPSoC zcu102 evaluation board. Our binarized CNN stores all the weights, inputs, and outputs in on-chip BRAMs, which are faster and dissipate less power than an off-chip memory such as a DDR4 SDRAM. Compared with the conventional FPGA realizations, although the classification accuracy drops by 6.5%, the performance is 2.45 times faster, the power efficiency is slightly better, and the area efficiency is 2.68 times better. Compared with the ARM Cortex-A57, it is 136.8 times faster, it dissipates 3.1 times more power, and its performance per power is 44.7 times better. Also, compared with the Maxwell embedded GPU, it is 4.9 times faster, it dissipates 1.3 times more power, and its performance per power is 3.8 times better. Thus, our method is suitable for embedded computer systems.
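
The following is a minimal sketch of the general idea behind removing batch normalization from a binarized layer: because the sign activation only cares about the comparison against zero, the per-channel BN affine transform can be folded into a single integer threshold computed offline. The parameter values and the exact folding shown are illustrative assumptions, not the paper's derivation.

```python
# Minimal sketch of folding batch normalization into an integer threshold
# before the sign activation (illustrative, not the paper's exact derivation).
# With BN:      y = sign( gamma * (s - mu) / sigma + beta ),  s = binary conv sum
# Equivalently: y = +1  iff  s >= T, where T is a per-channel constant that can
# be rounded to an integer, so no BN multiply/add is needed at inference time.
import math

def fold_bn_to_threshold(gamma, beta, mu, sigma):
    """Return the integer threshold T equivalent to BN followed by sign()."""
    # gamma*(s - mu)/sigma + beta >= 0  <=>  s >= mu - beta*sigma/gamma  (gamma > 0)
    t = mu - beta * sigma / gamma
    return math.ceil(t)          # conv sums are integers, so an integer T suffices

def bn_then_sign(s, gamma, beta, mu, sigma):
    return 1 if gamma * (s - mu) / sigma + beta >= 0 else -1

def threshold_sign(s, t):
    return 1 if s >= t else -1

# Check equivalence on a range of integer convolution sums.
gamma, beta, mu, sigma = 0.8, -0.3, 4.0, 2.5
T = fold_bn_to_threshold(gamma, beta, mu, sigma)
assert all(bn_then_sign(s, gamma, beta, mu, sigma) == threshold_sign(s, T)
           for s in range(-64, 65))
```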


Field Programmable Gate Arrays | 2017

A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA (Abstract Only)

Hiroki Nakahara; Haruyoshi Yonekawa; Hisashi Iwamoto; Masato Motomura

A pre-trained convolutional deep neural network (CNN) is a feed-forward computation that is widely used in embedded systems, which require high power and area efficiency. This paper realizes a binarized CNN which treats only two binary values (+1/-1) for the inputs and the weights. In this case, the multiplier is replaced by an XNOR circuit instead of a dedicated DSP block, so using binarized inputs and weights is well suited to hardware implementation. However, the binarized CNN requires batch normalization to retain the classification accuracy. In that case, the additional multiplication and addition require extra hardware, and the memory accesses for the batch normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free CNN which is mathematically equivalent to the CNN using batch normalization. The proposed CNN treats the binarized inputs and weights with an integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME FPGA board, which has a Xilinx Inc. Virtex 7 FPGA and three off-chip QDR II+ synchronous SRAMs. Compared with the conventional FPGA realizations, although the classification error rate is 6.5% worse, the performance is 2.82 times faster, the power efficiency is 1.76 times lower, and the area efficiency is 11.03 times smaller. Thus, our method is suitable for embedded computer systems.


International Symposium on Multiple-Valued Logic | 2015

An RNS FFT Circuit Using LUT Cascades Based on a Modulo EVMDD

Hiroki Nakahara; Tsutomu Sasao; Hiroyuki Nakanishi; Kazumasa Iwai

This paper proposes an FFT circuit based on a residue number system (RNS) using LUT cascades. To reduce the number of look-up tables (LUTs) in an FPGA, we use two techniques. The first is the functional decomposition of multipliers using the RNS. The second is the increase of the dynamic range stage by stage. The circuit requires an RNS2RNS converter which converts a small dynamic range into a large dynamic range. To realize the RNS2RNS converter compactly, we decompose it into an RNS2Binary converter and a Binary2RNS converter. Although the Binary2RNS converter can be realized by an LUT cascade based on a multi-terminal multi-valued decision diagram (MTMDD), the RNS2Binary converter tends to be large in the conventional circuit. Thus, we introduce an LUT cascade based on a modulo edge-valued multi-valued decision diagram (mod-EVMDD). The mod-EVMDD is a new type of decision diagram that efficiently represents the RNS2Binary converter. We implemented the proposed RNS FFT on a Xilinx Corp. Virtex 6 FPGA. Compared with the conventional binary FFT implementation, although the number of block RAMs (BRAMs) increased by 11.1-25.0%, the number of LUTs decreased by 44.2-52.2% and the maximum clock frequency increased by 9.3-41.7%. With this technique, we successfully implemented a required FFT on an available FPGA, since the excessive number of LUTs had been the bottleneck of the binary FFT.
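
To illustrate why a cascade of small memories suffices for such converters, the sketch below evaluates a binary-to-RNS conversion digit by digit: each stage is a small table indexed by the previous partial residue and the next input digit. The 4-bit digit width and the modulus are arbitrary choices for the example, not the parameters of the paper's FFT.

```python
# Sketch of a binary-to-RNS conversion evaluated as a cascade of small table
# lookups (illustrative digit width and modulus, not the paper's FFT design).
# The input is split into 4-bit digits; each cascade stage only needs a table
# indexed by (previous partial residue, next digit), which is how an LUT
# cascade keeps every memory small.
DIGIT_BITS = 4
MOD = 13

def build_stage_tables(num_digits, mod=MOD, digit_bits=DIGIT_BITS):
    """One table per digit position: (partial_residue, digit) -> new residue."""
    tables = []
    for i in range(num_digits):
        weight = pow(2, digit_bits * i, mod)            # 2^(4i) mod m
        table = {(r, d): (r + d * weight) % mod
                 for r in range(mod) for d in range(1 << digit_bits)}
        tables.append(table)
    return tables

def binary_to_rns(x, tables, digit_bits=DIGIT_BITS):
    """Feed the digits of x through the cascade, one small lookup per stage."""
    residue = 0
    for table in tables:
        residue = table[(residue, x & ((1 << digit_bits) - 1))]
        x >>= digit_bits
    return residue

tables = build_stage_tables(num_digits=4)               # 16-bit inputs
assert all(binary_to_rns(x, tables) == x % MOD for x in (0, 1, 12345, 65535))
```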


Field-Programmable Technology | 2016

A memory-based realization of a binarized deep convolutional neural network

Hiroki Nakahara; Haruyoshi Yonekawa; Tsutomu Sasao; Hisashi Iwamoto; Masato Motomura

A pre-trained deep convolutional neural network (CNN) is a feed-forward computation that is widely used in embedded systems, which require high power and area efficiency. This paper realizes a binarized CNN which treats only two binary values (+1/−1) for the inputs and the weights. In this case, the multiplier is replaced with an XNOR circuit. To reduce both power and area, we realize the binarized CNN with off-chip and on-chip memories. Since our 2D convolution operations are realized by the on-chip memory, our implementation consumes less power than a DSP-block-based one. We decompose the memory part and realize it by a cascade of memories (an LUT cascade). By introducing a batch normalization technique, the classification error of the binarized CNN can be improved. We implemented the CIFAR-10 benchmark on the NetFPGA-SUME board, which has a Xilinx Inc. Virtex 7 FPGA and three off-chip QDR II+ synchronous SRAMs. Compared with the conventional FPGA realizations, the performance is 2.82 times faster, the power efficiency is 1.76 times better, and the area efficiency is 29.13 times better.
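
As a toy illustration of realizing a binarized 2D convolution by memory lookups (with sizes chosen for the example, not taken from the paper), the sketch below precomputes a 512-entry table for a fixed 3x3 binary kernel so that each output value is a single table read.

```python
# Sketch of a binarized 2D convolution realized as a memory lookup (illustrative
# sizes, not the paper's implementation): for a fixed 3x3 binary kernel, the 9
# input bits form the address and the table stores the precomputed +/-1 sum,
# so no multiplier or adder tree is needed at run time.
import numpy as np

KERNEL = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=np.uint8)          # 0/1 encodes -1/+1

def build_conv_lut(kernel):
    """512-entry table: 9-bit input-patch address -> signed convolution sum."""
    flat_k = kernel.flatten()
    lut = np.empty(512, dtype=np.int8)
    for addr in range(512):
        patch = np.array([(addr >> i) & 1 for i in range(9)], dtype=np.uint8)
        matches = np.sum(patch == flat_k)                # XNOR popcount
        lut[addr] = 2 * int(matches) - 9                 # back to +/-1 arithmetic
    return lut

def conv2d_via_lut(image_bits, lut):
    """Slide a 3x3 window over a binary image and read the sum from the table."""
    h, w = image_bits.shape
    out = np.empty((h - 2, w - 2), dtype=np.int8)
    for y in range(h - 2):
        for x in range(w - 2):
            patch = image_bits[y:y + 3, x:x + 3].flatten()
            addr = int(np.dot(patch, 1 << np.arange(9)))
            out[y, x] = lut[addr]
    return out

lut = build_conv_lut(KERNEL)
img = (np.random.default_rng(0).random((6, 6)) > 0.5).astype(np.uint8)
print(conv2d_via_lut(img, lut))
```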


IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2016

LUT Cascades Based on Edge-Valued Multi-Valued Decision Diagrams: Application to Packet Classification

Hiroki Nakahara; Tsutomu Sasao; Hisashi Iwamoto; Munehiro Matsuura

This paper presents a packet classifier using multiple LUT cascades for edge-valued multi-valued decision diagrams (EVMDD (k)s). Since the proposed classifier uses both DSP blocks and on-chip memories, it can efficiently use the available FPGA resources. Thus, it can realize a parallel packet classifier on a single-chip FPGA for the next-generation 400 Gb/s Internet link rate (IEEE 802.3). Since it is memory-based, its power consumption is lower than that of a TCAM-based one. We also propose an on-line update method that can be applied without interrupting the packet classification. Compared with the conventional off-line update, which requires resynthesis of regenerated HDL code, it drastically reduces the update time. Although the proposed on-line update requires additional hardware, the overhead is only 8.5% of the original LUT cascades, which is acceptable. We implemented a two-parallel packet classifier on a Virtex 7 VC707 evaluation board. The system throughput is 640 Gb/s for the minimum packet size (40 bytes). For the performance per memory, the proposed architecture is 2.21 times higher than existing methods. For the power consumption per performance, the proposed architecture is 11.95 times lower than existing methods.
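
For readers unfamiliar with edge-valued decision diagrams, the following toy sketch shows the evaluation principle they rely on: every outgoing edge carries an additive value, and the function value is the sum of the edge values along the single path selected by the input digits. The diagram built here is a hand-made example, not the paper's packet-classification EVMDD.

```python
# Minimal sketch of evaluating an edge-valued decision diagram (illustrative
# toy diagram, not the paper's packet-classification EVMDD): each node tests
# one input digit, every outgoing edge carries an additive value, and the
# function value is the sum of edge values along the single path selected by
# the input. Lookup cost is one small table access per variable.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    var: Optional[int]                 # index of the input digit this node reads
    edges: Optional[list]              # one (edge_value, child Node) per digit value

TERMINAL = Node(var=None, edges=None)  # single 0-valued terminal

def evaluate(root, digits):
    """Sum the edge values along the path chosen by the input digits."""
    total, node = 0, root
    while node is not TERMINAL:
        value, child = node.edges[digits[node.var]]
        total += value
        node = child
    return total

# Toy diagram for f(d0, d1) = 4*d0 + d1 with 2-bit digits (d0, d1 in 0..3).
# Note the shared child node, which is why these diagrams stay compact.
level1 = Node(var=1, edges=[(0, TERMINAL), (1, TERMINAL), (2, TERMINAL), (3, TERMINAL)])
root   = Node(var=0, edges=[(0, level1), (4, level1), (8, level1), (12, level1)])

assert all(evaluate(root, [d0, d1]) == 4 * d0 + d1
           for d0 in range(4) for d1 in range(4))
```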


Field Programmable Gate Arrays | 2018

A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA

Hiroki Nakahara; Haruyoshi Yonekawa; Tomoya Fujii; Shimpei Sato

Frame object detection consists of two problems: a regression problem to spatially separated bounding boxes, and the associated classification of the objects, both within a real-time frame rate. It is widely used in embedded systems such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. This paper implements the YOLO (You Only Look Once) object detector on an FPGA, which is faster and has higher accuracy. It is based on a convolutional deep neural network (CNN), which dominates both the performance and the area. However, an object detector based on a CNN consists of a bounding box prediction (regression) and a class estimation (classification), so a conventional fully binarized CNN fails to recognize objects in most cases. In this paper, we propose a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and a parallel support vector regression (SVR) for both classification and localization. To our knowledge, this is the first time binarized CNNs have been successfully used in object detection. We implement a pipelined architecture for the lightweight YOLOv2 on the Xilinx Inc. zcu102 board, which has the Xilinx Inc. Zynq UltraScale+ MPSoC. The implemented object detector achieved 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, it dissipated 1.1 times more power, and its performance per power was 158.9 times better. Also, compared with the NVIDIA Pascal embedded GPU, it was 27.5 times faster, it dissipated 1.5 times less power, and its performance per power was 42.9 times better. Thus, our method is suitable for frame object detection in an embedded vision system.
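
The sketch below illustrates the idea of parallel linear support vector regression heads on a shared feature vector: every output (four box coordinates plus one score per class) is an independent w.x + b, so all of them collapse into one matrix-vector product. The feature dimension, class count, and random parameters are placeholders, not the paper's trained model.

```python
# Illustrative sketch of parallel linear support vector regressors on top of a
# shared feature vector (hypothetical shapes; not the paper's trained model).
# Each output -- four box coordinates and one score per class -- is an
# independent regressor w.x + b, so they all reduce to one matrix-vector
# product that can run in parallel.
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 256                   # length of the binarized-CNN feature vector
NUM_CLASSES = 20
NUM_OUTPUTS = 4 + NUM_CLASSES       # (x, y, w, h) + one confidence per class

# Hypothetical trained SVR parameters: one weight row and bias per output.
W = rng.standard_normal((NUM_OUTPUTS, FEATURE_DIM)).astype(np.float32)
b = rng.standard_normal(NUM_OUTPUTS).astype(np.float32)

def parallel_svr(features):
    """Evaluate every regressor at once: a single mat-vec plus bias."""
    return W @ features + b

features = rng.standard_normal(FEATURE_DIM).astype(np.float32)
outputs = parallel_svr(features)
box, class_scores = outputs[:4], outputs[4:]
print("predicted box:", box)
print("best class:", int(np.argmax(class_scores)))
```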


Field Programmable Logic and Applications | 2017

A fully connected layer elimination for a binarized convolutional neural network on an FPGA

Hiroki Nakahara; Tomoya Fujii; Shimpei Sato

A pre-trained convolutional deep neural network (CNN) is widely used for embedded systems, which require high power and area efficiency. In that case, the CPU is too slow, the embedded GPU dissipates much power, and the ASIC cannot keep up with the rapid progress of CNN variations. This paper uses a binarized CNN which treats only two binary values for the inputs and the weights. Since the multiplier is replaced by an XNOR circuit, we can realize a high-performance MAC circuit by using many XNOR circuits. In this paper, we eliminate the internal FC layers excluding the last one and insert a binarized average pooling layer, which can be realized by a majority circuit for binarized (1/0) values. In that case, since the weight memory is replaced by a 1s counter, we can realize a more compact and faster CNN than the conventional ones. We implemented the VGG-11 benchmark CNN for the CIFAR-10 image classification task on the Xilinx Inc. Zedboard. Compared with the conventional binarized implementations on an FPGA, the classification accuracy was almost the same, the performance per power was 5.1 times better, the performance per area was 8.0 times better, and the performance per memory was 8.2 times better.
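
A minimal sketch of the binarized average pooling idea follows: averaging 0/1 activations over a window and re-binarizing is equivalent to a majority vote, i.e. a ones counter compared against half the window size. The shapes and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
# Minimal sketch of the binarized average-pooling idea (illustrative shapes):
# averaging binarized (1/0) activations over a channel and re-binarizing is the
# same as taking a majority vote, so the layer reduces to a ones counter and a
# comparison against half the window size.
import numpy as np

def binarized_global_average_pool(feature_map):
    """feature_map: (channels, height, width) array of 0/1 activations.
    Returns one 0/1 value per channel via a majority vote (ties round up)."""
    channels, height, width = feature_map.shape
    ones = feature_map.reshape(channels, -1).sum(axis=1)   # the "1s counter"
    return (ones * 2 >= height * width).astype(np.uint8)   # majority -> 1

rng = np.random.default_rng(1)
fmap = (rng.random((8, 4, 4)) > 0.5).astype(np.uint8)
print(binarized_global_average_pool(fmap))
```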


Asian Conference on Defence Technology | 2017

Performance evaluation of different inpainting algorithms for remotely sensed images

Luqman Ali; Teerasit Kasetkasem; Wasif Khan; Thitiporn Chanwimaluang; Hiroki Nakahara

Image inpainting refers to a technique in which missing areas of an image are filled in such a way that they look plausible to the human eye, by retrieving information from the surrounding pixels. Degradation in remote sensing images is usually caused by dead pixels, noise, clouds, sensor problems, or communication system problems. The aim of this paper is to evaluate the performance of different inpainting algorithms for remote sensing images. Satellite images are tested with different inpainting algorithms, and the efficiency of each algorithm is evaluated on the basis of processing time, root mean square error (RMSE), and peak signal-to-noise ratio (PSNR).
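
For reference, the sketch below computes the two standard quality metrics named in the abstract, RMSE and PSNR, for 8-bit images; the images used are random placeholders, not the paper's test data.

```python
# Minimal sketch of the two evaluation metrics used in the paper, RMSE and
# PSNR, for 8-bit images (standard definitions; the comparison images here are
# placeholders, not the paper's data).
import numpy as np

def rmse(reference, restored):
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(reference, restored, peak=255.0):
    err = rmse(reference, restored)
    return float("inf") if err == 0 else 20.0 * np.log10(peak / err)

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
restored  = np.clip(reference.astype(int) + rng.integers(-5, 6, size=(64, 64)), 0, 255)
print(f"RMSE = {rmse(reference, restored):.2f}, PSNR = {psnr(reference, restored):.2f} dB")
```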

Collaboration


Dive into Hiroki Nakahara's collaboration.

Top Co-Authors

Shimpei Sato
Tokyo Institute of Technology

Haruyoshi Yonekawa
Tokyo Institute of Technology

Tomoya Fujii
Tokyo Institute of Technology

Akira Jinguji
Tokyo Institute of Technology