Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Naoya Onizawa is active.

Publication


Featured research published by Naoya Onizawa.


IEEE Transactions on Very Large Scale Integration Systems | 2010

Design of High-Throughput Fully Parallel LDPC Decoders Based on Wire Partitioning

Naoya Onizawa; Takahiro Hanyu; Vincent C. Gaudet

We present a method to design high-throughput fully parallel low-density parity-check (LDPC) decoders. With our method, a decoder's longest wires are divided into several short wires with pipeline registers. Log-likelihood ratio messages transmitted along these pipelined paths are thus sent over multiple clock cycles, and the decoder's critical path delay can be reduced while maintaining comparable bit error rate performance. The number of registers inserted into paths is estimated using wiring information extracted from the initial placement and routing of a conventional LDPC decoder, so only the necessary registers are inserted. Also, by inserting an even number of registers into the longer wires, two different codewords can be decoded simultaneously, which improves throughput at a small penalty in area. We present our design flow as well as post-layout simulation results for several versions of a length-1024, (3,6)-regular LDPC code. Using our technique, we achieve a maximum uncoded throughput of 13.21 Gb/s with an energy consumption of 0.098 nJ per uncoded bit at Eb/N0 = 5 dB. This represents a 28% increase in throughput, a 30% decrease in energy per bit, and a 1.6% increase in core area with respect to a conventional parallel LDPC decoder in a 90-nm CMOS technology.
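
The register-insertion estimate lends itself to a back-of-the-envelope calculation. The sketch below assumes a simple linear wire-delay model with illustrative constants (none of the names or numbers come from the paper): the register count follows from the wire delay and the target clock period, and rounding it up to an even number leaves room for a second in-flight codeword.

import math

def registers_for_wire(wire_length_um, delay_per_um_ps, target_cycle_ps):
    # Number of pipeline registers needed so that every wire segment
    # fits within the target clock period (linear-delay assumption).
    wire_delay = wire_length_um * delay_per_um_ps
    segments = math.ceil(wire_delay / target_cycle_ps)
    return max(segments - 1, 0)

def registers_for_two_codewords(wire_length_um, delay_per_um_ps, target_cycle_ps):
    # Round the register count up to the next even number so two
    # codewords can be interleaved on the same pipelined wire.
    n = registers_for_wire(wire_length_um, delay_per_um_ps, target_cycle_ps)
    return n + (n % 2)

# Example: a 2 mm wire at 0.2 ps/um with a 250 ps clock target.
print(registers_for_two_codewords(2000, 0.2, 250))  # -> 2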


International Symposium on Circuits and Systems | 2012

Architecture and implementation of an associative memory using sparse clustered networks

Hooman Jarollahi; Naoya Onizawa; Vincent Gripon; Warren J. Gross

Associative memories are alternatives to indexed memories that, when implemented in hardware, can benefit many applications such as data mining. The classical neural-network-based approach is impractical to implement because, as the size of the memory grows, the number of information bits stored per memory bit (efficiency) approaches zero. In addition, the length of a message to be stored and retrieved must equal the number of nodes in the network, which limits the total number of messages the network can store (diversity). Recently, a novel algorithm based on sparse clustered neural networks has been proposed that achieves nearly optimal efficiency and large diversity. In this paper, a proof-of-concept hardware implementation of these networks is presented. The limitations and possible future research directions are discussed.
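
To make the storage and retrieval mechanics concrete, here is a minimal behavioral sketch of a sparse clustered (Gripon-Berrou-style) network in Python: C clusters of L binary neurons, binary inter-cluster connections, and per-cluster winner-take-all during retrieval. The sizes and iteration count are illustrative, not taken from the implementation described above.

import itertools

C, L = 4, 16                    # clusters, neurons per cluster (illustrative)
W = set()                       # binary directed connections (ci, ni, cj, nj)

def store(msg):
    # msg: tuple of C symbols in range(L); connect every pair of the
    # neurons that represent the message, one neuron per cluster.
    for (ci, ni), (cj, nj) in itertools.combinations(enumerate(msg), 2):
        W.add((ci, ni, cj, nj))
        W.add((cj, nj, ci, ni))

def retrieve(partial):
    # partial: list of C symbols with None for erased positions.
    active = {c: {s} if s is not None else set(range(L))
              for c, s in enumerate(partial)}
    for _ in range(4):          # a few message-passing iterations
        scores = {(c, n): sum(any((cj, nj, c, n) in W for nj in active[cj])
                              for cj in range(C) if cj != c)
                  for c in range(C) for n in range(L)}
        for c in range(C):      # winner-take-all within each cluster
            best = max(scores[(c, n)] for n in range(L))
            active[c] = {n for n in range(L) if scores[(c, n)] == best}
    return [next(iter(active[c])) if len(active[c]) == 1 else None
            for c in range(C)]

store((3, 7, 1, 12))
print(retrieve([3, None, 1, None]))  # -> [3, 7, 1, 12]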


International Symposium on Turbo Codes and Iterative Information Processing | 2016

VLSI implementation of deep neural networks using integral stochastic computing

Arash Ardakani; François Leduc-Primeau; Naoya Onizawa; Takahiro Hanyu; Warren J. Gross

The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention since many applications require high-speed operation. However, numerous processing elements and complex interconnections are usually required, leading to large area occupation and high power consumption. Stochastic computing has shown promising results for area-efficient hardware implementations, even though existing stochastic algorithms require long bit-streams that incur long latency. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture uses integer stochastic streams and a modified finite-state-machine-based tanh function to improve performance and reduce latency compared to existing stochastic architectures for DNNs. Simulation results show negligible performance loss for the proposed integral stochastic DNNs of different network sizes compared to their floating-point versions.
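
The core idea of integral stochastic computing, carrying a value as the elementwise sum of several binary stochastic streams so that addition becomes exact integer addition rather than the scaled addition of plain stochastic computing, can be sketched in a few lines. The stream length and the range parameter m below are illustrative.

import random

def binary_stream(p, n):
    # Plain stochastic stream: n Bernoulli(p) bits, mean ~ p.
    return [1 if random.random() < p else 0 for _ in range(n)]

def integral_stream(x, m, n):
    # Integer stream with mean ~ x (0 <= x <= m), built as the
    # elementwise sum of m binary streams.
    parts = [binary_stream(x / m, n) for _ in range(m)]
    return [sum(bits) for bits in zip(*parts)]

def mean(stream):
    return sum(stream) / len(stream)

n = 10_000
a = integral_stream(1.5, m=2, n=n)       # encodes 1.5
b = integral_stream(0.8, m=2, n=n)       # encodes 0.8
s = [ai + bi for ai, bi in zip(a, b)]    # exact addition, mean ~ 2.3
print(round(mean(a), 2), round(mean(b), 2), round(mean(s), 2))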


IEEE Signal Processing Letters | 2015

Gabor Filter Based on Stochastic Computation

Naoya Onizawa; Daisaku Katagiri; Kazumichi Matsumiya; Warren J. Gross; Takahiro Hanyu

This letter introduces a design and proof-of-concept implementation of Gabor filters based on stochastic computation for area-efficient hardware. The Gabor filter exhibits a powerful image-feature-extraction capability, but it requires significant computational power. Using stochastic computation, the sine function used in the Gabor filter is approximated by combining several stochastic tanh functions designed around a state machine. A stochastic Gabor filter realized using the stochastic sine shaper and a stochastic exponential function is simulated and shows almost equivalent behaviour to the original Gabor filter at various frequencies and variances, with a root-mean-square error of at most 0.043. To reduce the long latency due to stochastic computation, 68 parallel stochastic Gabor filters are implemented in Silterra 0.13 μm CMOS technology. As a result, the proposed Gabor filters achieve a 78% area reduction compared with a conventional Gabor filter while maintaining comparable speed.
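
The FSM-based stochastic tanh used as a building block above is, in the standard construction, a saturating up/down counter driven by the input bit-stream. A minimal sketch follows, assuming the classic K-state form in which a bipolar-coded input x yields an output stream approximating tanh(Kx/2); the state count and stream length are illustrative.

import random

def stochastic_tanh(bits, k=8):
    # K-state saturating up/down counter; emits 1 while the state sits
    # in the upper half. Approximates tanh(k/2 * x) in bipolar coding.
    state, out = k // 2, []
    for b in bits:
        state = min(state + 1, k - 1) if b else max(state - 1, 0)
        out.append(1 if state >= k // 2 else 0)
    return out

def bipolar_value(stream):
    # Decode a bipolar-coded stream back to a value in [-1, 1].
    return 2 * sum(stream) / len(stream) - 1

x = 0.4                                  # input value in [-1, 1]
p = (x + 1) / 2                          # bipolar encoding probability
bits = [1 if random.random() < p else 0 for _ in range(100_000)]
print(bipolar_value(stochastic_tanh(bits, k=8)))  # ~ tanh(1.6) ~ 0.92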


Application-Specific Systems, Architectures and Processors | 2013

A low-power Content-Addressable Memory based on clustered-sparse networks

Hooman Jarollahi; Vincent Gripon; Naoya Onizawa; Warren J. Gross

A low-power Content-Addressable Memory (CAM) is introduced, employing a new mechanism for associativity between the input tags and the corresponding addresses of the output data. The proposed architecture is based on a recently developed clustered-sparse network using binary-weighted connections that, on average, eliminates most of the parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed design is significantly lower than that of a conventional low-power CAM design. Given an input tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the comparisons only on them to locate a single valid match. A 0.13 μm CMOS technology was used for simulation purposes. The energy consumption and the search delay of the proposed design are 9.5% and 30.4% of those of the conventional NAND architecture, respectively, at the cost of 3.4% more transistors.
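
At a purely functional level, the search-space reduction can be modeled as follows: each field of the tag points at a small set of candidate rows, and only rows in the intersection are actually compared. This is a behavioral analogue of the clustered-sparse associativity, not the circuit itself; the class name and field sizes are invented for illustration.

from collections import defaultdict

class FilteredCAM:
    # Behavioral model only: a real CAM compares in parallel hardware.
    def __init__(self, fields=4, field_bits=4):
        self.fields, self.field_bits = fields, field_bits
        self.rows = []                        # stored tags
        self.index = [defaultdict(set) for _ in range(fields)]

    def _split(self, tag):
        mask = (1 << self.field_bits) - 1
        return [(tag >> (i * self.field_bits)) & mask
                for i in range(self.fields)]

    def write(self, tag):
        row = len(self.rows)
        self.rows.append(tag)
        for i, f in enumerate(self._split(tag)):
            self.index[i][f].add(row)

    def search(self, tag):
        # Intersect the per-field candidate sets, then compare only
        # the surviving rows instead of every stored entry.
        cand = set.intersection(
            *(self.index[i][f] for i, f in enumerate(self._split(tag))))
        return [r for r in cand if self.rows[r] == tag]

cam = FilteredCAM()
for t in (0x1A2B, 0x3C4D, 0x1A2C):
    cam.write(t)
print(cam.search(0x1A2B))  # -> [0]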


IEEE International Symposium on Asynchronous Circuits and Systems | 2012

High-Throughput Low-Energy Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism

Naoya Onizawa; Shoun Matsunaga; Vincent C. Gaudet; Takahiro Hanyu

This paper introduces a self-timed overlapped search mechanism for high-throughput content-addressable memories (CAMs) with low search energy. Most mismatches can be found by searching the first few bits of a search word. Consequently, if a word circuit is divided into two sections that are searched sequentially, most match lines in the second section are unused. As searching the first section is faster than searching an entire word, throughput can be increased by initiating a second-stage search on the unused match lines as soon as a first-stage search completes. The overlapped search mechanism is realized using a self-timed word circuit that is independently controlled by a locally generated control signal, reducing the power dissipated in global clocking. A 256 x 144-bit CAM designed in 90 nm CMOS operates with 5.57x higher throughput than a synchronous CAM, with 38% energy savings and 8% area overhead.
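
A quick behavioral model shows why the second section sees so little activity: matching only the first few bits already eliminates almost all words. The word count and widths below loosely mirror the 256 x 144-bit design and are otherwise illustrative.

import random

WORDS, BITS, FIRST = 256, 144, 8         # words, word width, first section

entries = [random.getrandbits(BITS) for _ in range(WORDS)]
key = entries[42]                        # search for a stored word

first_mask = (1 << FIRST) - 1
survivors = [i for i, w in enumerate(entries)
             if (w & first_mask) == (key & first_mask)]
matches = [i for i in survivors if entries[i] == key]

print(f"{len(survivors)} of {WORDS} words reach the second stage")
print("match at rows:", matches)         # -> [42]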


International Conference on Acoustics, Speech, and Signal Processing | 2013

Reduced-complexity binary-weight-coded associative memories

Hooman Jarollahi; Naoya Onizawa; Vincent Gripon; Warren J. Gross

Associative memories retrieve stored information given partial or erroneous input patterns. Recently, a new family of associative memories based on Clustered Neural Networks (CNNs) was introduced that can store many more messages than classical Hopfield Neural Networks (HNNs). In this paper, we propose hardware architectures of such memories for partial or erroneous inputs. The proposed architectures eliminate winner-take-all modules, reducing the hardware complexity by consuming 65% fewer FPGA lookup tables and increasing the operating frequency by approximately 1.9x compared to previous work.
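
The abstract does not spell out the exact simplification, but one standard way to avoid winner-take-all with binary weights is to keep a neuron only if it receives a signal from every cluster whose symbol is known, i.e., to compare its score against a fixed count rather than search for a per-cluster maximum. The hedged sketch below illustrates that idea; the paper's architecture may differ in detail.

import itertools

def retrieve_thresholded(W, partial, C, L):
    # partial: list of C symbols with None for erased positions.
    # A neuron survives only if every known cluster supplies a signal,
    # so the test is a comparison with len(known), not a max search.
    known = [(c, s) for c, s in enumerate(partial) if s is not None]
    out = list(partial)
    for c in range(C):
        if partial[c] is not None:
            continue
        hits = [n for n in range(L)
                if sum((cj, nj, c, n) in W for cj, nj in known) == len(known)]
        out[c] = hits[0] if len(hits) == 1 else None
    return out

C, L = 4, 16
W = set()                                # store one message, as before
for (ci, ni), (cj, nj) in itertools.combinations(enumerate((3, 7, 1, 12)), 2):
    W.update({(ci, ni, cj, nj), (cj, nj, ci, ni)})
print(retrieve_thresholded(W, [3, None, 1, None], C, L))  # -> [3, 7, 1, 12]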


IEEE Transactions on Nanotechnology | 2016

Analog-to-Stochastic Converter Using Magnetic Tunnel Junction Devices for Vision Chips

Naoya Onizawa; Daisaku Katagiri; Warren J. Gross; Takahiro Hanyu

This paper introduces an analog-to-stochastic converter using a magnetic tunnel junction (MTJ) device for vision chips based on stochastic computation. Stochastic computation has recently been exploited for area-efficient hardware implementations, such as low-density parity-check decoders and image processors. However, power- and area-hungry two-step (analog-to-digital and digital-to-stochastic) converters are required for the analog-to-stochastic signal conversion. To realize a one-step conversion, an MTJ device is used, as it inherently exhibits probabilistic switching behavior between two resistance states. Exploiting this device-level probabilistic behavior, analog signals can be converted directly and area-efficiently to stochastic signals, mitigating the signal-conversion overhead. The analog-to-stochastic signal conversion is described theoretically, and the conversion characteristic is evaluated using device and circuit parameters. In addition, the resistance variability of the MTJ device is considered in order to compensate for its effect on the signal conversion. Based on the theoretical analysis, the analog-to-stochastic converter is designed in 90-nm CMOS and 100-nm MTJ technologies and is verified using a SPICE simulator (NS-SPICE) that handles both transistors and MTJ devices.
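
The conversion principle can be sketched with the usual thermal-activation (Neel-Brown) switching model: the probability that the MTJ flips during a current pulse depends on the pulse amplitude and duration, so repeatedly resetting, pulsing, and reading the device yields a Bernoulli bit-stream directly from an analog input. All constants below are illustrative, not the paper's device parameters.

import math, random

def switching_probability(i_pulse, t_pulse, i_c0=100e-6, tau0=1e-9, delta=40):
    # Thermal-activation switching model; all constants illustrative.
    if i_pulse >= i_c0:
        return 1.0
    rate = (1.0 / tau0) * math.exp(-delta * (1.0 - i_pulse / i_c0))
    return 1.0 - math.exp(-rate * t_pulse)

def analog_to_stochastic(i_pulse, t_pulse, n_bits):
    # One-step conversion: each bit = reset, pulse, read final state.
    p = switching_probability(i_pulse, t_pulse)
    return [1 if random.random() < p else 0 for _ in range(n_bits)]

bits = analog_to_stochastic(i_pulse=85e-6, t_pulse=10e-9, n_bits=10_000)
print(sum(bits) / len(bits))             # estimated switching probability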


IEEE Transactions on Circuits and Systems | 2011

Low-Energy Asynchronous Interleaver for Clockless Fully Parallel LDPC Decoding

Naoya Onizawa; Vincent C. Gaudet; Takahiro Hanyu

This paper presents a low-energy asynchronous interleaver for clockless fully parallel low-density parity-check (LDPC) decoding. The proposed data-transmission circuit, based on a half-duplex single-track protocol, makes it possible to realize a wire-efficient asynchronous interleaver with small energy consumption. Moreover, a data-monitoring system adaptively shuts down the asynchronous data-transmission circuit when it is not needed, which reduces the number of data transmissions and, hence, the energy consumed. The clockless decoder with the proposed asynchronous interleaver is evaluated using a (1056,528) irregular LDPC code in a 90-nm CMOS process. In post-layout simulation, the energy dissipation per uncoded bit at Eb/N0 of 5 dB is 54 pJ/bit with an uncoded throughput of 45.5 Gbps. This represents a 92% decrease in energy per bit and a 1143% increase in throughput with respect to our previous clockless LDPC decoder.
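
The data-monitoring idea, suppressing a transmission whenever a message has not changed since the previous iteration, is easy to model behaviorally. The sketch below counts how many transmissions a send-on-change policy saves on a toy message history; the values are illustrative.

def count_transmissions(message_history):
    # message_history: per-iteration lists of the messages on each wire.
    # Returns (sent, total) under a send-on-change policy.
    sent = total = 0
    previous = None
    for messages in message_history:
        total += len(messages)
        if previous is None:
            sent += len(messages)
        else:
            sent += sum(m != p for m, p in zip(messages, previous))
        previous = messages
    return sent, total

# Messages settle after a couple of iterations, as LDPC messages tend
# to once the decoder converges (toy values):
history = [[1, -2, 3, 0], [1, -1, 3, 0], [1, -1, 3, 0], [1, -1, 3, 0]]
print(count_transmissions(history))      # -> (5, 16)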


IEEE Transactions on Very Large Scale Integration Systems | 2015

Algorithm and Architecture for a Low-Power Content-Addressable Memory Based on Sparse Clustered Networks

Hooman Jarollahi; Vincent Gripon; Naoya Onizawa; Warren J. Gross

We propose a low-power content-addressable memory (CAM) employing a new algorithm for associativity between the input tag and the corresponding address of the output data. The proposed architecture is based on a recently developed sparse clustered network using binary connections that, on average, eliminates most of the parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed design is significantly lower than that of a conventional low-power CAM design. Given an input tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the comparisons only on them to locate a single valid match. TSMC 65-nm CMOS technology was used for simulation purposes. Given a selection of design parameters, such as the number of CAM entries, the energy consumption and the search delay of the proposed design are 8% and 26% of those of the conventional NAND architecture, respectively, with a 10% area overhead. A design methodology based on silicon-area and power budgets and performance requirements is discussed.

Collaboration


Dive into Naoya Onizawa's collaborations.
