Publication


Featured research published by Phil Knag.


IEEE Transactions on Nanotechnology | 2014

A Native Stochastic Computing Architecture Enabled by Memristors

Phil Knag; Wei Lu; Zhengya Zhang

A two-terminal memristor device is a promising digital memory for its high integration density, substantially lower energy consumption compared to CMOS, and scalability below 10 nm. However, a nanoscale memristor is an inherently stochastic device, and extra energy and latency are required to make a deterministic memory based on memristors. Instead of enforcing deterministic storage, we take advantage of the nondeterministic memory for native stochastic computing, where the randomness required by stochastic computing is intrinsic to the devices, without resorting to expensive stochastic number generation. This native stochastic computing system can be implemented as a hybrid integration of memristor memory and simple CMOS stochastic computing circuits. We use an approach called group write to program memristor memory cells in arrays to generate random bit streams for stochastic computing. Three methods are proposed to program memristors using stochastic bit streams and to compensate for the nonlinear memristor write function: voltage predistortion, parallel single-pulse write, and downscaled write and upscaled read. To evaluate these techniques, we demonstrate by simulation a memristor-based stochastic processor for gradient descent optimization and k-means clustering. Native stochastic computing based on memristors demonstrates key advantages in energy and speed in compute-intensive, data-intensive, and probabilistic applications.
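
The core trick of stochastic computing is that arithmetic on probabilities reduces to single logic gates applied to random bit streams; for instance, an AND gate multiplies the values encoded by two independent streams. Below is a minimal Python sketch of that idea; the helper names are ours, and a software PRNG stands in for the memristor group-write randomness the paper proposes.

    import random

    def stochastic_stream(p, length, rng):
        # Bit stream whose density of 1s encodes a value p in [0, 1].
        # In the paper's native scheme these bits would come from
        # probabilistic memristor writes (e.g., a group write);
        # a PRNG stands in here.
        return [1 if rng.random() < p else 0 for _ in range(length)]

    def stochastic_multiply(a, b):
        # Bitwise AND of two independent streams multiplies their
        # densities: P(a_i = 1 and b_i = 1) = p_a * p_b.
        return [x & y for x, y in zip(a, b)]

    def decode(stream):
        # Recover the encoded value as the fraction of 1s.
        return sum(stream) / len(stream)

    rng = random.Random(42)
    a = stochastic_stream(0.5, 4096, rng)
    b = stochastic_stream(0.3, 4096, rng)
    print(decode(stochastic_multiply(a, b)))  # ~0.15 = 0.5 * 0.3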


IEEE Journal of Solid-State Circuits | 2015

A Sparse Coding Neural Network ASIC With On-Chip Learning for Feature Extraction and Encoding

Phil Knag; Jung Kuk Kim; Thomas C. Chen; Zhengya Zhang

Hardware-based computer vision accelerators will be an essential part of future mobile devices to meet their low-power and real-time processing requirements. To realize high energy efficiency and high throughput, the accelerator architecture can be massively parallelized and tailored to vision processing, an advantage over software-based solutions and general-purpose hardware. In this work, we present an ASIC that is designed to learn and extract features from images and videos. The ASIC contains 256 leaky integrate-and-fire neurons connected in a scalable two-layer network of 8 × 8 grids linked in a 4-stage ring. Sparse neuron activation and the relatively small grid keep the spike collision probability low enough to avoid access arbitration. The weight memory is divided into core memory and auxiliary memory, such that the auxiliary memory is powered on only for learning, saving inference power. High-throughput inference is accomplished by the parallel operation of neurons. Efficient learning is implemented by passing parameter update messages, which is further simplified by an approximation technique. A 3.06 mm2 65 nm CMOS ASIC test chip achieves a maximum inference throughput of 1.24 Gpixel/s at 1.0 V and 310 MHz, and on-chip learning can be completed in seconds. To reduce power consumption and improve energy efficiency, the core memory supply voltage can be lowered to 440 mV, taking advantage of the error resilience of the algorithm and reducing the inference power to 6.67 mW for a 140 Mpixel/s throughput at 35 MHz.
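
For readers unfamiliar with the neuron model, a leaky integrate-and-fire neuron accumulates input into a membrane potential that decays each step and emits a spike when the potential crosses a threshold. Below is a minimal NumPy sketch of one update step for a 256-neuron population; the leak factor, threshold, and reset-to-zero rule are illustrative assumptions, not the chip's actual parameters.

    import numpy as np

    def lif_step(v, input_current, leak=0.9, threshold=1.0):
        # One time step for a population of leaky integrate-and-fire
        # neurons. Leak, threshold, and reset rule are illustrative.
        v = leak * v + input_current      # leaky integration of input
        spikes = v >= threshold           # neurons at threshold fire
        v[spikes] = 0.0                   # fired neurons reset
        return v, spikes

    rng = np.random.default_rng(0)
    v = np.zeros(256)                     # 256 neurons, matching the ASIC
    for _ in range(10):
        v, spikes = lif_step(v, 0.3 * rng.random(256))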


International Symposium on Circuits and Systems | 2014

Memristive Devices for Stochastic Computing

Siddharth Gaba; Phil Knag; Zhengya Zhang; Wei Lu

We show that resistive switching in memristive devices exhibits significant stochasticity. When the switching is dominated by a single filament, the switching time is fully random and shows a broad distribution. However, the switching-time distribution can be predicted and responds well to controlled changes in the programming conditions. This native stochastic characteristic can be used to generate random bit streams with predictable biases, enabling efficient and error-tolerant computing.
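
A common first-order picture of this stochasticity, used here purely for illustration (the paper fits its own measured distributions), models filament formation as a Poisson process, so the probability of switching within a programming pulse of width t is 1 - exp(-t/τ), with τ shrinking as the programming voltage rises. A sketch of how such a device yields random bits with a tunable bias:

    import math, random

    def switch_probability(pulse_width, tau):
        # First-order Poisson-process model common in the RRAM
        # literature (an assumption here, not the paper's fit):
        # P(switch within pulse of width t) = 1 - exp(-t / tau).
        return 1.0 - math.exp(-pulse_width / tau)

    def random_bit(pulse_width, tau, rng):
        # One write-then-read trial: the cell switches with the modeled
        # probability, yielding a random bit with a controllable bias.
        return 1 if rng.random() < switch_probability(pulse_width, tau) else 0

    rng = random.Random(1)
    bits = [random_bit(5e-9, 10e-9, rng) for _ in range(10000)]
    print(sum(bits) / len(bits))  # ~0.39 = 1 - exp(-0.5)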


IEEE Transactions on Signal Processing | 2014

Efficient Hardware Architecture for Sparse Coding

Jung Kuk Kim; Phil Knag; Thomas C. Chen; Zhengya Zhang

Sparse coding encodes natural stimuli using a small number of basis functions known as receptive fields. In this work, we design custom hardware architectures for efficient, high-performance implementations of a sparse coding algorithm called the sparse and independent local network (SAILnet). A study of the neuron spiking dynamics uncovers important design considerations involving the neural network size, target firing rate, and neuron update step size. Optimal tuning of these parameters keeps the neuron spikes sparse and random to achieve the best image fidelity. We investigate practical hardware architectures for SAILnet: a bus architecture that provides efficient neuron communication but suffers spike collisions, and a ring architecture that is more scalable but causes neuron misfires. We show that the spike collision rate is reduced with a sparsely spiking neural network, so an arbitration-free bus architecture can be designed to tolerate collisions without the need for arbitration. To reduce neuron misfires, we design a latent ring architecture that damps the neuron responses for improved image fidelity. The bus and ring architectures can be combined in a hybrid architecture to achieve both high throughput and scalability. The three architectures are synthesized and placed and routed in a 65 nm CMOS technology. The proof-of-concept designs demonstrate a sparse coding throughput of up to 952 Mpixel/s at an energy consumption of 0.486 nJ per pixel.
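
To see why sparse firing makes an arbitration-free bus plausible, consider n neurons that each fire independently with probability p per bus cycle: a collision requires at least two simultaneous spikes, so its probability falls roughly as (np)²/2 for small p. A quick back-of-envelope check in Python (the firing rates are our illustrative choices, not the paper's measurements):

    def collision_probability(n_neurons, p_spike):
        # Probability that two or more of n independently firing neurons
        # spike in the same bus cycle: 1 - P(no spike) - P(exactly one).
        p0 = (1 - p_spike) ** n_neurons
        p1 = n_neurons * p_spike * (1 - p_spike) ** (n_neurons - 1)
        return 1.0 - p0 - p1

    # Illustrative firing rates, not measurements from the paper:
    for p in (1e-2, 1e-3, 1e-4):
        print(f"p_spike={p:g}  P(collision)={collision_probability(256, p):.2e}")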


Symposium on VLSI Circuits | 2015

A 640M pixel/s 3.65mW sparse event-driven neuromorphic object recognition processor with on-chip learning

Jung Kuk Kim; Phil Knag; Thomas C. Chen; Zhengya Zhang

A 1.82mm2 65nm neuromorphic object recognition processor is designed using a sparse feature-extraction inference module (IM) and a task-driven dictionary classifier. To achieve a high throughput, the 256-neuron IM is organized as four parallel neural networks that process four image patches and generate sparse neuron spikes. The on-chip classifier is activated by sparse neuron spikes to infer the object class, reducing its power by 88% and simplifying its implementation by removing all multiplications. A lightweight co-processor performs efficient on-chip learning by taking advantage of sparse neuron activity to save 84% of its workload and power. The test chip processes 10.16G pixel/s while dissipating 268mW. The integrated IM and classifier provide extra error tolerance for voltage scaling, lowering power to 3.65mW at a throughput of 640M pixel/s.
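
The multiplication-free claim follows from the inputs being binary spikes: a linear classifier's dot products collapse into conditional accumulations, and sparse activity means only a few accumulations run at all. A minimal sketch under assumed sizes (the 10-class, 5%-activity numbers are ours, not the chip's):

    import numpy as np

    def spike_driven_scores(weights, spikes):
        # Linear classifier driven by binary spikes: with 0/1 inputs the
        # dot product degenerates into summing the weight columns of the
        # neurons that spiked, so no multiplier is ever needed.
        scores = np.zeros(weights.shape[0])
        for j in np.flatnonzero(spikes):  # only active neurons cost work
            scores += weights[:, j]       # accumulate-only, no multiply
        return scores

    rng = np.random.default_rng(0)
    W = rng.standard_normal((10, 256))    # 10 classes x 256 neurons (illustrative)
    spikes = rng.random(256) < 0.05       # roughly 5% of neurons fire
    print(int(np.argmax(spike_driven_scores(W, spikes))))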


IEEE Transactions on Nuclear Science | 2014

Characterization of Heavy-Ion-Induced Single-Event Effects in 65 nm Bulk CMOS ASIC Test Chips

Chia Hsiang Chen; Phil Knag; Zhengya Zhang

Two 65 nm bulk complementary metal-oxide-semiconductor (CMOS) digital application-specific integrated circuit (ASIC) chips were designed and then tested in a heavy-ion accelerator to characterize single-event effects (SEE). Test chip 1 incorporates test structures, and test chip 2 implements an unhardened and a hardened digital signal processing (DSP) core. Our test results reveal the radiation effects on the low-voltage and high-frequency operation of the ASIC chips. At a low supply voltage of 0.7 V, cross sections increase by a factor of 2 to 5 at low linear energy transfer (LET), while the increase in cross section at high LET is almost negligible, suggesting that the charge deposited by a heavy ion far exceeds the critical charge and that tuning the supply voltage is not effective. Increasing the clock frequency increases the relative importance of single-event transients (SET) compared to single-event upsets (SEU), especially in hardened designs due to their better SEU immunity. The hardened DSP core experiences a factor-of-2 increase in cross section when its clock frequency is increased from 100 MHz to 500 MHz.
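
For reference, the cross section quoted in such tests is the observed error count divided by the ion fluence, optionally normalized per bit; how it scales with LET and supply voltage is what the measurements above characterize. A one-line worked example with made-up numbers:

    def see_cross_section(error_count, fluence, n_bits=1):
        # SEE cross section: observed errors divided by the particle
        # fluence (ions per cm^2), optionally normalized per bit.
        return error_count / (fluence * n_bits)

    # Purely illustrative numbers, not the paper's measurements:
    sigma = see_cross_section(error_count=120, fluence=1e7, n_bits=4096)
    print(f"sigma = {sigma:.3e} cm^2/bit")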


Symposium on VLSI Circuits | 2014

A 6.67mW sparse coding ASIC enabling on-chip learning and inference

Jung Kuk Kim; Phil Knag; Thomas C. Chen; Zhengya Zhang

A sparse coding ASIC is designed to learn visual receptive fields and infer the sparse representation of images for encoding, feature detection, and recognition. 256 leaky integrate-and-fire neurons are connected in a 2-layer network of 2D local grids linked in a 4-stage systolic ring to reduce the communication latency. Spikes are kept sparse enough for collisions to be tolerated, saving power. Memory is divided into a core section that supports inference and an auxiliary section that is powered on only for learning. An approximate learning scheme tracks only significant neuron activities to save memory and power. The 3.06mm2 65nm CMOS ASIC achieves an inference throughput of 1.24Gpixel/s at 1.0V and 310MHz, and on-chip learning can be completed in seconds. The memory supply voltage can be reduced to 440mV to exploit the algorithm's error tolerance, reducing the inference power to 6.67mW for a 140Mpixel/s throughput at 35MHz.


Symposium on VLSI Circuits | 2016

A 1.40mm2 141mW 898GOPS sparse neuromorphic processor in 40nm CMOS

Phil Knag; Chester Liu; Zhengya Zhang

Sparsity is a brain-inspired property that enables a significant reduction in workload and power dissipation of deep learning. This work presents a 1.40mm2 40nm CMOS sparse neuromorphic processor that implements a two-layer convolutional restricted Boltzmann machine (CRBM) for inference and a support vector machine (SVM) classifier. The processor incorporates sparse convolvers to realize sparsity-proportional workload reduction. The architecture is parallelized along a non-sparse dimension to minimize stalling. At 0.9V and 240MHz, the processor achieves an effective 898.2GOPS performance, dissipating 140.9mW. Using sparsity, we reduce the workload, datapath power consumption and area by 3.4×, 3.3× and 1.74×, respectively. The design uses latch-based memory to reduce area and dynamic clock gating to save power.
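
The sparsity-proportional workload reduction comes from skipping zero activations entirely: if only 10% of inputs are nonzero, only 10% of the multiply-accumulates execute. A NumPy sketch of a scatter-style sparse convolver (our illustration, not the chip's datapath):

    import numpy as np

    def sparse_conv2d(activations, kernel):
        # Scatter-style 2-D convolution that visits only nonzero
        # activations, so the multiply-accumulate count scales with the
        # activation sparsity. Produces the full zero-padded output.
        H, W = activations.shape
        kh, kw = kernel.shape
        out = np.zeros((H + kh - 1, W + kw - 1))
        ys, xs = np.nonzero(activations)      # work list of nonzero pixels
        for y, x in zip(ys, xs):
            out[y:y + kh, x:x + kw] += activations[y, x] * kernel
        return out

    rng = np.random.default_rng(0)
    act = rng.random((32, 32)) * (rng.random((32, 32)) < 0.1)  # ~90% zeros
    k = rng.standard_normal((5, 5))
    print(np.count_nonzero(act), "nonzeros drive the work, not", act.size)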


IEEE Journal of Solid-State Circuits | 2017

A 1.5-GHz 6.144T Correlations/s 64 × 64 …

John T. Bell; Phil Knag; Shuanghong Sun; Yong Lim; Thomas C. Chen; Jeffrey A. Fredenburg; Chia Hsiang Chen; Chunyang Zhai; Aaron Z. Rocca; Nicholas Collins; Andres Tamez; Jorge Pernillo; Justin M. Correll; Alan B. Tanner; Zhengya Zhang; Michael P. Flynn

A 65-nm CMOS, 18-mm2, 1.5-GHz 64 × 64 …


Archive | 2016

Zhengya Zhang; Thomas C. Chen; Jung Kuk Kim; Phil Knag

Collaboration


Dive into Phil Knag's collaboration.

Top Co-Authors

Thomas C. Chen
University of Southern California

Chester Liu
University of Michigan

Wei Lu
University of Michigan

Alan B. Tanner
California Institute of Technology