Jae-sun Seo | Researchain

Publication

Featured research published by Jae-sun Seo.

Custom Integrated Circuits Conference | 2011

A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons

Jae-sun Seo; Bernard Brezzo; Yong Liu; Benjamin D. Parker; Steven K. Esser; Robert K. Montoye; Bipin Rajendran; Jose A. Tierno; Leland Chang; Dharmendra S. Modha; Daniel J. Friedman

Efforts to achieve the long-standing dream of realizing scalable learning algorithms for networks of spiking neurons in silicon have been hampered by (a) the limited scalability of analog neuron circuits; (b) the enormous area overhead of learning circuits, which grows with the number of synapses; and (c) the need to implement all inter-neuron communication via off-chip address-events. In this work, a new architecture is proposed to overcome these challenges by combining innovations in computation, memory, and communication, respectively, to leverage (a) robust digital neuron circuits; (b) novel transposable SRAM arrays that share learning circuits, which grow only with the number of neurons; and (c) crossbar fan-out for efficient on-chip inter-neuron communication. Through tight integration of memory (synapses) and computation (neurons), a highly configurable chip comprising 256 neurons and 64K binary synapses with on-chip learning based on spike-timing dependent plasticity is demonstrated in 45nm SOI-CMOS. Near-threshold, event-driven operation at 0.53V is demonstrated to maximize power efficiency for real-time pattern classification, recognition, and associative memory tasks. Future scalable systems built from the foundation provided by this work will open up possibilities for ubiquitous ultra-dense, ultra-low power brain-like cognitive computers.

Field Programmable Gate Arrays | 2016

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Naveen Suda; Vikas Chandra; Ganesh Dasika; Abinash Mohanty; Yufei Ma; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao

Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around-time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on P395-D8 board.

IEEE Transactions on Electron Devices | 2013

Specifications of Nanoscale Devices and Circuits for Neuromorphic Computational Systems

Bipin Rajendran; Yong Liu; Jae-sun Seo; Kailash Gopalakrishnan; Leland Chang; Daniel J. Friedman; Mark B. Ritter

The goal of neuromorphic engineering is to build electronic systems that mimic the ability of the brain to perform fuzzy, fault-tolerant, and stochastic computation, without sacrificing either its space or power efficiency. In this paper, we determine the operating characteristics of novel nanoscale devices that could be used to fabricate such systems. We also compare the performance metrics of a million neuron learning system based on these nanoscale devices with an equivalent implementation that is entirely based on end-of-scaling digital CMOS technology and determine the technology targets to be satisfied by these new devices. We show that neuromorphic systems based on new nanoscale devices can potentially improve density and power consumption by at least a factor of 10, as compared with conventional CMOS implementations.

International Solid-State Circuits Conference | 2010

In situ delay-slack monitor for high-performance processors using an all-digital self-calibrating 5ps resolution time-to-digital converter

David Fick; Nurrachman Liu; Zhiyoong Foo; Matthew Fojtik; Jae-sun Seo; Dennis Sylvester; David T. Blaauw

Advanced CMOS technologies have become highly susceptible to process, voltage, and temperature (PVT) variation. The standard approach for addressing this issue is to increase timing margin at the expense of power and performance. One approach to reclaim these losses relies on canary circuits [1] or sensors [2], which are simple to implement but cannot account for local variations. A more recent approach, called Razor, uses delay speculation coupled with error detection and correction to remove all margins but also imposes significant design complexity [3]. In this paper, we present a minimally-invasive in situ delay slack monitor that directly measures the timing margins on critical timing signals, allowing margins due to both global and local PVT variations to be removed.

Design, Automation, and Test in Europe | 2015

Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip

Pai-Yu Chen; Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao; Shimeng Yu

Technology-design co-optimization methodologies of the resistive cross-point array are proposed for implementing the machine learning algorithms on a chip. A novel read and write scheme is designed to accelerate the training process, which realizes fully parallel operations of the weighted sum and the weight update. Furthermore, technology and design parameters of the resistive cross-point array are co-optimized to enhance the learning accuracy, latency and energy consumption, etc. In contrast to the conventional memory design, a set of reverse scaling rules is proposed on the resistive cross-point array to achieve high learning accuracy. These include 1) larger wire width to reduce the IR drop on interconnects thereby increasing the learning accuracy; 2) use of multiple cells for each weight element to alleviate the impact of the device variations, at an affordable expense of area, energy and latency. The optimized resistive cross-point array with peripheral circuitry is implemented at the 65 nm node. Its performance is benchmarked for handwritten digit recognition on the MNIST database using gradient-based sparse coding. Compared to state-of-the-art software approach running on CPU, it achieves >10³ speed-up and >10⁶ energy efficiency improvement, enabling real-time image feature extraction and learning.

International Conference on Computer Aided Design | 2015

Mitigating Effects of Non-ideal Synaptic Device Characteristics for On-chip Learning

Pai-Yu Chen; Binbin Lin; I-Ting Wang; Tuo-Hung Hou; Jieping Ye; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao; Shimeng Yu

The cross-point array architecture with resistive synaptic devices has been proposed for on-chip implementation of weighted sum and weight update in the training process of learning algorithms. However, the non-ideal properties of the synaptic devices available today, such as the nonlinearity in weight update, limited ON/OFF range and device variations, can potentially hamper the learning accuracy. This paper focuses on the impact of these realistic properties on the learning accuracy and proposes the mitigation strategies. Unsupervised sparse coding is selected as a case study algorithm. With the calibration of the realistic synaptic behavior from the measured experimental data, our study shows that the recognition accuracy of MNIST handwriting digits degrades from ~97 % to ~65 %. To mitigate this accuracy loss, the proposed strategies include 1) the smart programming schemes for achieving linear weight update; 2) a dummy column to eliminate the off-state current; 3) the use of multiple cells for each weight element to alleviate the impact of device variations. With the improved synaptic behavior by these strategies, the accuracy increases back to ~95 %, enabling the reliable integration of realistic synaptic devices in the neuromorphic systems.
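As a rough illustration of why strategy 3) helps, assume (this is not stated in the abstract) that a weight is read out as the average of N cell conductances g_i whose errors are independent and share the same standard deviation σ_cell. Under that assumption the spread of the effective weight shrinks with the square root of the number of cells:

```latex
\sigma_{w}
  = \sqrt{\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} g_i\right)}
  = \sqrt{\frac{N\,\sigma_{\text{cell}}^{2}}{N^{2}}}
  = \frac{\sigma_{\text{cell}}}{\sqrt{N}}
```

For example, N = 4 cells per weight would halve the spread under these assumptions, at the cost of four times as many cells. Correlated variations would reduce the benefit.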

Field Programmable Gate Arrays | 2017

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Yufei Ma; Yu Cao; Sarma B. K. Vrudhula; Jae-sun Seo

As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of VGG model.
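The loop structure mentioned in the abstract can be sketched as a plain reference loop nest. This is a minimal, unoptimized illustration and not the paper's accelerator dataflow; the function name `conv_layer`, the array layouts, and the assignment of the four loop levels (kernel window, input feature maps, output pixels, output feature maps) are assumptions made for illustration.

```python
def conv_layer(inputs, weights, stride=1):
    """Reference convolution (no padding, no bias).

    inputs[n_if][in_h][in_w], weights[n_of][n_if][k_h][k_w]
    Returns (outputs[n_of][out_h][out_w], number of MAC operations).
    """
    n_if = len(inputs)
    in_h, in_w = len(inputs[0]), len(inputs[0][0])
    n_of = len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1

    outputs = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_of)]
    macs = 0
    for of in range(n_of):                    # level 4: across output feature maps
        for oy in range(out_h):               # level 3: across output pixels (rows)
            for ox in range(out_w):           # level 3: across output pixels (columns)
                acc = 0.0
                for f in range(n_if):         # level 2: across input feature maps
                    for ky in range(k_h):     # level 1: within the kernel window
                        for kx in range(k_w):
                            pixel = inputs[f][oy * stride + ky][ox * stride + kx]
                            acc += pixel * weights[of][f][ky][kx]
                            macs += 1
                outputs[of][oy][ox] = acc
    return outputs, macs


if __name__ == "__main__":
    # Hypothetical toy layer: 1 input map (3x3 of ones), 2 output maps, 2x2 kernels of ones.
    x = [[[1.0] * 3 for _ in range(3)]]
    w = [[[[1.0] * 2 for _ in range(2)]] for _ in range(2)]
    y, n_macs = conv_layer(x, w)
    print(y, n_macs)
```

Techniques such as unrolling, tiling, and interchange reorder or partition these same loops; the total MAC count stays fixed while the on-chip buffering and memory traffic change.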

International Solid-State Circuits Conference | 2010

High-bandwidth and low-energy on-chip signaling with adaptive pre-emphasis in 90nm CMOS

Jae-sun Seo; Ron Ho; Jon Lexau; Michael Dayringer; Dennis Sylvester; David T. Blaauw

Long on-chip wires pose well-known latency, bandwidth, and energy challenges to the designers of high-performance VLSI systems. Repeaters effectively mitigate wire RC effects but do little to improve their energy costs. Moreover, proliferating repeater farms add significant complexity to full-chip integration, motivating circuits to improve wire performance and energy while reducing the number of repeaters. Such methods include capacitive-mode signaling, which combines a capacitive driver with a capacitive load [1,2]; and current-mode signaling, which pairs a resistive driver with a resistive load [3,4]. While both can significantly improve wire performance, capacitive drivers offer added benefits of reduced voltage swing on the wire and intrinsic driver pre-emphasis. As wires scale, slow slew rates on highly resistive interconnects will still limit wire performance due to inter-symbol interference (ISI) [5]. Further improvements can come from equalization circuits on receivers [2] and transmitters [4] that trade off power for bandwidth. In this paper, we extend these ideas to a capacitively driven pulse-mode wire using a transmit-side adaptive FIR filter and a clockless receiver, and show bandwidth densities of 2.2–4.4 Gb/s/µm over 90nm 5mm links, with corresponding energies of 0.24–0.34 pJ/bit on random data.

IEEE Journal of Solid-State Circuits | 2009

A 2.5 mW 80 dB DR 36 dB SNDR 22 MS/s Logarithmic Pipeline ADC

Jongwoo Lee; Joshua Kang; Sung Hyun Park; Jae-sun Seo; Jens Anders; Jorge Guilherme; Michael P. Flynn

A switched-capacitor logarithmic pipeline analog-to-digital converter (ADC) that does not require squaring or any other complex analog function is presented. This approach is attractive where a high dynamic range (DR), but not a high peak SNDR, is required. A prototype signed, 8-bit 1.5 bit-per-stage logarithmic pipeline ADC is designed and fabricated in 0.18 µm CMOS. The 22 MS/s ADC achieves a measured DR of 80 dB and a measured SNDR of 36 dB, occupies 0.56 mm², and consumes 2.54 mW from a 1.62 V supply. The measured dynamic range figure of merit is 174 dB.

Nanotechnology | 2015

Fully parallel write/read in resistive synaptic array for accelerating on-chip learning.

Ligang Gao; I-Ting Wang; Pai-Yu Chen; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao; Tuo-Hung Hou; Shimeng Yu

A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging and it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in the learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaOx/TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states could be continuously tuned by identical programming pulses. In order to demonstrate the advantages of parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into an array-level simulation, the proposed array architecture is able to achieve ∼95% recognition accuracy of MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
