Bernard Brezzo | Researchain

Publication

Featured research published by Bernard Brezzo.

Science | 2014

A million spiking-neuron integrated circuit with a scalable communication network and interface

Paul A. Merolla; John V. Arthur; Rodrigo Alvarez-Icaza; Andrew S. Cassidy; Jun Sawada; Filipp Akopyan; Bryan L. Jackson; Nabil Imam; Chen Guo; Yutaka Nakamura; Bernard Brezzo; Ivan Vo; Steven K. Esser; Rathinakumar Appuswamy; Brian Taba; Arnon Amir; Myron Flickner; William P. Risk; Rajit Manohar; Dharmendra S. Modha

Modeling computer chips on real brains

Computers are nowhere near as versatile as our own brains. Merolla et al. applied our present knowledge of the structure and function of the brain to design a new computer chip that uses the same wiring rules and architecture. The flexible, scalable chip operated efficiently in real time, while using very little power.

Science, this issue p. 668

A large-scale computer chip mimics many features of a real brain.

Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
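The rounded totals in this abstract can be checked with a short calculation. The sketch below assumes each core holds 256 neurons behind a 256 x 256 synaptic crossbar; those per-core figures are not stated in the abstract and are used here only because they reproduce the quoted totals.

```python
# Figure quoted in the abstract.
CORES = 4096

# Assumption (not stated in the abstract): per-core sizes are powers of two.
NEURONS_PER_CORE = 256
INPUTS_PER_CORE = 256  # assumed number of crossbar input lines per core

# Totals implied by the assumed per-core sizes.
total_neurons = CORES * NEURONS_PER_CORE                     # "1 million" neurons
total_synapses = CORES * INPUTS_PER_CORE * NEURONS_PER_CORE  # "256 million" synapses

print(f"neurons:  {total_neurons:,}")
print(f"synapses: {total_synapses:,}")
```

Under these assumptions, "1 million" and "256 million" read as binary-rounded figures (2^20 and 2^28).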

Custom Integrated Circuits Conference | 2011

A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons

Jae-sun Seo; Bernard Brezzo; Yong Liu; Benjamin D. Parker; Steven K. Esser; Robert K. Montoye; Bipin Rajendran; Jose A. Tierno; Leland Chang; Dharmendra S. Modha; Daniel J. Friedman

Efforts to achieve the long-standing dream of realizing scalable learning algorithms for networks of spiking neurons in silicon have been hampered by (a) the limited scalability of analog neuron circuits; (b) the enormous area overhead of learning circuits, which grows with the number of synapses; and (c) the need to implement all inter-neuron communication via off-chip address-events. In this work, a new architecture is proposed to overcome these challenges by combining innovations in computation, memory, and communication, respectively, to leverage (a) robust digital neuron circuits; (b) novel transposable SRAM arrays that share learning circuits, which grow only with the number of neurons; and (c) crossbar fan-out for efficient on-chip inter-neuron communication. Through tight integration of memory (synapses) and computation (neurons), a highly configurable chip comprising 256 neurons and 64K binary synapses with on-chip learning based on spike-timing dependent plasticity is demonstrated in 45nm SOI-CMOS. Near-threshold, event-driven operation at 0.53V is demonstrated to maximize power efficiency for real-time pattern classification, recognition, and associative memory tasks. Future scalable systems built from the foundation provided by this work will open up possibilities for ubiquitous ultra-dense, ultra-low power brain-like cognitive computers.
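The idea of a transposable synapse array can be illustrated with a toy software model. This is a minimal sketch, not the chip's circuit: the class name, the method names, and the simplified learning rule (set a synapse to 1 if its presynaptic neuron fired within a time window, otherwise clear it) are all assumptions made for illustration.

```python
class BinaryCrossbar:
    """Toy n x n array of binary synapses, indexed as w[pre][post].

    With n = 256 this gives 256 * 256 = 65,536 (64K) synapses,
    matching the sizes quoted in the abstract.
    """

    def __init__(self, n=256):
        self.n = n
        self.w = [[0] * n for _ in range(n)]

    def row(self, pre):
        """Row access: every synapse driven by one presynaptic neuron."""
        return list(self.w[pre])

    def column(self, post):
        """Column (transposed) access: every synapse feeding one neuron."""
        return [self.w[pre][post] for pre in range(self.n)]

    def update_on_post_spike(self, post, last_pre_spike, now, window):
        """Hypothetical simplified rule applied to one column.

        A synapse is set if its presynaptic neuron fired within `window`
        time steps before `now`, and cleared otherwise. The update logic
        is shared per neuron (column), not replicated per synapse.
        """
        for pre in range(self.n):
            t = last_pre_spike[pre]
            recent = t is not None and 0 <= now - t <= window
            self.w[pre][post] = 1 if recent else 0


# Small usage example with 4 neurons instead of 256.
xb = BinaryCrossbar(4)
xb.update_on_post_spike(post=2, last_pre_spike=[0, None, 5, 9], now=10, window=5)
print(xb.column(2))
```

Row access serves spike delivery and column access serves learning, which is why one update routine per neuron suffices in this model.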

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip

Filipp Akopyan; Jun Sawada; Andrew S. Cassidy; Rodrigo Alvarez-Icaza; John V. Arthur; Paul A. Merolla; Nabil Imam; Yutaka Nakamura; Pallab Datta; Gi-Joon Nam; Brian Taba; Michael P. Beakes; Bernard Brezzo; Jente B. Kuang; Rajit Manohar; William P. Risk; Bryan L. Jackson; Dharmendra S. Modha

The new era of cognitive computing brings forth the grand challenge of developing systems capable of processing massive amounts of noisy multisensory data. This type of intelligent computing poses a set of constraints, including real-time operation, low power consumption and scalability, which require a radical departure from conventional system design. Brain-inspired architectures offer tremendous promise in this area. To this end, we developed TrueNorth, a 65 mW real-time neurosynaptic processor that implements a non-von Neumann, low-power, highly-parallel, scalable, and defect-tolerant architecture. With 4096 neurosynaptic cores, the TrueNorth chip contains 1 million digital neurons and 256 million synapses tightly interconnected by an event-driven routing infrastructure. The fully digital 5.4 billion transistor implementation leverages existing CMOS scaling trends, while ensuring one-to-one correspondence between hardware and software. With such aggressive design metrics and the TrueNorth architecture breaking path with prevailing architectures, it is clear that conventional computer-aided design (CAD) tools could not be used for the design. As a result, we developed a novel design methodology that includes mixed asynchronous-synchronous circuits and a complete tool flow for building an event-driven, low-power neurosynaptic chip. The TrueNorth chip is fully configurable in terms of connectivity and neural parameters to allow custom configurations for a wide range of cognitive and sensory perception applications. To reduce the system's communication energy, we have adapted existing application-agnostic very large-scale integration CAD placement tools for mapping logical neural networks to the physical neurosynaptic core locations on the TrueNorth chips. With that, we have successfully demonstrated the use of TrueNorth-based systems in multiple applications, including visual object recognition, with higher performance and orders of magnitude lower power consumption than the same algorithms run on von Neumann architectures. The TrueNorth chip and its tool flow serve as building blocks for future cognitive systems, and give designers an opportunity to develop novel brain-inspired architectures and systems based on the knowledge obtained from this paper.

Field Programmable Gate Arrays | 2012

A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation

Sameh W. Asaad; Ralph Bellofatto; Bernard Brezzo; Chuck Haymes; Mohit Kapur; Benjamin D. Parker; Thomas Roewer; Proshanta Saha; Todd E. Takken; Jose A. Tierno

Software-based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Blue Gene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking and synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting full-chip simulation of the Blue Gene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic-level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Blue Gene/Q.
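To put the quoted speedup in perspective, the sketch below derives the implied software simulation rate and compares wall-clock times. The one-billion-cycle workload is a hypothetical figure chosen for illustration, and because the abstract says "over 100,000 times", the software rate computed here is an upper bound.

```python
FPGA_HZ = 4_000_000  # simulated processor clock from the abstract
SPEEDUP = 100_000    # "over 100,000 times faster" (a lower bound)

def software_rate_upper_bound_hz(fpga_hz=FPGA_HZ, speedup=SPEEDUP):
    """Upper bound on the software simulator's rate in simulated cycles/s."""
    return fpga_hz / speedup

def seconds_to_simulate(cycles, rate_hz):
    """Wall-clock seconds needed to simulate `cycles` at `rate_hz`."""
    return cycles / rate_hz

WORKLOAD_CYCLES = 1_000_000_000  # hypothetical workload, not from the source

sw_hz = software_rate_upper_bound_hz()
fpga_seconds = seconds_to_simulate(WORKLOAD_CYCLES, FPGA_HZ)
sw_days = seconds_to_simulate(WORKLOAD_CYCLES, sw_hz) / 86_400

print(f"software rate: at most {sw_hz:.0f} cycles/s")
print(f"FPGA platform: {fpga_seconds:.0f} s")
print(f"software:      at least {sw_days:.0f} days")
```

At these rates a workload that takes minutes on the FPGA platform would take most of a year in software, which is consistent with the abstract's point about early software development.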

International Conference on Parallel Architectures and Compilation Techniques | 2012

Database analytics acceleration using FPGAs

Bharat Sukhwani; Hong Min; Mathew S. Thoennes; Parijat Dube; Balakrishna R. Iyer; Bernard Brezzo; Donna N. Dillenberger; Sameh W. Asaad

Business growth and technology advancements have resulted in growing amounts of enterprise data. To gain valuable business insight and competitive advantage, businesses demand the capability of performing real-time analytics on such data. This, however, involves expensive query operations that are very time consuming on traditional CPUs. Additionally, in traditional database management systems (DBMS), the CPU resources are dedicated to mission-critical transactional workloads. Offloading expensive analytics query operations to a co-processor can allow efficient execution of analytics workloads in parallel with transactional workloads. In this paper, we present a Field Programmable Gate Array (FPGA) based acceleration engine for database operations in analytics queries. The proposed solution provides a mechanism for a DBMS to seamlessly harness the FPGA compute power without requiring any changes in the application or the existing data layout. Using a software-programmed query control block, the accelerator can be tailored to execute different queries without reconfiguration. Our prototype is implemented in a PCIe-attached FPGA system and is integrated into a commercial DBMS platform. The results demonstrate up to 94% CPU savings on real customer data compared to the baseline software cost with up to an order of magnitude speedup in the offloaded computations and up to 6.2× improvement in end-to-end performance.

International Solid-State Circuits Conference | 1996

Single chip 4×500 Mbaud CMOS transceiver

Albert X. Widmer; Kevin R. Wrenner; Herschel A. Ainspan; Ben Parker; Pierre Austruy; Bernard Brezzo; Anne-Marie Haen; John F. Ewen; Mehmet Soyuer; Alain Blanc; Jean-Claude Abbiate; Alina Deutsch; Hyun J. Shin

This CMOS chip replaces a 72-wire interface with 4 serial, duplex links, for relief of interconnect congestion in applications such as large switching systems. The design supports transmission at 1.6 Gb/s per direction in full-duplex mode and provides the user with a transparent interface. The data source provides fixed-length synchronous packets segmented into 4 parallel bytes along with parity and flag bits. The packet size can be programmed up to 4×64 B with a parameter loaded from an external controller. Data packets can be transmitted contiguously. During idle periods that are marked by a flag, the circuit generates and transmits fill packets, which start with a non-data Comma character. The Comma marks both byte and packet boundaries on a serial link. The fill packets carry an idle sequence or diagnostic and control information such as Not Operational, Remote Wrap, or Unwrap. Each link carries 400 Mb/s, corresponding to 500 Mbaud after 8B/10B encoding.
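The rate figures in this abstract follow from the 8B/10B code, which sends 10 line bits for every 8 data bits. The short calculation below reproduces them; the function and variable names are illustrative only.

```python
def line_rate_mbaud(payload_mbps):
    """Line rate after 8B/10B encoding: 10 line bits per 8 payload bits."""
    return payload_mbps * 10 / 8

LINKS = 4                    # serial links per direction, from the abstract
PAYLOAD_MBPS_PER_LINK = 400  # payload rate per link, from the abstract

per_link_mbaud = line_rate_mbaud(PAYLOAD_MBPS_PER_LINK)
aggregate_gbps = LINKS * PAYLOAD_MBPS_PER_LINK / 1000

print(f"per-link line rate: {per_link_mbaud:.0f} Mbaud")
print(f"aggregate payload:  {aggregate_gbps:.1f} Gb/s per direction")
```

The 25% overhead buys DC balance and bounded run lengths, and it reserves non-data symbols such as the Comma character used here to mark byte and packet boundaries.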

IEEE International Conference on High Performance Computing Data and Analytics | 2014

Real-time scalable cortical computing at 46 giga-synaptic OPS/watt with ~100× speedup in time-to-solution and ~100,000× reduction in energy-to-solution

Andrew S. Cassidy; Rodrigo Alvarez-Icaza; Filipp Akopyan; Jun Sawada; John V. Arthur; Paul A. Merolla; Pallab Datta; Marc Gonzalez Tallada; Brian Taba; Alexander Andreopoulos; Arnon Amir; Steven K. Esser; Jeff Kusnitz; Rathinakumar Appuswamy; Chuck Haymes; Bernard Brezzo; Roger Moussalli; Ralph Bellofatto; Christian W. Baks; Michael Mastro; Kai Schleupen; Charles Edwin Cox; Ken Inoue; Steven Edward Millman; Nabil Imam; Emmett McQuinn; Yutaka Nakamura; Ivan Vo; Chen Guo; Don Nguyen

Drawing on neuroscience, we have developed a parallel, event-driven kernel for neurosynaptic computation that is efficient with respect to computation, memory, and communication. Building on the previously demonstrated highly optimized software expression of the kernel, here we demonstrate TrueNorth, a co-designed silicon expression of the kernel. TrueNorth achieves five orders of magnitude reduction in energy-to-solution and two orders of magnitude speedup in time-to-solution when running computer vision applications and complex recurrent neural network simulations. Breaking path with the von Neumann architecture, TrueNorth is a 4,096-core, 1 million neuron, and 256 million synapse brain-inspired neurosynaptic processor that consumes 65 mW of power running in real time and delivers performance of 46 Giga-Synaptic OPS/Watt. We demonstrate seamless tiling of TrueNorth chips into arrays, forming a foundation for cortex-like scalability. TrueNorth's unprecedented time-to-solution, energy-to-solution, size, scalability, and performance, combined with the underlying flexibility of the kernel, enable a broad range of cognitive applications.
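The efficiency and power figures can be combined into a rough throughput estimate. This assumes both numbers describe the same operating point, which the abstract does not state explicitly, so the result is indicative only.

```python
EFFICIENCY_SOPS_PER_WATT = 46e9  # 46 Giga-Synaptic OPS/Watt, from the abstract
POWER_WATTS = 0.065              # 65 mW, from the abstract

# Assumption: efficiency and power were measured at the same operating point.
sops = EFFICIENCY_SOPS_PER_WATT * POWER_WATTS

# Energy per synaptic operation follows from the efficiency figure alone.
picojoules_per_op = 1e12 / EFFICIENCY_SOPS_PER_WATT

print(f"throughput:    {sops / 1e9:.2f} giga-synaptic ops/s")
print(f"energy per op: {picojoules_per_op:.1f} pJ")
```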

International Journal of Parallel Programming | 2015

A Hardware/Software Approach for Database Query Acceleration with FPGAs

Bharat Sukhwani; Mathew S. Thoennes; Hong Min; Parijat Dube; Bernard Brezzo; Sameh W. Asaad; Donna N. Dillenberger

Complex analytics queries often involve expensive operations that may require large computational runtimes, leading to slow query responsiveness and hampering real-time performance. Moreover, running these expensive analytics queries inside traditional online transaction processing (OLTP) systems for real-time analytics can affect the performance of mission-critical OLTP queries. On the other hand, support for real-time analytics is considered vital for important business insights and improved market responsiveness. In this paper, we try to address the needs of real-time analytics by enabling hardware acceleration of complex database query operations such as predicate evaluation, sort and projection. While projection helps reduce the amount of data being processed by subsequent query operations, sort is central to most database queries, even those not involving an explicit sort operation. Our system involves an FPGA-based composable accelerator for offloading the analytics queries from the host CPU running the OLTP workload. The FPGA-accelerated database system contains accelerator kernels for various database operations and automatic transformation of query operations into calls to these hardware kernels for seamless integration of the accelerator into the database system. Based on the query semantics, each accelerator kernel can be tailored by software to execute specific database operations, and different kernels can be fused together to compose a query accelerator. Our query transformation algorithm creates a query-specific control block to customize the accelerator without requiring FPGA reconfiguration.
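The control-block idea can be mimicked in software: a fixed engine stays unchanged while a small data structure tells it which predicates and projections to apply. The sketch below is purely illustrative; the control-block layout, the function names, and the supported operators are hypothetical and do not describe the paper's actual format.

```python
import operator

# Fixed set of comparison units the "engine" supports (hypothetical).
OPS = {"eq": operator.eq, "lt": operator.lt, "gt": operator.gt}

def build_control_block(predicates, projected_columns):
    """Build a query-specific control block (hypothetical layout).

    predicates: list of (column_index, op_name, constant), all ANDed.
    projected_columns: column indexes to keep in the output.
    """
    return {"predicates": list(predicates), "project": list(projected_columns)}

def run_kernel(control_block, rows):
    """Fixed engine: its code never changes, only the control block does."""
    out = []
    for row in rows:
        keep = all(OPS[op](row[col], const)
                   for col, op, const in control_block["predicates"])
        if keep:
            out.append(tuple(row[c] for c in control_block["project"]))
    return out

rows = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]

# Two different queries run on the same engine by swapping control blocks.
query_1 = build_control_block([(2, "gt", 10)], [0, 1])
query_2 = build_control_block([(0, "lt", 3), (2, "gt", 5)], [1])

print(run_kernel(query_1, rows))
print(run_kernel(query_2, rows))
```

Swapping the control block changes the query without touching the engine, which is the software analogue of avoiding FPGA reconfiguration.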

Field-Programmable Custom Computing Machines | 2011

High-Throughput, Lossless Data Compression on FPGAs

Bharat Sukhwani; Bulent Abali; Bernard Brezzo; Sameh W. Asaad

Lossless compression is often used before writing data to a storage medium or transmitting it across a transmission medium. Compression saves storage space or transmission bandwidth; a decompression operation is performed when the data is subsequently read. Though this scheme has clear benefits, the execution time of compression and decompression is critical to its application in real-time systems. Software compression utilities are often slow, leading to degraded system performance. Hardware-based solutions, on the other hand, often drive large resource requirements and are not amenable to supporting future algorithmic changes. In the current article, we present a high-throughput, streaming, lossless compression algorithm and its efficient implementation on FPGAs. The proposed solution provides a peak throughput of 1 GB/s per engine, with a sustained overall measured throughput of 2.66 GB/s on a PCIe-based FPGA board with two compression and two decompression engines. This result represents an overall speedup of 13.6× over the reference software implementation. The proposed design is very lean and, with multiple engines running in parallel, provides a path to potential speedups of up to two orders of magnitude. In the current implementation, the achievable overall throughput is limited only by the available PCIe bus bandwidth.
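The throughput figures can be related with a quick calculation. The sketch assumes the 13.6× speedup compares the measured overall throughput against the software baseline, and that the four engines could in principle run at peak simultaneously; neither assumption is spelled out in the abstract.

```python
PEAK_GBPS_PER_ENGINE = 1.0  # peak throughput per engine, from the abstract
ENGINES = 4                 # two compression plus two decompression engines
MEASURED_GBPS = 2.66        # sustained overall measured throughput
SPEEDUP = 13.6              # overall speedup over reference software

# Assumption: the speedup is measured throughput over software throughput.
software_gbps = MEASURED_GBPS / SPEEDUP

# Assumption: all four engines could run at peak at the same time.
aggregate_peak_gbps = PEAK_GBPS_PER_ENGINE * ENGINES
fraction_of_peak = MEASURED_GBPS / aggregate_peak_gbps

print(f"implied software throughput: {software_gbps:.3f} GB/s")
print(f"aggregate engine peak:       {aggregate_peak_gbps:.1f} GB/s")
print(f"measured / peak:             {fraction_of_peak:.1%}")
```

The gap between the measured figure and the aggregate engine peak is consistent with the abstract's remark that PCIe bandwidth, not the engines, is the limiting factor.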

International Solid-State Circuits Conference | 1996

Single chip 4×500 Mbaud CMOS transceiver

Albert X. Widmer; Kevin R. Wrenner; Herschel A. Ainspan; Ben Parker; Pierre Austruy; Bernard Brezzo; A.-M. Haen; John F. Ewen; Mehmet Soyuer; Alain Blanc; Jean-Claude Abbiate; A. Deutsch; Hyun J. Shin

A CMOS chip containing four 500-MBd serializer/deserializer pairs has been designed to relieve interconnect congestion in an ATM switch system. The 9.7 × 9.7 mm² chip, fabricated in a 0.8-µm technology, is packaged on a ceramic ball grid array and dissipates 3.5 W. It replaces a 72-wire parallel interface with an eight-line serial interface transparent to the user and supports transmission at 1.6 Gb/s per direction in full-duplex mode. Virtually error-free operation in a system environment over electrical serial links having up to 9 dB loss at 500 MHz has been realized using signal predistortion for the serial bit stream and PLL clock recovery for each of the four receivers. Interface timing and serial-link driver strength are programmable.