Publication


Featured research published by M. Hassan Najafi.


Field Programmable Gate Arrays | 2016

Using Stochastic Computing to Reduce the Hardware Requirements for a Restricted Boltzmann Machine Classifier

Bingzhe Li; M. Hassan Najafi; David J. Lilja

Artificial neural networks are powerful computational systems with interconnected neurons. Generally, these networks have a very large number of computation nodes, which forces the designer to use software-based implementations. However, software-based implementations are offline and not suitable for portable or real-time applications. Experiments show that, compared with software-based implementations, FPGA-based systems can greatly speed up the computation, making them suitable for real-time and portable applications. However, the FPGA implementation of neural networks with a large number of nodes is still a challenging task. In this paper, we exploit stochastic bit streams in the Restricted Boltzmann Machine (RBM) to implement the classification stage of the RBM handwritten digit recognition application completely on an FPGA. We use finite-state machine (FSM)-based stochastic circuits to implement the required sigmoid function and use a novel stochastic computing approach to perform all large matrix multiplications. Experimental results show that, when the RBM consists of a large number of neurons, the proposed stochastic architecture tolerates faults far better and requires much less hardware than the deterministic binary approach, which currently cannot be implemented at that scale. Exploiting the features of stochastic circuits, our implementation achieves much better performance than a software-based approach.
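To make the FSM-based sigmoid concrete, here is a minimal software sketch, assuming the classic saturating up/down counter from the stochastic computing literature rather than the exact circuit used in the paper. The input value x in [-1, 1] is assumed to be bipolar-encoded, so each bit is 1 with probability (x + 1) / 2. In steady state the probability that this counter outputs 1 is sigmoid(N * atanh(x)), which is close to sigmoid(N * x) for small |x|. The names `to_bipolar_stream`, `fsm_sigmoid`, and `fsm_sigmoid_probability`, the state count, and the stream length are illustrative choices, not taken from the source.

```python
import math
import random


def to_bipolar_stream(x, length, rng):
    """Encode x in [-1, 1] as a bit stream with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(length)]


def fsm_sigmoid(bits, n_states=8):
    """Saturating up/down counter: step up on a 1, step down on a 0.

    The output bit is 1 whenever the state is in the upper half.
    (Assumed textbook FSM, not necessarily the paper's circuit.)
    """
    state = n_states // 2
    out = []
    for b in bits:
        if b:
            state = min(n_states - 1, state + 1)
        else:
            state = max(0, state - 1)
        out.append(1 if state >= n_states // 2 else 0)
    return out


def fsm_sigmoid_probability(x, n_states=8, length=200_000, seed=1):
    """Estimate P(output = 1) for a bipolar-encoded input value x."""
    rng = random.Random(seed)
    out = fsm_sigmoid(to_bipolar_stream(x, length, rng), n_states)
    return sum(out) / length


def sigmoid(z):
    """Reference logistic function for comparison."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for x in (-0.25, 0.0, 0.5):
        print(x, fsm_sigmoid_probability(x), sigmoid(8 * x))
```

The counter needs only a few flip-flops regardless of the precision of the input, which is the kind of hardware saving the abstract refers to; the price is that accuracy depends on the stream length.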


Asia and South Pacific Design Automation Conference | 2016

Polysynchronous stochastic circuits

M. Hassan Najafi; David J. Lilja; Marc D. Riedel; Kia Bazargan

Clock distribution networks (CDNs) are costly in high-performance ASICs. This paper proposes a new approach: splitting clock domains at a very fine level, down to the level of a handful of gates. Each domain is synchronized with an inexpensive clock signal, generated locally. This is possible by adopting the paradigm of stochastic computation, where signal values are encoded as random bit streams. The design method is illustrated with the synthesis of circuits for applications in signal and image processing.


Application-Specific Systems, Architectures, and Processors | 2015

An FPGA implementation of a Restricted Boltzmann Machine classifier using stochastic bit streams

Bingzhe Li; M. Hassan Najafi; David J. Lilja

Artificial neural networks (ANNs) usually require a very large number of computation nodes and can be implemented either in software or directly in hardware, such as FPGAs. Software-based approaches are offline and not suitable for real-time applications, but they support a large number of nodes. FPGA-based implementations, in contrast, can greatly speed up the computation. However, resource limitations in an FPGA restrict the maximum number of computation nodes in hardware-based approaches. This work exploits stochastic bit streams to implement the Restricted Boltzmann Machine (RBM) handwritten digit recognition application completely on an FPGA. This approach saves a large amount of hardware resources, making the FPGA-based implementation of large ANNs feasible.


Asia and South Pacific Design Automation Conference | 2017

High-speed stochastic circuits using synchronous analog pulses

M. Hassan Najafi; David J. Lilja

The primary advantages of stochastic computing are the very simple hardware required to implement complex operations, its ability to gracefully tolerate noise, and its skew tolerance. Its relatively long latency, however, is a potential barrier to widespread use of this paradigm, particularly when high accuracy is required. This work proposes a new, high-speed, yet accurate approach for implementing stochastic circuits that uses synchronized analog pulses as a new way of representing correlated stochastic numbers.


IEEE Transactions on Very Large Scale Integration Systems | 2017

Time-Encoded Values for Highly Efficient Stochastic Circuits

M. Hassan Najafi; Shiva Jamali-Zavareh; David J. Lilja; Marc D. Riedel; Kia Bazargan; Ramesh Harjani

Stochastic computing (SC) is a promising technique for applications that require low area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm, logical computation is performed on randomized bit streams. In prior work, streams were generated with linear feedback shift registers; these contributed heavily to the hardware cost and consumed a significant amount of power. This paper introduces a new approach for encoding signal values: computation is performed on analog periodic pulse signals. Exploiting pulse width modulation (PWM), time-encoded signals corresponding to specific values are generated by adjusting the frequency and duty cycles of PWM signals. With this approach, the latency, area, and energy consumption are all greatly reduced. Experimental results on image processing applications show up to 99% performance speedup, 98% saving in energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits synthesized with the proposed approach can work as fast and energy-efficiently as a conventional binary design while retaining the fault-tolerance and low-cost advantages of conventional stochastic designs.
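The idea of computing on duty cycles can be sketched in discrete time. The example below is an illustrative model, not the paper's analog circuit: each PWM signal is high for the first `high_steps` of every `period` time steps, so its value is the duty cycle `high_steps / period`. If the two periods share no common factor, every pairing of phases occurs exactly once over `period_a * period_b` steps, so the fraction of time the AND output is high equals the product of the two duty cycles exactly. The function names and the specific periods are assumptions chosen for the illustration.

```python
from fractions import Fraction


def pwm_signal(high_steps, period, total_steps):
    """Discrete-time PWM: high for the first `high_steps` of every `period`."""
    return [1 if (t % period) < high_steps else 0 for t in range(total_steps)]


def pwm_multiply(high_a, period_a, high_b, period_b):
    """AND two PWM signals over period_a * period_b time steps.

    Returns the fraction of time the AND output is high. The result equals
    the product of the duty cycles when the periods are coprime.
    """
    total = period_a * period_b
    a = pwm_signal(high_a, period_a, total)
    b = pwm_signal(high_b, period_b, total)
    high = sum(x & y for x, y in zip(a, b))
    return Fraction(high, total)


if __name__ == "__main__":
    # Duty cycles 10/20 and 8/13; periods 20 and 13 share no common factor.
    print(pwm_multiply(10, 20, 8, 13))
    # Equal periods keep the signals in lockstep, so the AND no longer multiplies.
    print(pwm_multiply(10, 20, 10, 20))
```

Because the signals are periodic rather than random, no random number generator is needed and the result carries no sampling noise, which matches the abstract's point about removing the cost of linear feedback shift registers.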


IEEE Transactions on Computers | 2017

Polysynchronous Clocking: Exploiting the Skew Tolerance of Stochastic Circuits

M. Hassan Najafi; David J. Lilja; Marc D. Riedel; Kia Bazargan

In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed. This allows for a higher working frequency and thus lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for a wide variety of applications.
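A rough software illustration of why stochastic arithmetic is insensitive to stream alignment: an AND gate multiplies two independent unipolar streams, and the estimate depends only on how often 1s coincide, not on which bit positions line up. The sketch below models skew crudely as a whole-bit circular shift of one input; real clock skew is a sub-cycle timing effect in hardware, so this is only an analogy under that stated assumption. All function names, values, and stream lengths are illustrative.

```python
import random


def stochastic_stream(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def and_multiply(a, b):
    """Estimate the product of two independent unipolar streams with AND."""
    return sum(x & y for x, y in zip(a, b)) / len(a)


def skewed(stream, offset):
    """Model skew as a circular right shift by `offset` bits (assumption)."""
    offset %= len(stream)
    if offset == 0:
        return list(stream)
    return stream[-offset:] + stream[:-offset]


if __name__ == "__main__":
    rng = random.Random(7)
    a = stochastic_stream(0.6, 50_000, rng)
    b = stochastic_stream(0.5, 50_000, rng)
    for offset in (0, 1, 17, 1000):
        # Each estimate should stay close to 0.6 * 0.5 regardless of offset.
        print(offset, and_multiply(a, skewed(b, offset)))
```

A conventional binary multiplier would produce garbage if one operand arrived shifted by a bit position; here every bit carries equal weight, so misalignment leaves the expected value unchanged.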


ACM Journal on Emerging Technologies in Computing Systems | 2017

A Reconfigurable Architecture with Sequential Logic-Based Stochastic Computing

M. Hassan Najafi; Peng Li; David J. Lilja; Weikang Qian; Kia Bazargan; Marc D. Riedel

Computations based on stochastic bit streams have several advantages compared to deterministic binary radix computations, including low power consumption, low hardware cost, high fault tolerance, and skew tolerance. To take advantage of this computing technique, previous work proposed a combinational logic-based reconfigurable architecture to perform complex arithmetic operations on stochastic streams of bits. The long execution time and the cost of converting between binary and stochastic representations, however, make the stochastic architectures less energy efficient than the deterministic binary implementations. This article introduces a methodology for synthesizing a given target function stochastically using finite-state machines (FSMs), and enhances and extends the reconfigurable architecture using sequential logic. Compared to the previous approach, the proposed reconfigurable architecture can save hardware area and energy consumption by up to 30% and 40%, respectively, while achieving a higher processing speed. Both stochastic reconfigurable architectures are much more tolerant of soft errors (bit flips) than the deterministic binary radix implementations, and their fault tolerance scales gracefully to very large numbers of errors.


IEEE Transactions on Emerging Topics in Computing | 2018

High Quality Down-Sampling for Deterministic Approaches to Stochastic Computing

M. Hassan Najafi; David J. Lilja

Deterministic approaches to stochastic computing (SC) have recently been proposed to remove the random fluctuation and correlation problems of SC and so produce completely accurate results with stochastic logic. For many applications of SC, such as image processing and neural networks, completely accurate computation is not required for all input data. Decision-making on some input data can be done in a much shorter time using only a good approximation of the input values. While the deterministic approaches to SC are appealing because they generate completely accurate results, the cost of precise results makes them energy inefficient in cases where slight inaccuracy is acceptable. In this work, we propose a high-quality down-sampling method for previously proposed deterministic approaches to SC by generating pseudo-random but accurate stochastic bit streams. The result is much better accuracy for a given number of input bits. Experimental results show that the processing time and the energy consumption of these deterministic methods are improved by up to 61% and 41%, respectively, while allowing a mean absolute error (MAE) of 0.1%, and by up to 500X and 334X, respectively, for an MAE of 3.0%. The accuracy and the energy consumption are also improved compared to conventional random stream-based stochastic implementations.


IEEE Micro | 2017

An Overview of Time-Based Computing with Stochastic Constructs

M. Hassan Najafi; Shiva Jamali-Zavareh; David J. Lilja; Marc D. Riedel; Kia Bazargan; Ramesh Harjani

Computing on time-based data is a recent evolution of research in stochastic computing. As with stochastic computing, complex functions can be computed with remarkably low area cost. Unlike stochastic computing, the latency and energy efficiency are very favorable compared to computations on conventional binary radix. In this article, the authors review and evaluate the design and implementation of arithmetic operations on time-encoded signals, with a particular focus on low power. The advantages, challenges, and potential applications are discussed.


International Conference on Parallel and Distributed Systems | 2015

GPU-Accelerated Nick Local Image Thresholding Algorithm

M. Hassan Najafi; Anirudh Murali; David J. Lilja; John Sartori

Binarization plays an important role in document image processing, particularly in degraded document images. Among all local adaptive image thresholding algorithms, the Nick method has shown excellent binarization performance for degraded document images. However, local image thresholding algorithms, including the Nick method, are computationally intensive, requiring significant time to process input images. In this paper, we propose three CUDA GPU parallel implementations of the Nick local image thresholding algorithm for faster binarization of large images. Our experimental results show that the GPU-accelerated implementations of the Nick method can achieve up to 150x performance speedup on a GeForce GTX 480 compared to its optimized sequential implementation.
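For orientation, the sketch below shows a plain sequential version of NICK-style local thresholding in Python, not the paper's CUDA code. It assumes the threshold formula as it is commonly stated for the NICK method, T = m + k * sqrt((sum(p^2) - m^2) / NP), where m is the local mean, NP the number of pixels in the window, and k a negative constant (values around -0.1 to -0.2 are typical). The window radius, border clipping, and function names are illustrative choices.

```python
import math


def nick_threshold(window_pixels, k=-0.2):
    """NICK threshold for one window (formula as commonly stated, assumed)."""
    n = len(window_pixels)
    mean = sum(window_pixels) / n
    sum_sq = sum(p * p for p in window_pixels)
    return mean + k * math.sqrt((sum_sq - mean * mean) / n)


def nick_binarize(image, radius=1, k=-0.2):
    """Binarize a grayscale image given as a list of rows.

    Each pixel becomes 0 (ink) if it is darker than its local threshold,
    otherwise 255 (background). Windows are clipped at the image border.
    """
    rows, cols = len(image), len(image[0])
    out = [[255] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            if image[r][c] < nick_threshold(window, k):
                out[r][c] = 0
    return out


if __name__ == "__main__":
    # Bright 5x5 page with a single dark pixel in the middle.
    page = [[200] * 5 for _ in range(5)]
    page[2][2] = 20
    for row in nick_binarize(page):
        print(row)
```

Each output pixel depends only on its own window, so the two outer loops have no data dependencies between iterations. That independence is what makes the algorithm a natural fit for one-thread-per-pixel GPU parallelization.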

Collaboration

Top Co-Authors

Kia Bazargan
University of Minnesota

Bingzhe Li
University of Minnesota

Bo Yuan
City University of New York

John Sartori
University of Minnesota