Featured Research

Hardware Architecture

A Hardware-Aware Heuristic for the Qubit Mapping Problem in the NISQ Era

Due to several physical limitations in the realisation of quantum hardware, today's quantum computers are classified as Noisy Intermediate-Scale Quantum (NISQ) hardware. NISQ hardware is characterized by a small number of qubits (50 to a few hundred) and noisy operations. Moreover, current realisations of superconducting quantum chips do not offer the ideal all-to-all connectivity between qubits but rather at most a nearest-neighbour connectivity. All these hardware restrictions add supplementary low-level requirements that need to be addressed before a quantum circuit can be submitted to an actual chip. Satisfying these requirements is a tedious task for the programmer, so the task of adapting the quantum circuit to a given hardware platform is instead left to the compiler. In this paper, we propose a Hardware-Aware mapping transition algorithm (HA) that takes calibration data into account with the aim of improving the overall fidelity of the circuit. Evaluation results on IBM quantum hardware show that our HA approach can outperform the state of the art in terms of both the number of additional gates and circuit fidelity.
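
As an illustration of the kind of decision such a mapping pass makes, the sketch below scores candidate SWAP insertions by combining distance reduction with calibrated two-qubit error rates. It is a minimal Python sketch using a hypothetical four-qubit coupling graph and made-up error rates, not the authors' HA algorithm.

```python
"""Sketch of a hardware-aware SWAP-selection heuristic (not the paper's HA
algorithm): candidate SWAPs are scored by how much they shorten the distance
between the two qubits of a pending CNOT, weighted by the calibrated
reliability of the coupler the SWAP would use."""

import networkx as nx

# Hypothetical calibration data: per-edge two-qubit gate error rates.
coupling = nx.Graph()
coupling.add_weighted_edges_from(
    [(0, 1, 0.010), (1, 2, 0.025), (2, 3, 0.008), (1, 3, 0.030)],
    weight="error",
)

def swap_score(mapping, gate, edge):
    """Score a SWAP on physical `edge` for a pending CNOT `gate` = (q0, q1).

    Higher is better: we reward distance reduction between the CNOT's two
    physical qubits and penalise SWAPs placed on error-prone couplers.
    """
    q0, q1 = (mapping[q] for q in gate)
    before = nx.shortest_path_length(coupling, q0, q1)

    # Apply the candidate SWAP to a copy of the logical-to-physical mapping.
    trial = dict(mapping)
    a, b = edge
    for logical, phys in mapping.items():
        if phys == a:
            trial[logical] = b
        elif phys == b:
            trial[logical] = a
    t0, t1 = (trial[q] for q in gate)
    after = nx.shortest_path_length(coupling, t0, t1)

    reliability = (1.0 - coupling[a][b]["error"]) ** 3  # one SWAP = 3 CNOTs
    return (before - after) * reliability

def best_swap(mapping, gate):
    return max(coupling.edges, key=lambda e: swap_score(mapping, gate, e))

# Logical qubits 0..3 initially mapped one-to-one to physical qubits 0..3.
mapping = {0: 0, 1: 1, 2: 2, 3: 3}
print(best_swap(mapping, (0, 3)))  # picks the low-error coupler (0, 1)
```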

Read more
Hardware Architecture

A Low Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

Massive data transfer between processing and storage units has become the leading bottleneck in modern von Neumann computing systems, especially when they are used for Artificial Intelligence (AI) tasks. Computing-in-Memory (CIM) has shown great potential to reduce both latency and power consumption. However, conventional analog CIM schemes suffer from reliability issues that can significantly degrade the accuracy of the computation. Recently, CIM schemes with digitized input data and weights have been proposed for highly reliable computing, but the properties of the digital memory and input data are not fully exploited. This paper presents a novel low-power CIM scheme that further reduces power consumption by applying a Modified Radix-4 (M-Rd4) Booth algorithm to the inputs and a Modified Canonical Signed Digit (M-CSD) encoding to the network weights. The simulation results show that M-Rd4 and M-CSD reduce the ratio of 1×1 by 78.5% on LeNet and 80.2% on AlexNet, and improve computing efficiency by 41.6% on average. The computing efficiency at fixed-point 8-bit precision is 60.68 TOPS/W.
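
To illustrate why these encodings help, the sketch below implements the standard radix-4 Booth recoding of an input and the standard canonical signed digit (CSD) recoding of a weight, both of which reduce the number of non-zero digits and hence switching activity; the paper's modified variants (M-Rd4, M-CSD) are not reproduced here.

```python
"""Sketch of the classic encodings this work builds on: radix-4 Booth recoding
and canonical signed digit (CSD) recoding. Both shrink the number of non-zero
digits, which is what cuts activity in a digital in-memory MAC array."""

def booth_radix4(x, bits=8):
    """Return radix-4 Booth digits (each in {-2,-1,0,1,2}), LSB digit first."""
    digits = []
    x &= (1 << bits) - 1            # two's-complement view of the input
    prev = 0                        # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        group = ((x >> i) & 0b11) << 1 | prev    # bits b_{i+1} b_i b_{i-1}
        digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[group]
        digits.append(digit)
        prev = (x >> (i + 1)) & 1
    return digits

def csd(w, bits=8):
    """Return canonical signed digits (each in {-1,0,1}), LSB digit first.

    Assumes the recoded value fits in `bits` signed digits.
    """
    digits = []
    for _ in range(bits):
        if w & 1:
            d = 2 - (w & 0b11)      # 01 -> +1, 11 -> -1 (carry propagates)
            w -= d
            digits.append(d)
        else:
            digits.append(0)
        w >>= 1
    return digits

print(csd(7, 4))           # [-1, 0, 0, 1]: 7 = -1 + 8, two non-zero digits
print(booth_radix4(6, 4))  # [-2, 2]: -2*4**0 + 2*4**1 == 6
```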

Read more
Hardware Architecture

A Machine Learning Pipeline Stage for Adaptive Frequency Adjustment

A machine learning (ML) design framework is proposed for adaptively adjusting clock frequency based on the propagation delay of individual instructions. A random forest model is trained to classify propagation delays in real time, utilizing the current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within a baseline processor. The modified system is experimentally tested at the gate level in 45 nm CMOS technology, exhibiting a speedup of 70% and an energy reduction of 30% with coarse-grained ML classification. A speedup of 89% with a 15.5% reduction in energy consumption is demonstrated at finer classification granularity.
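
A minimal sketch of the offline training step is shown below, using scikit-learn and purely synthetic features (operation type, operand leading-one positions, and a history feature) with a toy delay model as the label; the paper's actual feature extraction, dataset, and the Verilog realization of the trained forest are not reproduced.

```python
"""Minimal sketch of the offline step: train a random forest to classify an
instruction's propagation delay from simple features. Features and labels are
synthetic placeholders, not the paper's dataset; the trained forest would then
be mapped to logic in an extra pipeline stage."""

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features: operation type (0=add, 1=mul), leading-one positions
# of the two operands, and of the previous result (computation history).
op_type  = rng.integers(0, 2, n)
msb_a    = rng.integers(0, 32, n)
msb_b    = rng.integers(0, 32, n)
msb_prev = rng.integers(0, 32, n)
X = np.column_stack([op_type, msb_a, msb_b, msb_prev])

# Toy ground truth: multiplies of wide operands tend to be "long" (label 1);
# the clock would only be raised when the "short" class (label 0) is predicted.
delay_ns = 0.4 + 0.05 * np.maximum(msb_a, msb_b) + 0.8 * op_type
y = (delay_ns > 1.5).astype(int)

clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
clf.fit(X[:4000], y[:4000])
print("held-out accuracy:", clf.score(X[4000:], y[4000:]))
```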

Read more
Hardware Architecture

A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs

The FM-index is an efficient data structure for string search and is widely used in next-generation sequencing (NGS) applications such as sequence alignment and de novo assembly. Recently, FM-indexing has even been performed down to the read level, raising demand for an efficient FM-index construction algorithm. In this work, we propose a hardware-compatible Self-Aided Incremental Indexing (SAII) algorithm and its hardware architecture. This novel algorithm builds the FM-index with no memory overhead, and the hardware system realizing it can be very compact. A parallel architecture and a special prefetch controller are designed to enhance computational efficiency. An SAII-based FM-index constructor is implemented on an Altera Stratix V FPGA board. The presented constructor supports DNA sequences of up to 131,072 bp, which is enough for small-scale references and for reads obtained from current major platforms. Because the proposed constructor requires very few hardware resources, it can be easily integrated into different hardware accelerators designed for FM-index-based applications.
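
For readers unfamiliar with the data structure, the sketch below builds a basic FM-index (BWT, C table, occurrence counts) in software and uses it for backward search; this naive construction is for illustration only and is unrelated to the paper's incremental, memory-overhead-free SAII algorithm.

```python
"""Software sketch of a basic FM-index (BWT + C table + occurrence counts)
and backward search, to show the structure the SAII hardware constructor
builds. The naive construction below is for clarity only; it is not the
paper's incremental algorithm."""

def fm_index(text):
    text += "$"                              # unique terminator, smallest symbol
    rotations = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in rotations)

    # C[c]: number of symbols in the text strictly smaller than c.
    alphabet = sorted(set(text))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i]: occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return bwt, C, occ

def backward_search(pattern, C, occ, n):
    lo, hi = 0, n                            # current suffix-array interval
    for c in reversed(pattern):
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo                           # number of matches

bwt, C, occ = fm_index("ACGTACGTACG")
print(backward_search("ACG", C, occ, len(bwt)))   # -> 3
```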

Read more
Hardware Architecture

A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference

Low bit-width Quantized Neural Networks (QNNs) enable the deployment of complex machine learning models on constrained devices such as microcontrollers (MCUs) by reducing their memory footprint. Fine-grained asymmetric quantization (i.e., different bit-widths assigned to weights and activations on a tensor-by-tensor basis) is a particularly interesting scheme for maximizing accuracy under a tight memory constraint. However, the lack of sub-byte instruction set architecture (ISA) support in state-of-the-art microprocessors makes it hard to fully exploit this extreme quantization paradigm in embedded MCUs. Support for sub-byte and asymmetric QNNs would require many precision formats and an exorbitant amount of opcode space. In this work, we attack this problem with status-based SIMD instructions: rather than encoding precision explicitly, each operand's precision is set dynamically in a core status register. We propose MPIC (Mixed Precision Inference Core), a novel RISC-V ISA core based on the open-source RI5CY core. Our approach enables full support for mixed-precision QNN inference with different combinations of operands at 16-, 8-, 4- and 2-bit precision, without adding any extra opcodes or increasing the complexity of the decode stage. Our results show that MPIC improves both performance and energy efficiency by a factor of 1.1-4.9x compared to software-based mixed-precision execution on RI5CY; with respect to commercially available Cortex-M4 and M7 microcontrollers, it delivers 3.6-11.7x better performance and 41-155x higher efficiency.
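
The sketch below is a purely functional Python model of the status-based idea: a packed sum-of-dot-products routine whose operand precisions come from an emulated status register rather than from the instruction encoding. The register name, field layout, and element pairing are illustrative assumptions, not the MPIC micro-architecture.

```python
"""Functional sketch of status-based mixed precision: the packed dot product
below does not encode operand precision in the "instruction"; it reads it from
an emulated core status register. This only models the idea, not MPIC."""

status = {"act_bits": 4, "wgt_bits": 2}      # emulated status-register fields

def _unpack(word, bits, signed):
    """Split a 32-bit word into 32 // bits fields, LSB field first."""
    mask, out = (1 << bits) - 1, []
    for i in range(32 // bits):
        v = (word >> (i * bits)) & mask
        if signed and v >= 1 << (bits - 1):
            v -= 1 << bits                   # sign-extend
        out.append(v)
    return out

def sdotp(acc, act_word, wgt_word):
    """Sum-of-dot-products: precision comes from `status`, not the opcode."""
    a = _unpack(act_word, status["act_bits"], signed=False)  # uint activations
    w = _unpack(wgt_word, status["wgt_bits"], signed=True)   # int weights
    # At different precisions the two words unpack to different element
    # counts; this sketch simply pairs elements up to the shorter vector.
    return acc + sum(x * y for x, y in zip(a, w))

# 8 x 4-bit activations packed in one word, 2-bit signed weights in another.
acts = 0x01020304
wgts = 0b01_11_01_10_01_11_01_10             # eight 2-bit fields, rest zero
print(sdotp(0, acts, wgts))                  # -> -16
```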

Read more
Hardware Architecture

A Modern Primer on Processing in Memory

Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well; (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems; (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are felt especially severely in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, the proliferation of different main memory standards and chips specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.

Read more
Hardware Architecture

A Novel Low Power Non-Volatile SRAM Cell with Self Write Termination

A non-volatile SRAM cell is proposed for low-power applications using Spin Transfer Torque Magnetic Tunnel Junction (STT-MTJ) devices. This novel cell offers non-volatile storage, thus allowing selected blocks of SRAM to be switched off during standby operation. To further increase the power savings, a write termination circuit is designed that detects completion of the MTJ write and closes the bidirectional current path for the MTJ. A reduction of 25.81% in the number of transistors and a reduction of 2.95% in power consumption are achieved in comparison to prior work on write termination circuits.

Read more
Hardware Architecture

A Novel Method for Scalable VLSI Implementation of Hyperbolic Tangent Function

Hyperbolic tangent and sigmoid functions are used as non-linear activation units in artificial and deep neural networks. Since these networks are computationally expensive, customized accelerators are designed to achieve the required performance at lower cost and power. The activation function and MAC units are the key building blocks of these neural network accelerators. A low-complexity and accurate hardware implementation of the activation function is required to meet the performance and area targets of such accelerators. Moreover, a scalable implementation is required, as recent studies show that DNNs may use different precision in different layers. This paper presents a novel method, based on the trigonometric expansion properties of the hyperbolic tangent, for a hardware implementation that can easily be tuned to different accuracy and precision requirements.
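
One way to exploit such expansion properties is sketched below: the input is split into a coarse part served by a small tanh lookup table and a fine residual served by a short polynomial, combined through the identity tanh(a+b) = (tanh a + tanh b) / (1 + tanh a * tanh b). Table size and polynomial order are illustrative choices, not the paper's architecture.

```python
"""Sketch of a tanh approximation built on the addition identity
tanh(a+b) = (tanh a + tanh b) / (1 + tanh a * tanh b): a coarse lookup table
plus a cheap series for the residual. Parameters are illustrative only."""

import math

STEP = 0.25                                              # coarse step size
LUT = {k: math.tanh(k * STEP) for k in range(0, 33)}     # covers |x| <= 8

def tanh_approx(x):
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    k = min(int(x / STEP), 32)               # coarse index into the table
    a = LUT[k]                               # tanh of the coarse part
    r = x - k * STEP                         # fine residual in [0, STEP)
    b = r - r**3 / 3                         # short series for tanh(r)
    return sign * (a + b) / (1 + a * b)      # addition identity

worst = max(abs(tanh_approx(x / 100) - math.tanh(x / 100))
            for x in range(-800, 801))
print(f"max abs error on [-8, 8]: {worst:.2e}")
```

Shrinking STEP (a larger table) or adding series terms trades area for accuracy, which is the kind of tuning knob the abstract refers to.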

Read more
Hardware Architecture

A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug

Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on the collaborative insights of architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct design behavior from partially observed silicon traces in the presence of ambiguous and noisy data. The output of the approach is the set of all potential interpretations of a system's internal executions, abstracted to the system-level protocols. To support the trace analysis approach, a companion trace signal selection framework guided by the system-level protocols is also presented, and its impact on the complexity and accuracy of the analysis is discussed. The approach and the framework have been evaluated on a multi-core system-on-chip prototype that implements a set of common industrial system-level protocols.

Read more
Hardware Architecture

A RISC-V SystemC-TLM simulator

This work presents a SystemC-TLM based simulator for a RISC-V microcontroller. The simulator is focused on simplicity and easy extensibility. It is built around an instruction set simulator (ISS) that supports the full RISC-V ISA and the M, A, C, Zicsr and Zifencei extensions. The ISS is encapsulated in a TLM-2 wrapper that enables it to communicate with any other TLM-2 compatible module. The simulator also includes a very basic set of peripherals to enable a complete SoC simulation. The running code can be compiled with standard tools and standard C libraries without modification. The simulator correctly executes the riscv-compliance suite. The entire simulator is published as a Docker image to ease its installation and use by developers. A port of FreeRTOS v10.2.1 for the simulated SoC is also published.
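
As a conceptual illustration of what sits inside the TLM-2 wrapper, the sketch below is a toy fetch/decode/execute loop in Python that handles just two RV32I instructions (ADDI and ADD); the actual simulator is a C++/SystemC design covering the full ISA and is not reproduced here.

```python
"""Toy fetch/decode/execute loop of an instruction set simulator, decoding
only RV32I ADDI and ADD. Purely illustrative of the ISS concept; the real
simulator is a SystemC-TLM design, not this Python sketch."""

def sign_extend(value, bits):
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def step(regs, mem, pc):
    inst = mem[pc]                                   # fetch (word-addressed toy memory)
    opcode = inst & 0x7F
    rd, rs1, rs2 = (inst >> 7) & 0x1F, (inst >> 15) & 0x1F, (inst >> 20) & 0x1F
    funct3 = (inst >> 12) & 0x7
    if opcode == 0x13 and funct3 == 0:               # ADDI
        imm = sign_extend(inst >> 20, 12)
        regs[rd] = (regs[rs1] + imm) & 0xFFFFFFFF
    elif opcode == 0x33 and funct3 == 0:             # ADD
        regs[rd] = (regs[rs1] + regs[rs2]) & 0xFFFFFFFF
    else:
        raise NotImplementedError(hex(inst))
    regs[0] = 0                                      # x0 is hard-wired to zero
    return pc + 1

# addi x1, x0, 5 ; addi x2, x0, 7 ; add x3, x1, x2
program = {0: 0x00500093, 1: 0x00700113, 2: 0x002081B3}
regs, pc = [0] * 32, 0
while pc in program:
    pc = step(regs, program, pc)
print(regs[3])    # -> 12
```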

Read more
