Featured Researches

Hardware Architecture

Aging-Aware Request Scheduling for Non-Volatile Main Memory

Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in the peripheral circuitry of each memory bank based on the bank's utilization. Second, we develop an intelligent memory request scheduler that exploits this aging model at run time to de-stress the peripheral circuitry of a memory bank only when its aging exceeds a critical threshold. Third, we introduce an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory.

Read more
Hardware Architecture

Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node

Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive as a power-saving technique for both logic and SRAMs. In this work, we introduce the first fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC exploits a hybrid memory scheme where error-vulnerable SRAMs are complemented by reliable standard-cell memories to safely store critical data under aggressive voltage scaling. On a prototype in 22nm FDX technology, we demonstrate that both the logic and SRAM voltage can be dropped to 0.5Vwithout any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving energy efficiency by 2.2X w.r.t. nominal conditions. Furthermore, we show that the supply voltage can be dropped to 0.42V (50% of nominal) while keeping more than99% of the nominal accuracy (with a bit error rate ~1/1000). In this operating point, our prototype performs 4Gop/s (15.4Inference/s on the CIFAR-10 dataset) by computing up to 13binary ops per pJ, achieving 22.8 Inference/s/mW while keeping within a peak power envelope of 674uW - low enough to enable always-on operation in ultra-low power smart cameras, long-lifetime environmental sensors, and insect-sized pico-drones.

Read more
Hardware Architecture

An 8-bit In Resistive Memory Computing Core with Regulated Passive Neuron and Bit Line Weight Mapping

The rapid development of Artificial Intelligence (AI) and Internet of Things (IoT) increases the requirement for edge computing with low power and relatively high processing speed devices. The Computing-In-Memory(CIM) schemes based on emerging resistive Non-Volatile Memory(NVM) show great potential in reducing the power consumption for AI computing. However, the device inconsistency of the non-volatile memory may significantly degenerate the performance of the neural network. In this paper, we propose a low power Resistive RAM (RRAM) based CIM core to not only achieve high computing efficiency but also greatly enhance the robustness by bit line regulator and bit line weight mapping algorithm. The simulation results show that the power consumption of our proposed 8-bit CIM core is only 3.61mW (256*256). The SFDR and SNDR of the CIM core achieve 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16 on ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) in 8-bit mode, respectively.

Read more
Hardware Architecture

An Analytical Model for Performance and Lifetime Estimation of Hybrid DRAM-NVM Main Memories

NVMs have promising advantages (e.g., lower idle power, higher density) over the existing predominant main memory technology, DRAM. Yet, NVMs also have disadvantages (e.g., limited endurance). System architects are therefore examining hybrid DRAM-NVM main memories to enable the advantages of NVMs while avoiding the disadvantages as much as possible. Unfortunately, the hybrid memory design space is very large and complex due to the existence of very different types of NVMs and their rapidly-changing characteristics. Therefore, optimization of performance and lifetime of hybrid memory based computing platforms and their experimental evaluation using traditional simulation methods can be very time-consuming and sometimes even impractical. As such, it is necessary to develop a fast and flexible analytical model to estimate the performance and lifetime of hybrid memories on various workloads. This paper presents an analytical model for hybrid memories based on Markov decision processes. The proposed model estimates the hit ratio and lifetime for various configurations of DRAM-NVM hybrid main memories. Our model also provides accurate estimation of the effect of data migration policies on the hybrid memory hit ratio, one of the most important factors in hybrid memory performance and lifetime. Such an analytical model can aid designers to tune hybrid memory configurations to improve performance and/or lifetime. We present several optimizations that make our model more efficient while maintaining its accuracy. Our experimental evaluations show that the proposed model (a) accurately predicts the hybrid memory hit ratio with an average error of 4.61% on a commodity server, (b) accurately estimates the NVM lifetime with an average error of 2.93%, and (c) is on average 4x faster than conventional state-of-the-art simulation platforms for hybrid memories.

Read more
Hardware Architecture

An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration

In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, e.g. in the well-known ImageNet challenge. As a result, great effort is being put into developing fast and efficient accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor, which -- contrary to many existing architectures -- does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead it maps computations onto independent vector lanes making use of a carefully designed vector instruction set. The presented processor is targeted towards latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering state-of-the-art performance with proper flexibility within its target domain. Simulation results for several 2D convolutional layers from well known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16 bit fixed-point arithmetic. Compared to other well-known designs which are less flexible, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.

Read more
Hardware Architecture

An Approximate Carry Estimating Simultaneous Adder with Rectification

Approximate computing has in recent times found significant applications towards lowering power, area, and time requirements for arithmetic operations. Several works done in recent years have furthered approximate computing along these directions. In this work, we propose a new approximate adder that employs a carry prediction method. This allows parallel propagation of the carry allowing faster calculations. In addition to the basic adder design, we also propose a rectification logic which would enable higher accuracy for larger computations. Experimental results show that our adder produces results 91.2% faster than the conventional ripple-carry adder. In terms of accuracy, the addition of rectification logic to the basic design produces results that are more accurate than state-of-the-art adders like SARA and BCSA by 74%.

Read more
Hardware Architecture

An Embedded RISC-V Core with Fast Modular Multiplication

One of the biggest concerns in IoT is privacy and security. Encryption and authentication need big power budgets, which battery-operated IoT end-nodes do not have. Hardware accelerators designed for specific cryptographic operations provide little to no flexibility for future updates. Custom instruction solutions are smaller in area and provide more flexibility for new methods to be implemented. One drawback of custom instructions is that the processor has to wait for the operation to finish. Eventually, the response time of the device to real-time events gets longer. In this work, we propose a processor with an extended custom instruction for modular multiplication, which blocks the processor, typically, two cycles for any size of modular multiplication when used in Partial Execution mode. We adopted embedded and compressed extensions of RISC-V for our proof-of-concept CPU. Our design is benchmarked on recent cryptographic algorithms in the field of elliptic-curve cryptography. Our CPU with 128-bit modular multiplication operates at 136MHz on ASIC and 81MHz on FPGA. It achieves up to 13x speed up on software implementations while reducing overall power consumption by up to 95\% with 41\% average area overhead over our base architecture.

Read more
Hardware Architecture

An Energy-Efficient Heterogeneous Memory Architecture for Future Dark Silicon Embedded Chip-Multiprocessors

Main memories play an important role in overall energy consumption of embedded systems. Using conventional memory technologies in future designs in nanoscale era causes a drastic increase in leakage power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technologies offer many desirable characteristics such as near-zero leakage power, high density and non-volatility. They can significantly mitigate the issue of memory leakage power in future embedded chip-multiprocessor (eCMP) systems. However, they suffer from challenges such as limited write endurance and high write energy consumption which restrict them for adoption in modern memory systems. In this article, we present a convex optimization model to design a 3D stacked hybrid memory architecture in order to minimize the future embedded systems energy consumption in the dark silicon era. This proposed approach satisfies endurance constraint in order to design a reliable memory system. Our convex model optimizes numbers and placement of eDRAM and STT-RAM memory banks on the memory layer to exploit the advantages of both technologies in future eCMPs. Energy consumption, the main challenge in the dark silicon era, is represented as a major target in this work and it is minimized by the detailed optimization model in order to design a dark silicon aware 3D Chip-Multiprocessor. Experimental results show that in comparison with the Baseline memory design, the proposed architecture improves the energy consumption and performance of the 3D CMP on average about 61.33 and 9 percent respectively.

Read more
Hardware Architecture

An Energy-Efficient Low-Voltage Swing Transceiver for mW-Range IoT End-Nodes

As the Internet-of-Things (IoT) applications become more and more pervasive, IoT end nodes are requiring more and more computational power within a few mW of power envelope, coupled with high-speed and energy-efficient inter-chip communication to deal with the growing input/output and memory bandwidth for emerging near-sensor analytics applications. While traditional interfaces such as SPI cannot cope with these tight requirements, low-voltage swing transceivers can tackle this challenge thanks to their capability to achieve several Gbps of bandwidth at extremely low power. However, recent research on high-speed serial links addressed this challenge only partially, proposing only partial or stand-alone designs, and not addressing their integration in real systems and the related implications. In this paper, we present for the first time a complete design and system-level architecture of a low-voltage swing transceiver integrated within a low-power (mW range) IoT end-node processors, and we compare it with existing microcontroller interfaces. The transceiver, implemented in a commercial 65-nm CMOS technology achieves 10.2x higher energy efficiency at 15.7x higher performance than traditional microcontroller peripherals (single lane).

Read more
Hardware Architecture

An Ensemble Learning Approach for In-situ Monitoring of FPGA Dynamic Power

As field-programmable gate arrays become prevalent in critical application domains, their power consumption is of high concern. In this paper, we present and evaluate a power monitoring scheme capable of accurately estimating the runtime dynamic power of FPGAs in a fine-grained timescale, in order to support emerging power management techniques. In particular, we describe a novel and specialized ensemble model which can be decomposed into multiple customized decision-tree-based base learners. To aid in model synthesis, a generic computer-aided design flow is proposed to generate samples, select features, tune hyperparameters and train the ensemble estimator. Besides this, a hardware realization of the trained ensemble estimator is presented for on-chip real-time power estimation. In the experiments, we first show that a single decision tree model can achieve prediction error within 4.51% of a commercial gate-level power estimation tool, which is 2.41--6.07x lower than provided by the commonly used linear model. More importantly, we study the extra gains in inference accuracy using the proposed ensemble model. Experimental results reveal that the ensemble monitoring method can further improve the accuracy of power predictions to within a maximum error of 1.90%. Moreover, the lookup table (LUT) overhead of the ensemble monitoring hardware employing up to 64 base learners is within 1.22% of the target FPGA, indicating its light-weight and scalable characteristics.

Read more

Ready to get started?

Join us today