Computer Science Hardware Architecture - Researchain

Featured Researches

Area Optimized Quasi Delay Insensitive Majority Voter for TMR Applications

Mission-critical and safety-critical applications generally tend to incorporate triple modular redundancy (TMR) to embed fault tolerance in their physical implementations. In a TMR realization, an original function block, which may be a circuit or a system, and two exact copies of the function block are used to successfully overcome any temporary fault or permanent failure of an arbitrary function block during the routine operation. The corresponding outputs of the function blocks are majority voted using 3-input majority voters whose outputs define the outputs of a TMR realization. Hence, a 3-input majority voter forms an important component of a TMR realization. Many synchronous majority voters and an asynchronous non-delay insensitive majority voter have been presented in the literature. Recently, quasi delay insensitive (QDI) asynchronous majority voters for TMR applications were also discussed in the literature. In this regard, this paper presents a new QDI asynchronous majority voter for TMR applications, which is better optimized in area compared to the existing QDI majority voters. The proposed QDI majority voter requires 30.2% less area compared to the best of the existing QDI majority voters, and this could be useful for resource-constrained fault tolerance applications. The example QDI TMR circuits were implemented using a 32/28nm complementary metal oxide semiconductor (CMOS) process. The delay insensitive dual rail code was used for data encoding, and 4-phase return-to-zero and return-to-one handshake protocols were used for data communication.

Hardware Architecture

Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes

A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm Globalfoundries GF22FDX (GF22FDX) technology, coupled with a stateof-the-art (SoA) microcontroller to an embedded Field Programmable Gate Array (FPGA). We demonstrate the flexibility of the System-OnChip (SoC) to tackle the challenges of many emerging IoT applications, such as (i) interfacing sensors and accelerators with non-standard interfaces, (ii) performing on-the-fly pre-processing tasks on data streamed from peripherals, and (iii) accelerating near-sensor analytics, encryption, and machine learning tasks. A unique feature of the proposed SoC is the exploitation of body-biasing to reduce leakage power of the embedded FPGA (eFPGA) fabric by up to 18x at 0.5 V, achieving SoA state bitstream-retentive sleep power for the eFPGA fabric, as low as 20.5 uW. The proposed SoC provides 3.4x better performance and 2.9x better energy efficiency than other fabricated heterogeneous re-configurable SoCs of the same class.

Hardware Architecture

Arsenal of Hardware Prefetchers

Hardware prefetching is one of the latency tolerance optimization techniques that tolerate costly DRAM accesses. Though hardware prefetching is one of the fundamental mechanisms prevalent on most of the commercial machines, there is no prefetching technique that works well across all the access patterns and different types of workloads. Through this paper, we propose Arsenal, a prefetching framework which allows the advantages provided by different data prefetchers to be combined, by dynamically selecting the best-suited prefetcher for the current workload. Thus effectively improving the versatility of the prefetching system. It bases on the classic Sandbox prefetcher that dynamically adapts and utilizes multiple offsets for sequential prefetchers. We take it to the next step by switching between prefetchers like Multi look Ahead Offset Prefetching and Timing SKID Prefetcher on the run. Arsenal utilizes a space-efficient pooling filter, Bloom filters, that keeps track of useful prefetches of each of these component prefetchers and thus helps to maintain a score for each of the component prefetchers. This approach is shown to provide better speedup than anyone prefetcher alone. Arsenal provides a performance improvement of 44.29% on the single-core mixes and 19.5% for some of the selected 25 representative multi-core mixes.

Hardware Architecture

Asynchronous Early Output Block Carry Lookahead Adder with Improved Quality of Results

A new asynchronous early output block carry lookahead adder (BCLA) incorporating redundant carries is proposed. Compared to the best of existing semi-custom asynchronous carry lookahead adders (CLAs) employing delay-insensitive data encoding and following a 4-phase handshaking, the proposed BCLA with redundant carries achieves 13% reduction in forward latency and 14.8% reduction in cycle time compared to the best of the existing CLAs featuring redundant carries with no area or power penalty. A hybrid variant involving a ripple carry adder (RCA) in the least significant stages i.e. BCLA-RCA is also considered that achieves a further 4% reduction in the forward latency and a 2.4% reduction in the cycle time compared to the proposed BCLA featuring redundant carries without area or power penalties.

Hardware Architecture

AutoDSE: Enabling Software Programmers Design Efficient FPGA Accelerators

Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve the optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. In this paper, we address this problem by incorporating an automated DSE framework-AutoDSE- that leverages bottleneck-guided gradient optimizer to systematically find abetter design point. AutoDSE finds the bottleneck of the design in each step and focuses on high-impact parameters to overcome that, which is similar to the approach an expert would take. The experimental results show that AutoDSE is able to find the design point that achieves, on the geometric mean, 19.9x speedup over one CPU core for Machsuite and Rodinia benchmarks and 1.04x over the manually designed HLS accelerated vision kernels in Xilinx Vitis libraries yet with 26x reduction of their optimization pragmas. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.

Hardware Architecture

Automated Circuit Approximation Method Driven by Data Distribution

We propose an application-tailored data-driven fully automated method for functional approximation of combinational circuits. We demonstrate how an application-level error metric such as the classification accuracy can be translated to a component-level error metric needed for an efficient and fast search in the space of approximate low-level components that are used in the application. This is possible by employing a weighted mean error distance (WMED) metric for steering the circuit approximation process which is conducted by means of genetic programming. WMED introduces a set of weights (calculated from the data distribution measured on a selected signal in a given application) determining the importance of each input vector for the approximation process. The method is evaluated using synthetic benchmarks and application-specific approximate MAC (multiply-and-accumulate) units that are designed to provide the best trade-offs between the classification accuracy and power consumption of two image classifiers based on neural networks.

Hardware Architecture

Automated Floorplanning for Partially Reconfigurable Designs on Heterogenrous FPGAs

Floorplanning problem has been extensively explored for homogeneous FPGAs. Most modern FPGAs consist of heterogeneous resources in the form of configurable logic blocks, DSP blocks, BRAMs and more. Very little work has been done for heterogeneous FPGAs. In addition, features like partial reconfigurability allow on-the-fly changes to the executable design that can result in enhanced performance and very efficient utilization of resources. In this paper, we have designed a floorplanner for Partially Reconfigurable (PR) designs in FPGA that smartly decides one of the three proposed resource allocation schemes to floorplan a particular type of reconfigurable region. We also propose a White Space Detection algorithm for efficient management of white space inside an FPGA in order to reduce the area and the wire length. The floorplanner is demonstrated on Xilinx Virtex 5 and Artix 7 FPGA architectures and can be easily integrated with existing vendor-supplied Place and Route tools. The main objective of the floorplanner is to reduce the wire length, minimize wasted resources and the area. The performance of our floorplanner is evaluated using MCNC benchmarks. We have compared our proposed floorplanner with other previously published results reported in the literature. We observe a substantial improvement in the overall wire length as well as the execution time. Also, the floorplanner was integrated with vendor supplied place and route tools (Xilinx Vivado) to automate the floorplanning flow. The automation process was tested on a partially reconfigurable median filter used in image processing applications.

Hardware Architecture

Automatic Conversion from Flip-flop to 3-phase Latch-based Designs

Latch-based designs have many benefits over their flip-flop based counterparts but have limited use partially because most RTL specifications are flop-centric and automatic conversion of FF to latch-based designs is challenging. Conventional conversion algorithms target master-slave latch-based designs with two non-overlapping clocks. This paper presents a novel automated design flow that converts flip-flop to 3-phase latch-based designs. The resulting circuits have the same performance as the master-slave based designs but require significantly less latches. Our experimental results demonstrate the potential for savings in the number of latches (21.3%), area (5.8%), and power (16.3%) on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to the master-slave conversions.

Hardware Architecture

Automatic Microprocessor Performance Bug Detection

Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work, we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.

Hardware Architecture

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low-power and high-speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed under some benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, e.g., compared to a recently published work in this field, the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.

Ready to get started?

Join us today

Archive Your Research