Featured Research

Hardware Architecture

Cross-Stack Workload Characterization of Deep Recommendation Systems

Deep learning-based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can yield up to a 15x speedup. To better understand the bottlenecks for further optimization, we examine both the software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.
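As a rough illustration of what an operator-level breakdown looks like in practice, the sketch below profiles a toy embedding-plus-MLP model with the PyTorch profiler. The model, feature sizes, and tooling are illustrative assumptions, not the paper's workloads or methodology.

```python
# Hypothetical sketch: operator-level CPU time breakdown of a toy
# embedding + MLP recommendation-style model. Assumes PyTorch >= 1.8.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

class ToyRecModel(nn.Module):
    def __init__(self, n_ids=10000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(n_ids, dim, mode="sum")   # sparse feature lookup
        self.mlp = nn.Sequential(nn.Linear(dim + 13, 128), nn.ReLU(),
                                 nn.Linear(128, 1))          # dense feature MLP

    def forward(self, dense, sparse_ids, offsets):
        pooled = self.emb(sparse_ids, offsets)
        return self.mlp(torch.cat([dense, pooled], dim=1))

model = ToyRecModel()
dense = torch.randn(256, 13)                      # batch of 256, 13 dense features
sparse_ids = torch.randint(0, 10000, (256 * 20,)) # 20 sparse ids per sample
offsets = torch.arange(0, 256 * 20, 20)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(dense, sparse_ids, offsets)

# Which operators (EmbeddingBag, addmm, ReLU, ...) dominate CPU time?
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```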

Read more
Hardware Architecture

CrossStack: A 3-D Reconfigurable RRAM Crossbar Inference Engine

Deep neural network inference accelerators are rapidly growing in importance as we turn to massively parallelized processing beyond GPUs and ASICs. The dominant operation in feedforward inference is the multiply-and-accumulate process, where each column in a crossbar generates the current response of a single neuron. As a result, memristor crossbar arrays parallelize inference and image processing tasks very efficiently. In this brief, we present a 3-D active memristor crossbar array, 'CrossStack', which adopts stacked pairs of Al/TiO2/TiO2-x/Al devices with common middle electrodes. By designing the CMOS-memristor hybrid cells used in the layout of the array, CrossStack can operate in one of two user-configurable modes as a reconfigurable inference engine: 1) expansion mode and 2) deep-net mode. In expansion mode, the resolution of the network is doubled by increasing the number of inputs for a given chip area, reducing IR drop by 22%. In deep-net mode, inference speed per 10-bit convolution is improved by 29% by simultaneously using one TiO2/TiO2-x layer for read processes and the other for write processes. We experimentally verify both modes on our 10×10×2 array.
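To make the multiply-and-accumulate idea concrete, the sketch below models an idealized crossbar as a conductance matrix; the array size and device values are placeholders, not measurements from the fabricated CrossStack device.

```python
# Minimal sketch: an idealized memristor crossbar modeled as a conductance
# matrix, where applying input voltages to the rows makes each column's
# output current a multiply-and-accumulate result.
import numpy as np

rows, cols = 10, 10                                # assume a 10x10 crossbar layer
G = np.random.uniform(1e-6, 1e-4, (rows, cols))    # device conductances (siemens)
V = np.random.uniform(0.0, 0.2, rows)              # input voltages on the rows

# Kirchhoff's current law: each column current is sum_i V[i] * G[i, j],
# i.e. one neuron's weighted sum computed in a single analog step.
I = V @ G
print(I.shape)   # (10,) -- one accumulated current per column/neuron
```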

Read more
Hardware Architecture

Custom Tailored Suite of Random Forests for Prefetcher Adaptation

To close the gap between memory and processors, and in turn improve performance, there has been an abundance of work in the area of data/instruction prefetcher design. Prefetchers are deployed in each level of the memory hierarchy, but typically each prefetcher is designed without comprehensively accounting for the other prefetchers in the system. As a result, these individual prefetcher designs do not always complement each other, which leads to low average performance gains and/or many negative outliers. In this work, we propose SuitAP (Suite of random forests for Adaptation of Prefetcher system configuration), a hardware prefetcher adapter that uses a suite of random forests to determine at runtime which prefetcher should be ON at each memory level, such that they complement each other. Compared to a design with no prefetchers, SuitAP improves IPC by 46% on average across traces generated from the SPEC2017 suite, with a 12KB overhead. Moreover, SuitAP also reduces negative outliers.
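The sketch below shows the general shape of a per-level random-forest policy: one forest per memory level predicts whether that level's prefetcher should be enabled from runtime counters. The feature set, labels, and forest sizes are made-up assumptions, not SuitAP's actual design.

```python
# Hedged illustration: one random forest per memory level predicts whether
# that level's prefetcher should be ON, given sampled runtime counters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical runtime features: [L1 miss rate, L2 miss rate, LLC miss rate, IPC]
X_train = rng.random((1000, 4))
# Hypothetical offline labels: 1 if enabling that prefetcher helped on the sample
y_l1 = (X_train[:, 0] > 0.5).astype(int)
y_l2 = (X_train[:, 1] > 0.5).astype(int)

forests = {
    "L1": RandomForestClassifier(n_estimators=20, max_depth=5).fit(X_train, y_l1),
    "L2": RandomForestClassifier(n_estimators=20, max_depth=5).fit(X_train, y_l2),
}

sample = rng.random((1, 4))               # counters observed at runtime
config = {lvl: bool(rf.predict(sample)[0]) for lvl, rf in forests.items()}
print(config)                             # e.g. {'L1': True, 'L2': False}
```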

Read more
Hardware Architecture

Customizing Trusted AI Accelerators for Efficient Privacy-Preserving Machine Learning

The use of trusted hardware has become a promising solution to enable privacy-preserving machine learning. In particular, users can upload their private data and models to a hardware-enforced trusted execution environment (e.g., an enclave in Intel SGX-enabled CPUs) and run machine learning tasks in it with confidentiality and integrity guaranteed. To improve performance, AI accelerators have been widely employed for modern machine learning tasks. However, how to protect privacy on an AI accelerator remains an open question. To address this question, we propose a solution for efficient privacy-preserving machine learning based on an unmodified trusted CPU and a customized trusted AI accelerator. We carefully leverage cryptographic primitives to establish trust and protect the channel between the CPU and the accelerator. As a case study, we demonstrate our solution based on the open-source Versatile Tensor Accelerator (VTA). The evaluation results show that the proposed solution provides efficient privacy-preserving machine learning at a small design cost and moderate performance overhead.
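The paper's exact protocol is not spelled out here, so the sketch below only illustrates the generic pattern of protecting a CPU-accelerator channel: an ephemeral key exchange followed by authenticated encryption, using the Python `cryptography` package. All names and messages are assumptions for illustration.

```python
# Hedged sketch: establishing an authenticated, encrypted channel between the
# CPU-side enclave and the accelerator via X25519 key exchange and AES-GCM.
import os
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Each side generates an ephemeral key pair and exchanges public keys
# (in practice the accelerator's key would need to be attested/certified).
cpu_priv, acc_priv = X25519PrivateKey.generate(), X25519PrivateKey.generate()
shared_cpu = cpu_priv.exchange(acc_priv.public_key())
shared_acc = acc_priv.exchange(cpu_priv.public_key())
assert shared_cpu == shared_acc

key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
           info=b"cpu-accelerator-channel").derive(shared_cpu)

# The enclave encrypts a command/tensor before it crosses the untrusted bus.
aead, nonce = AESGCM(key), os.urandom(12)
ciphertext = aead.encrypt(nonce, b"load_weights layer0", b"header")
print(aead.decrypt(nonce, ciphertext, b"header"))   # accelerator side
```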

Read more
Hardware Architecture

CuttleSys: Data-Driven Resource Management for Interactive Applications on Reconfigurable Multicores

Multi-tenancy for latency-critical applications leads to resource interference and unpredictable performance. Core reconfiguration opens up more opportunities for colocation, as it allows the hardware to adjust to the dynamic performance and power needs of a specific mix of co-scheduled applications. However, reconfigurability also introduces challenges, as even for a small number of reconfigurable cores, exploring the design space becomes more time- and resource-demanding. We present CuttleSys, a runtime for reconfigurable multicores that leverages scalable and lightweight data mining to quickly identify suitable core and cache configurations for a set of co-scheduled applications. The runtime combines collaborative filtering, to infer the behavior of each job on every core and cache configuration, with Dynamically Dimensioned Search, to efficiently explore the configuration space. We evaluate CuttleSys on multicores with tens of reconfigurable cores and show up to 2.46x and 1.55x performance improvements compared to core-level gating and oracle-like asymmetric multicores, respectively, under stringent power constraints.
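For intuition on the search side, the sketch below implements a plain Dynamically Dimensioned Search over a small discrete configuration space. The knob set and the stand-in scoring function are hypothetical; in CuttleSys the scores would come from the collaborative-filtering predictor.

```python
# Minimal sketch of Dynamically Dimensioned Search (DDS) over a discrete
# core/cache configuration space, using a placeholder performance model.
import math, random

# Hypothetical knobs: a per-job core setting for 4 jobs plus one cache setting
levels = [4, 4, 4, 4, 8]              # number of settings per dimension
def predicted_perf(cfg):              # placeholder model, NOT the paper's predictor
    return -sum((c - l // 2) ** 2 for c, l in zip(cfg, levels))

def dds(max_iters=200, seed=1):
    random.seed(seed)
    best = [random.randrange(l) for l in levels]
    best_score = predicted_perf(best)
    for i in range(1, max_iters + 1):
        # Probability of perturbing each dimension shrinks as the search matures
        p = 1.0 - math.log(i) / math.log(max_iters)
        cand = list(best)
        dims = [d for d in range(len(levels)) if random.random() < p]
        if not dims:
            dims = [random.randrange(len(levels))]
        for d in dims:
            cand[d] = random.randrange(levels[d])   # jump to another setting
        score = predicted_perf(cand)
        if score > best_score:                      # greedy accept
            best, best_score = cand, score
    return best, best_score

print(dds())
```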

Read more
Hardware Architecture

Cycle-Accurate Evaluation of Software-Hardware Co-Design of Decimal Computation in RISC-V Ecosystem

Software-hardware co-design solutions for decimal computation can provide several Pareto points, in terms of hardware cost and performance, for the development of embedded systems. This paper demonstrates how to accurately evaluate such co-design solutions using the RISC-V ecosystem. In a software-hardware co-design solution, part of the solution requires dedicated hardware. In our evaluation framework, we develop new decimal-oriented instructions supported by an accelerator. The framework enables cycle-accurate analysis of both performance and hardware overhead for decimal-computation co-design solutions. The obtained performance results are compared with estimates based on dummy functions.
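As a functional illustration of the kind of primitive a decimal-oriented instruction might accelerate, the sketch below performs digit-serial BCD addition in software; it is a behavioral model for intuition only, not the paper's instruction set or hardware.

```python
# Hedged illustration: digit-serial BCD (binary-coded decimal) addition,
# a typical decimal primitive that dedicated hardware could implement.
def bcd_add(a_digits, b_digits):
    """Add two little-endian lists of decimal digits, one digit per step."""
    out, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        da = a_digits[i] if i < len(a_digits) else 0
        db = b_digits[i] if i < len(b_digits) else 0
        s = da + db + carry
        out.append(s % 10)        # corrected digit (the "+6 adjust" in binary BCD)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

# 379 + 846 = 1225, digits stored least-significant first
print(bcd_add([9, 7, 3], [6, 4, 8]))   # [5, 2, 2, 1]
```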

Read more
Hardware Architecture

Cyclic Sequence Generators as Program Counters for High-Speed FPGA-based Processors

This paper compares the performance of conventional radix-2 program counters with program counters based on Feedback Shift Registers (FSRs), a class of cyclic sequence generators. FSR counters have constant-time scaling with bit-width N, whereas FPGA-based radix-2 counters typically have O(N) time complexity due to the carry chain. Program counter performance is measured by synthesis of standalone counter circuits, as well as synthesis of three FPGA-based processor designs modified to incorporate FSR program counters. Hybrid counters, combining both an FSR and a radix-2 counter, are presented as a solution to the potential cache-coherency issues of FSR program counters. Results show that high-speed processor designs benefit more from FSR program counters, allowing both greater operating frequency and the use of fewer logic resources.
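The contrast in next-state logic is easy to see in software: a radix-2 counter needs a carry to ripple across all N bits, while an LFSR-style counter only XORs a fixed handful of taps. The sketch below uses an 8-bit maximal-length LFSR as an example; the tap choice is a standard one, not taken from the paper.

```python
# Minimal sketch contrasting radix-2 increment (carry chain across N bits)
# with a Fibonacci LFSR step (a fixed XOR of a few taps, independent of N).
N = 8
TAPS = (7, 5, 4, 3)   # tap bit positions for a maximal-length 8-bit LFSR

def radix2_next(state):
    return (state + 1) & ((1 << N) - 1)          # carry chain in hardware

def lfsr_next(state):
    fb = 0
    for t in TAPS:                               # constant number of XORs
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << N) - 1)

# The maximal-length LFSR visits 2^N - 1 distinct non-zero states before repeating.
state, seen = 1, set()
while state not in seen:
    seen.add(state)
    state = lfsr_next(state)
print(len(seen))   # 255
```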

Read more
Hardware Architecture

DAMO: Deep Agile Mask Optimization for Full Chip Scale

Continuous scaling of VLSI systems poses great challenges for manufacturing, and optical proximity correction (OPC) is widely applied in conventional design flows for manufacturability optimization. Traditional techniques conduct OPC by leveraging a lithography model and suffer from prohibitive computational overhead; they also mostly focus on optimizing a single clip without addressing how to tackle the full chip. In this paper, we present DAMO, a high-performance and scalable deep learning-enabled OPC system for full-chip scale. It is an end-to-end mask optimization paradigm that contains a Deep Lithography Simulator (DLS) for lithography modeling and a Deep Mask Generator (DMG) for mask pattern generation. Moreover, a novel layout splitting algorithm customized for DAMO is proposed to handle the full-chip OPC problem. Extensive experiments show that DAMO outperforms state-of-the-art OPC solutions from both academia and an industrial commercial toolkit.
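To show the general idea behind going from clip-level to full-chip OPC, the sketch below tiles a layout into overlapping windows with a halo so each clip still sees its neighbors' geometry; the clip and halo sizes are arbitrary, and DAMO's actual splitting algorithm is more sophisticated than this.

```python
# Hedged sketch: split a full-chip layout into overlapping clips so each
# window can be optimized independently and stitched back together.
import numpy as np

def split_into_clips(layout, clip=256, halo=32):
    """Yield (row, col, window) tiles with a halo around each clip."""
    h, w = layout.shape
    for r in range(0, h, clip):
        for c in range(0, w, clip):
            r0, c0 = max(r - halo, 0), max(c - halo, 0)
            r1, c1 = min(r + clip + halo, h), min(c + clip + halo, w)
            yield r, c, layout[r0:r1, c0:c1]

full_chip = np.random.randint(0, 2, (1024, 1024), dtype=np.uint8)  # toy binary layout
clips = list(split_into_clips(full_chip))
print(len(clips))   # 16 clips for a 1024x1024 layout with 256-pixel clips
```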

Read more
Hardware Architecture

DB4HLS: A Database of High-Level Synthesis Design Space Explorations

High-Level Synthesis (HLS) frameworks make it easy to specify a large number of variants of the same hardware design by only acting on optimization directives. Nonetheless, synthesizing implementations for all possible combinations of directive values is impractical even for simple designs. Addressing this shortcoming, many HLS Design Space Exploration (DSE) strategies have been proposed to devise directive settings that lead to high-quality implementations while limiting the number of synthesis runs. All these works require considerable effort to validate the proposed strategies and/or to build the knowledge base employed to tune abstract models, as both tasks mandate the synthesis of large collections of implementations. Currently, such data gathering is performed ad hoc, a) leading to a lack of standardization and hampering comparisons between DSE alternatives, and b) posing a very high burden on researchers willing to develop novel DSE strategies. Against this backdrop, we here introduce DB4HLS, a database of exhaustive HLS explorations comprising more than 100,000 design points collected over 4 years of synthesis time. The open structure of DB4HLS allows the incremental integration of new DSEs, which can be easily defined with a dedicated domain-specific language. We believe that our database, available at this https URL, will be a valuable tool for the research community investigating automated strategies for the optimization of HLS-based hardware designs.
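A typical use of such a database is to extract, for one benchmark, the Pareto-optimal trade-off points a DSE strategy should recover. The sketch below does this over a handful of invented (area, latency) design points; the directive names and numbers are hypothetical, not records from DB4HLS.

```python
# Illustrative sketch: keep only the Pareto-optimal (area, latency)
# implementations among synthesized design points for one benchmark.
design_points = [
    {"directives": "unroll=1,pipeline=off", "area": 900,  "latency": 1800},
    {"directives": "unroll=2,pipeline=off", "area": 1200, "latency": 950},
    {"directives": "unroll=4,pipeline=off", "area": 1600, "latency": 1000},  # dominated
    {"directives": "unroll=2,pipeline=on",  "area": 1500, "latency": 610},
    {"directives": "unroll=4,pipeline=on",  "area": 2100, "latency": 430},
]

def pareto_front(points):
    front = []
    for p in points:
        dominated = any(q["area"] <= p["area"] and q["latency"] <= p["latency"]
                        and (q["area"] < p["area"] or q["latency"] < p["latency"])
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["area"])

for p in pareto_front(design_points):
    print(p["directives"], p["area"], p["latency"])
```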

Read more
Hardware Architecture

DMR-based Technique for Fault Tolerant AES S-box Architecture

This paper presents a high-throughput, fault-resilient hardware implementation of the AES S-box, called HFS-box. If a transient fault, whether natural or malicious, is detected in any pipeline stage, the corresponding error signal goes high and, as a result, the control unit holds the output of the proposed DMR voter until the fault effect disappears. The proposed low-cost HFS-box provides a high degree of tolerance against transient faults of any duration, at the cost of a 137% area overhead and an 11.3% throughput degradation relative to the original implementation.
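The hold-on-error behavior can be summarized with a small behavioral model: two redundant copies of the S-box are compared, and on mismatch the voter flags an error and keeps its last known-good output. This is only a sketch of the general DMR-with-hold idea, not the HFS-box RTL, and the lookup table below is a placeholder rather than the real AES S-box.

```python
# Minimal behavioral sketch of a DMR voter that holds its last known-good
# output while the two replicas disagree.
AES_SBOX = list(range(256))          # placeholder table; the real AES S-box differs

def sbox_a(x): return AES_SBOX[x]
def sbox_b(x): return AES_SBOX[x]    # second replica of the same logic

class DMRVoter:
    def __init__(self):
        self.held = 0                 # last known-good output

    def step(self, x, inject_fault=False):
        a, b = sbox_a(x), sbox_b(x)
        if inject_fault:
            a ^= 0x01                 # model a transient fault in one replica
        error = a != b
        if not error:
            self.held = a             # update only when the replicas agree
        return self.held, error

v = DMRVoter()
print(v.step(0x10))                    # (16, False) -- normal operation
print(v.step(0x20, inject_fault=True)) # (16, True)  -- output held, error flagged
print(v.step(0x20))                    # (32, False) -- fault cleared, output resumes
```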

Read more
