Computer Science Hardware Architecture - Researchain

Featured Researches

Accelerating Genome Analysis: A Primer on an Ongoing Journey

Genome analysis fundamentally starts with a process known as read mapping, where sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than the computational techniques employed to analyze the genome. We describe the ongoing journey in significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.

Hardware Architecture

Accelerating Recommender Systems via Hardware "scale-in"

In today's era of "scale-out", this paper makes the case that a specialized hardware architecture based on "scale-in"--placing as many specialized processors as possible along with their memory systems and interconnect links within one or two boards in a rack--would offer the potential to boost large recommender system throughput by 12-62x for inference and 12-45x for training compared to the DGX-2 state-of-the-art AI platform, while minimizing the performance impact of distributing large models across multiple processors. By analyzing Facebook's representative model--Deep Learning Recommendation Model (DLRM)--from a hardware architecture perspective, we quantify the impact on throughput of hardware parameters such as memory system design, collective communications latency and bandwidth, and interconnect topology. By focusing on conditions that stress hardware, our analysis reveals limitations of existing AI accelerators and hardware platforms.

Hardware Architecture

Accelerating Transient Fault Injection Campaigns by using Dynamic HDL Slicing

Along with the complexity of electronic systems for safety-critical applications, the cost of safety mechanisms evaluation by fault injection simulation is rapidly going up. To reduce these efforts, we propose a fault injection methodology where Hardware Description Language (HDL) code slicing is exploited to accelerate transient fault injection campaigns by pruning fault lists and reducing the number of the injections. In particular, the dynamic HDL slicing technique provides for a critical fault list and allows avoiding injections at non-critical time-steps. Experimental results on an industrial core show that the proposed methodology can successfully reduce the number of injections by up to 10 percent and speed-up the fault injection campaigns.

Hardware Architecture

Accelerating Viterbi Algorithm using Custom Instruction Approach

In recent years, the decoding algorithms in communication networks are becoming increasingly complex aiming to achieve high reliability in correctly decoding received messages. These decoding algorithms involve computationally complex operations requiring high performance computing hardware, which are generally expensive. A cost-effective solution is to enhance the Instruction Set Architecture (ISA) of the processors by creating new custom instructions for the computational parts of the decoding algorithms. In this paper, we propose to utilize the custom instruction approach to efficiently implement the widely used Viterbi decoding algorithm by adding the assembly language instructions to the ISA of DLX, PicoJava II and NIOS II processors, which represent RISC, stack and FPGA-based soft-core processor architectures, respectively. By using the custom instruction approach, the execution time of the Viterbi algorithm is significantly improved by approximately 3 times for DLX and PicoJava II, and by 2 times for NIOS II.

Hardware Architecture

Achieving Multi-Port Memory Performance on Single-Port Memory with Coding Techniques

Many performance critical systems today must rely on performance enhancements, such as multi-port memories, to keep up with the increasing demand of memory-access capacity. However, the large area footprints and complexity of existing multi-port memory designs limit their applicability. This paper explores a coding theoretic framework to address this problem. In particular, this paper introduces a framework to encode data across multiple single-port memory banks in order to {\em algorithmically} realize the functionality of multi-port memory. This paper proposes three code designs with significantly less storage overhead compared to the existing replication based emulations of multi-port memories. To further improve performance, we also demonstrate a memory controller design that utilizes redundancy across coded memory banks to more efficiently schedule read and write requests sent across multiple cores. Furthermore, guided by DRAM traces, the paper explores {\em dynamic coding} techniques to improve the efficiency of the coding based memory design. We then show significant performance improvements in critical word read and write latency in the proposed coded-memory design when compared to a traditional uncoded-memory design.

Hardware Architecture

Adaptive Multi-bit SRAM Topology Based Analog PUF

Physically Unclonable Functions (PUFs) are lightweight cryptographic primitives for generating unique signatures from minuscule manufacturing variations. In this work, we present lightweight, area efficient and low power adaptive multi-bit SRAM topology based Current Mirror Array (CMA) analog PUF design for securing the sensor nodes, authentication and key generation. The proposed Strong PUF increases the complexity of the machine learning attacks thus making it difficult for the adversary. The design is based on scl180 library.

Hardware Architecture

Addressing Resiliency of In-Memory Floating Point Computation

In-memory computing (IMC) can eliminate the data movement between processor and memory which is a barrier to the energy-efficiency and performance in Von-Neumann computing. Resistive RAM (RRAM) is one of the promising devices for IMC applications (e.g. integer and Floating Point (FP) operations and random logic implementation) due to low power consumption, fast operation, and small footprint in crossbar architecture. In this paper, we propose FAME, a pipelined FP arithmetic (adder/subtractor) using RRAM crossbar based IMC. A novel shift circuitry is proposed to lower the shift overhead during FP operations. Since 96% of the RRAMs used in our architecture are in High Resistance State (HRS), we propose two approaches namely Shift-At-The-Output (SATO) and Force To VDD (FTV) (ground (FTG)) to mitigate Stuck-at-1 (SA1) failures. In both techniques, the fault-free RRAMs are exploited to perform the computation by using an extra clock cycle. Although performance degrades by 50%, SATO can handle 50% of the faults whereas FTV can handle 99% of the faults in the RRAM-based compute array at low power and area overhead. Simulation results show that the proposed single precision FP adder consumes 335 pJ and 322 pJ for NAND-NAND and NOR-NOR based implementations, respectively. The area overheads of SATO and FTV are 28.5% and 9.5%, respectively.

Hardware Architecture

Addressing Variability in Reuse Prediction for Last-Level Caches

Last-Level Cache (LLC) represents the bulk of a modern CPU processor's transistor budget and is essential for application performance as LLC enables fast access to data in contrast to much slower main memory. However, applications with large working set size often exhibit streaming and/or thrashing access patterns at LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization of the LLC capacity. To improve cache efficiency, the state-of-the-art cache management techniques employ prediction mechanisms that learn from the past access patterns with an aim to accurately identify as many dead blocks as possible. Once identified, dead blocks are evicted from LLC to make space for potentially high reuse cache blocks. In this thesis, we identify variability in the reuse behavior of cache blocks as the key limiting factor in maximizing cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable due to numerous factors that are outside the control of LLC. The sources of variability include control-flow variation, speculative execution and contention from cores sharing the cache, among others. Variability in reuse prediction challenges existing techniques in reliably identifying the end of a block's useful lifetime, thus causing lower prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design robust cache management mechanisms and policies for LLC in the face of variability in reuse prediction to minimize cache misses, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, to improve cache efficiency by addressing variability in reuse prediction.

Hardware Architecture

Addressing multiple bit/symbol errors in DRAM subsystem

As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults in DRAM subsystem are becoming more severe. Current servers mostly use CHIPKILL based schemes to tolerate up-to one/two symbol errors per DRAM beat. Multi-symbol errors arising due to faults in multiple data buses and chips may not be detected by these schemes. In this paper, we introduce Single Symbol Correction Multiple Symbol Detection (SSCMSD) - a novel error handling scheme to correct single-symbol errors and detect multi-symbol errors. Our scheme makes use of a hash in combination with Error Correcting Code (ECC) to avoid silent data corruptions (SDCs). SSCMSD can also enhance the capability of detecting errors in address bits. We employ 32-bit CRC along with Reed-Solomon code to implement SSCMSD for a x4 based DDRx system. Our simulations show that the proposed scheme effectively prevents SDCs in the presence of multiple symbol errors. Our novel design enabled us to achieve this without introducing additional READ latency. Also, we need 19 chips per rank (storage overhead of 18.75 percent), 76 data bus-lines and additional hash-logic at the memory controller.

Hardware Architecture

Agile SoC Development with Open ESP

ESP is an open-source research platform for heterogeneous SoC design. The platform combines a modular tile-based architecture with a variety of application-oriented flows for the design and optimization of accelerators. The ESP architecture is highly scalable and strikes a balance between regularity and specialization. The companion methodology raises the level of abstraction to system-level design and enables an automated flow from software and hardware development to full-system prototyping on FPGA. For application developers, ESP offers domain-specific automated solutions to synthesize new accelerators for their software and to map complex workloads onto the SoC architecture. For hardware engineers, ESP offers automated solutions to integrate their accelerator designs into the complete SoC. Conceived as a heterogeneous integration platform and tested through years of teaching at Columbia University, ESP supports the open-source hardware community by providing a flexible platform for agile SoC development.

Ready to get started?

Join us today

Archive Your Research