Featured Researches

Hardware Architecture

A Survey on Recent Hardware Data Prefetching Approaches with An Emphasis on Servers

Data prefetching, i.e., the act of predicting application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely-used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academy: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we discuss the fundamental concepts in data prefetching and study state-of-the-art hardware data prefetching approaches. Additional Key Words and Phrases: Data Prefetching, Scale-Out Workloads, Server Processors, and Spatio-Temporal Correlation.

Read more
Hardware Architecture

A Survey on Tiering and Caching in High-Performance Storage Systems

Although every individual invented storage technology made a big step towards perfection, none of them is spotless. Different data store essentials such as performance, availability, and recovery requirements have not met together in a single economically affordable medium, yet. One of the most influential factors is price. So, there has always been a trade-off between having a desired set of storage choices and the costs. To address this issue, a network of various types of storing media is used to deliver the high performance of expensive devices such as solid state drives and non-volatile memories, along with the high capacity of inexpensive ones like hard disk drives. In software, caching and tiering are long-established concepts for handling file operations and moving data automatically within such a storage network and manage data backup in low-cost media. Intelligently moving data around different devices based on the needs is the key insight for this matter. In this survey, we discuss some recent pieces of research that have been done to improve high-performance storage systems with caching and tiering techniques.

Read more
Hardware Architecture

A Unified Learning Platform for Dynamic Frequency Scaling in Pipelined Processors

A machine learning (ML) design framework is proposed for dynamically adjusting clock frequency based on propagation delay of individual instructions. A Random Forest model is trained to classify propagation delays in real-time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within a baseline processor. The modified system is simulated at the gate-level in 45 nm CMOS technology, exhibiting a speed-up of 68% and energy reduction of 37% with coarse-grained ML classification. A speed-up of 95% is demonstrated with finer granularities at additional energy costs.

Read more
Hardware Architecture

A Unified Model for Gate Level Propagation Analysis

Classic hardware verification techniques (e.g., X-propagation and fault-propagation) and more recent hardware security verification techniques based on information flow tracking (IFT) aim to understand how information passes, affects, and otherwise modifies a circuit. These techniques all have separate usage scenarios, but when dissected into their core functionality, they relate in a fundamental manner. In this paper, we develop a common framework for gate level propagation analysis. We use our model to generate synthesizable propagation logic to use in standard EDA tools. To justify our model, we prove that Precise Hardware IFT is equivalent to gate level X-propagation and imprecise fault propagation. We also show that the difference between Precise Hardware IFT and fault propagation is not significant for 74X-series and '85 ISCAS benchmarks with more than 313 gates and the difference between imprecise hardware IFT and Precise Hardware IFT is almost always significant regardless of size.

Read more
Hardware Architecture

A Variable Vector Length SIMD Architecture for HW/SW Co-designed Processors

Hardware/Software (HW/SW) co-designed processors provide a promising solution to the power and complexity problems of the modern microprocessors by keeping their hardware simple. Moreover, they employ several runtime optimizations to improve the performance. One of the most potent optimizations, vectorization, has been utilized by modern microprocessors, to exploit the data level parallelism through SIMD accelerators. Due to their hardware simplicity, these accelerators have evolved in terms of width from 64-bit vectors in Intel MMX to 512-bit wide vector units in Intel Xeon Phi and AVX-512. Although SIMD accelerators are simple in terms of hardware design, code generation for them has always been a challenge. Moreover, increasing vector lengths with each new generation add to this complexity. This paper explores the scalability of SIMD accelerators from the code generation point of view. We discover that the SIMD accelerators remain underutilized at higher vector lengths mainly due to: a) reduced dynamic instruction stream coverage for vectorization and b) increase in permutations. Both of these factors can be attributed to the rigidness of the SIMD architecture. We propose a novel SIMD architecture that possesses the flexibility needed to support higher vector lengths. Furthermore, we propose Variable Length Vectorization and Selective Writing in a HW/SW co-designed environment to transparently target the flexibility of the proposed architecture. We evaluate our proposals using a set of SPECFP2006 and Physicsbench applications. Our experimental results show an average dynamic instruction reduction of 31% and 40% and an average speed up of 13% and 10% for SPECFP2006 and Physicsbench respectively, for 512-bit vector length, over the scalar baseline code.

Read more
Hardware Architecture

A bi-directional Address-Event transceiver block for low-latency inter-chip communication in neuromorphic systems

Neuromorphic systems typically use the Address-Event Representation (AER) to transmit signals among nodes, cores, and chips. Communication of Address-Events (AEs) between neuromorphic cores/chips typically requires two parallel digital signal buses for Input/Output (I/O) operations. This requirement can become very expensive for large-scale systems in terms of both dedicated I/O pins and power consumption. In this paper we present a compact fully asynchronous event-driven transmitter/receiver block that is both power efficient and I/O efficient. This block implements high-throughput low-latency bi-directional communication through a parallel AER bus. We show that by placing the proposed AE transceiver block in two separate chips and linking them by a single AER bus, we can drive the communication and switch the transmission direction of the shared bus on a single event basis, from either side with low-latency. We present experimental results that validate the circuits proposed and demonstrate reliable bi-directional event transmission with high-throughput. The proposed AE block, integrated in a neuromorphic chip fabricated using a 28 nm FDSOI process, occupies a silicon die area of 140 {\mu}m x 70 {\mu}m. The experimental measurements show that the event-driven AE block combined with standard digital I/Os has a direction switch latency of 5 ns and can achieve a worst-case bi-directional event transmission throughput of 28.6M Events/second while consuming 11 pJ per event (26-bit) delivery.

Read more
Hardware Architecture

A distributed memory, local configuration technique for re-configurable logic designs

The use and location of memory in integrated circuits plays a key factor in their performance. Memory requires large physical area, access times limit overall system performance and connectivity can result in large fan-out. Modern FPGA systems and ASICs contain an area of memory used to set the operation of the device from a series of commands set by a host. Implementing these settings registers requires a level of care otherwise the resulting implementation can result in a number of large fan-out nets that consume valuable resources complicating the placement of timing critical pathways. This paper presents an architecture for implementing and programming these settings registers in a distributed method across an FPGA and how the presented architecture works in both clock-domain crossing and dynamic partial re-configuration applications. The design is compared to that of a `global' settings register architecture. We implement the architectures using Intel FPGAs Quartus Prime software targeting an Intel FPGA Cyclone V. It is shown that the distributed memory architecture has a smaller resource cost (as small as 25% of the ALMs and 20% of the registers) compared to the global memory architectures.

Read more
Hardware Architecture

A high-performance MEMRISTOR-based Smith-Waterman DNA sequence alignment Using FPNI structure

This paper aims to present a new re-configuration sequencing method for difference of read lengths that may take place as input data in which is crucial drawbacks lay impact on DNA sequencing methods.

Read more
Hardware Architecture

A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.

Read more
Hardware Architecture

A portable and Linux capable RISC-V computer system in Verilog HDL

RISC-V is an open and royalty free instruction set architecture which has been developed at the University of California, Berkeley. The processors using RISC-V can be designed and released freely. Because of this, various processor cores and system on chips (SoCs) have been released so far. However, there are a few public RISC-V computer systems that are portable and can boot Linux operating systems. In this paper, we describe a portable and Linux capable RISC-V computer system targeting FPGAs in Verilog HDL. This system can be implemented on an FPGA with fewer hardware resources, and can be implemented on low cost FPGAs or customized by introducing an accelerator. This paper also describes the knowledge obtained through the development of this RISC-V computer system.

Read more

Ready to get started?

Join us today