Featured Research

Hardware Architecture

Combinatorics and Geometry for the Many-ported, Distributed and Shared Memory Architecture

Manycore SoC architectures based on on-chip shared memory are preferred for flexible and programmable solutions in many application domains. However, developing many-ported memory is becoming increasingly challenging as we approach the end of Moore's Law while system requirements demand larger shared memory and more access ports. Memory can no longer be designed simply to minimize single-transaction access time; it must take into account the functionality on the SoC. In this paper we examine a common use of large memory in SoCs, where the memory stores large buffers that are then moved for time-scheduled processing. We merge two aspects of many-ported memory design, combinatorial analysis of the interconnect and geometric analysis of critical paths, and extend both to show that in this case SoC performance benefits significantly from a hierarchical, distributed, and staged architecture with lower-radix switches and fractal randomization of memory bank addressing, along with judicious, geometry-aware application of speedup. The results show that the new architecture supports 20% higher throughput with 20% lower latency and 30% less interconnection area at approximately the same power consumption. We demonstrate the flexibility and scalability of this architecture on silicon from a physical design perspective by taking the design through layout. The architecture enables a much easier implementation flow that works well with physically irregular port access and memory-dominant layout, which is a common issue in real designs.
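
To illustrate why randomized bank addressing can help, the sketch below contrasts plain low-order interleaving with a generic XOR-folded bank hash. This is not the paper's fractal scheme, which the abstract does not specify; the bank count, the hash, and all function names are assumptions chosen for illustration.

```python
# Hypothetical sketch: linear vs. XOR-hashed bank selection (NOT the paper's scheme).
NUM_BANKS = 8   # assumed bank count, a power of two
BANK_BITS = 3   # log2(NUM_BANKS)


def linear_bank(addr: int) -> int:
    """Low-order interleaving: consecutive addresses map to consecutive banks."""
    return addr % NUM_BANKS


def hashed_bank(addr: int) -> int:
    """XOR the next-higher address field into the bank index to break up strides."""
    return (addr ^ (addr >> BANK_BITS)) % NUM_BANKS


def banks_touched(bank_fn, stride: int, count: int = 8) -> set:
    """Return the set of banks hit by `count` accesses spaced `stride` words apart."""
    return {bank_fn(i * stride) for i in range(count)}


if __name__ == "__main__":
    # A stride equal to the bank count is the worst case for linear interleaving.
    print("linear, stride 8:", sorted(banks_touched(linear_bank, 8)))
    print("hashed, stride 8:", sorted(banks_touched(hashed_bank, 8)))
```

With a stride equal to the bank count, linear interleaving sends every access to one bank, while the hashed mapping spreads the same accesses over all banks.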

Comparative Analysis of Polynomial and Rational Approximations of Hyperbolic Tangent Function for VLSI Implementation

Deep neural networks yield state-of-the-art results in many computer vision and human-machine interface applications, such as object detection and speech recognition. Since these networks are computationally expensive, customized accelerators are designed to achieve the required performance at lower cost and power. One of the key building blocks of these neural networks is the non-linear activation function, such as sigmoid, hyperbolic tangent (tanh), or ReLU. A low-complexity, accurate hardware implementation of the activation function is required to meet the performance and area targets of neural network accelerators. Even though various methods and implementations of the tanh activation function have been published, a comparative study is missing. This paper presents a comparative analysis of polynomial and rational methods and their hardware implementations.
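
As a rough illustration of the two families being compared, the sketch below evaluates a degree-5 Taylor polynomial and a [3/2] Padé rational approximation of tanh against `math.tanh`. The specific approximants, the interval [-2, 2], and the function names are assumptions for illustration; the abstract does not say which polynomial or rational forms the paper evaluates.

```python
import math


def tanh_taylor5(x: float) -> float:
    """Degree-5 Taylor polynomial of tanh around 0: x - x^3/3 + 2x^5/15 (Horner form)."""
    x2 = x * x
    return x * (1.0 + x2 * (-1.0 / 3.0 + x2 * (2.0 / 15.0)))


def tanh_pade32(x: float) -> float:
    """[3/2] Pade approximant of tanh: x(15 + x^2) / (15 + 6x^2). Needs one division."""
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)


def max_abs_error(approx, lo: float = -2.0, hi: float = 2.0, steps: int = 401) -> float:
    """Largest absolute deviation from math.tanh on an evenly spaced grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(approx(x) - math.tanh(x)))
    return worst


if __name__ == "__main__":
    print("Taylor degree 5:", max_abs_error(tanh_taylor5))
    print("Pade [3/2]     :", max_abs_error(tanh_pade32))
```

On this interval the rational form tracks tanh far more closely than the polynomial of comparable order, but it costs a divider in hardware. That accuracy-versus-complexity trade-off is the kind a comparative study has to weigh.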

Comparing quaternary and binary multipliers

We compare the implementation of an 8x8-bit multiplier with two different implementations of a 4x4 quaternary-digit multiplier. Interfacing the binary multiplier with quaternary-to-binary decoders and binary-to-quaternary encoders leads to a 4x4 multiplier that outperforms the best direct implementation of a 4x4 quaternary multiplier. The far greater complexity of the 1-digit multipliers and 1-digit adders used in the direct implementation, compared with binary 1-bit multipliers and full adders, cannot be compensated by the reduced count of quaternary operators. Because the best quaternary multiplier includes the corresponding binary one, the quaternary multiplier offers no opportunity for fewer interconnects, less chip area, or less power dissipation.
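
The decode, multiply, encode structure described above can be sketched at the functional level as follows. This models only the arithmetic, not the circuit; the function names are hypothetical, and Python's integer multiply stands in for the 8x8-bit binary multiplier.

```python
def quat_to_int(digits):
    """Decode quaternary digits (most significant first); each digit carries 2 bits."""
    value = 0
    for d in digits:
        if not 0 <= d <= 3:
            raise ValueError("quaternary digits must be in 0..3")
        value = (value << 2) | d
    return value


def int_to_quat(value, width):
    """Encode an integer as `width` quaternary digits, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(value & 3)
        value >>= 2
    return digits[::-1]


def quat_multiply(a_digits, b_digits):
    """Multiply two quaternary operands by decoding, multiplying in binary, and re-encoding."""
    product = quat_to_int(a_digits) * quat_to_int(b_digits)  # stand-in for the binary multiplier
    return int_to_quat(product, len(a_digits) + len(b_digits))


if __name__ == "__main__":
    # 3333 (base 4) is 255, so the product is 255 * 255 = 65025.
    print(quat_multiply([3, 3, 3, 3], [3, 3, 3, 3]))
```

Because each quaternary digit maps to exactly two bits, the decoders and encoders only regroup bits, which is why a 4-digit quaternary operand corresponds to an 8-bit binary one.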

Comparing ternary and binary adders and multipliers

While many papers have proposed implementations of ternary adders and ternary multipliers, they have generally not been compared with the corresponding binary ones. We compare implementations of binary and ternary adders and multipliers with the same computing capability in terms of their basic blocks: 1-bit and 1-trit adders, and 1-bit and 1-trit multipliers. We then compare the complexity of these basic blocks using the same CNTFET technology to evaluate the overall complexity of N-bit adders and M-trit adders on one side, and NxN-bit multipliers and MxM-trit multipliers on the other, with M = N/IR (IR = log(3)/log(2) is the information ratio). While ternary adders and multipliers have fewer input and output connections and use fewer basic building blocks, the complexity of the ternary building blocks is too high, and the ternary adders and multipliers cannot compete with the binary ones.
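
A small worked example of the information ratio may help. The sketch below computes N/IR for a few word widths and, as an added assumption, the smallest whole number of trits that covers the same range as N bits. The abstract only states M = N/IR, so the rounding and the function name are illustrative.

```python
import math

# Information ratio from the abstract: bits of information carried per trit.
INFORMATION_RATIO = math.log(3) / math.log(2)


def trits_for_bits(n_bits: int) -> int:
    """Smallest M with 3**M >= 2**n_bits, using exact integer arithmetic (assumed rounding)."""
    m = 0
    while 3 ** m < 2 ** n_bits:
        m += 1
    return m


if __name__ == "__main__":
    for n in (8, 16, 32):
        ideal = n / INFORMATION_RATIO
        print(f"N = {n:2d} bits: N/IR = {ideal:.2f}, whole trits needed = {trits_for_bits(n)}")
```

The ternary operand is shorter by a factor of about 1.585, which is the source of the fewer connections and building blocks noted above.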

Compiler Directed Speculative Intermittent Computation

This paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec speculates that power failure will not occur and holds all committed stores in a store buffer (SB), as if they were speculative, in case of misspeculation. The CoSpec compiler first partitions a given program into a series of recoverable code regions with the SB size in mind, so that no region overflows the SB. When program control reaches the end of a region, the speculation has succeeded, and all the buffered stores of the region are released to NVM. If power failure occurs during the execution of a region, all its speculative stores disappear with the volatile SB, i.e., they never affect program state in NVM. Consequently, the interrupted region can be restarted with consistent program state in the wake of power failure. To hide the latency of the SB release, i.e., essentially NVM writes, at each region boundary, CoSpec overlaps the NVM writes of the current region with the speculative execution of the next region. Such instruction-level parallelism gives an illusion of out-of-order execution on top of the in-order processor, achieving a speedup of more than 1.2X when there is no power outage. Our experiments on a set of real energy-harvesting traces with frequent outages demonstrate that CoSpec outperforms the state-of-the-art scheme by 1.8X to 3X on average.
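
The region-level behavior described above can be modeled with a toy store buffer, sketched below under simplifying assumptions: stores are coalesced by address, a region is a plain list of (address, value) pairs, and the overlap of NVM writes with the next region is not modeled. The class and function names are hypothetical and not part of CoSpec.

```python
class SpeculativeStoreBuffer:
    """Toy model: a region's stores stay in a volatile buffer until the region ends."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.nvm = {}     # persistent state, survives power failure
        self.buffer = {}  # volatile state, lost on power failure

    def store(self, addr, value):
        # The compiler is assumed to size regions so that this never triggers.
        if addr not in self.buffer and len(self.buffer) >= self.capacity:
            raise OverflowError("region exceeds store buffer capacity")
        self.buffer[addr] = value

    def load(self, addr):
        # Loads see the newest buffered value first, then fall back to NVM.
        return self.buffer.get(addr, self.nvm.get(addr, 0))

    def commit_region(self):
        # Region boundary reached: speculation succeeded, release stores to NVM.
        self.nvm.update(self.buffer)
        self.buffer.clear()

    def power_failure(self):
        # Buffered stores vanish; NVM keeps the state of the last committed region.
        self.buffer.clear()


def run_region(sb, region, fail_before_commit=False):
    """Execute one region; return True if it committed, False if power failed first."""
    for addr, value in region:
        sb.store(addr, value)
    if fail_before_commit:
        sb.power_failure()
        return False
    sb.commit_region()
    return True


if __name__ == "__main__":
    sb = SpeculativeStoreBuffer(capacity=2)
    run_region(sb, [("x", 1), ("y", 2)])
    run_region(sb, [("x", 10)], fail_before_commit=True)  # interrupted region
    print(sb.nvm)  # NVM still holds the state from the first region
    run_region(sb, [("x", 10)])  # restart the interrupted region
    print(sb.nvm)
```

Because an interrupted region never reaches NVM, re-executing it from its start sees exactly the state left by the previous region.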

ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning

DNN accelerators provide efficiency by reusing activations, weights, and outputs during DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work on exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) for a given dataflow, so as to optimize performance or energy while meeting the platform's area and power constraints for the DNN(s) of interest, is still relatively unexplored. The design space of choices for balancing compute and memory explodes combinatorially, as we show in this work (e.g., as large as O(10^72) choices for running MobileNet), making manual tuning via exhaustive search infeasible. It is also difficult to come up with a specific heuristic, given that different DNNs and layer types exhibit different amounts of reuse. In this paper, we propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style. ConfuciuX uses a reinforcement learning method, REINFORCE, to guide the search process, with a detailed HW performance cost model inside the training loop to estimate rewards. We also augment the RL approach with a genetic algorithm for further fine-tuning. ConfuciuX demonstrates the highest sample efficiency for training compared with other techniques such as Bayesian optimization, genetic algorithms, simulated annealing, and other RL methods. It converges to the optimized hardware configuration 4.7 to 24 times faster than alternative techniques.
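
To give a feel for the combinatorial growth mentioned above, the sketch below counts assignments under an assumed formulation: every layer independently picks one of a few compute (PE) levels and one of a few buffer levels. The per-layer formulation, the option counts, and the function names are assumptions; the example is not meant to reproduce the O(10^72) figure.

```python
def design_space_size(num_layers: int, pe_options: int, buffer_options: int) -> int:
    """Count assignments when each layer independently picks a PE level and a buffer level."""
    return (pe_options * buffer_options) ** num_layers


def order_of_magnitude(n: int) -> int:
    """Exponent e such that 10**e <= n < 10**(e + 1), for positive integers."""
    return len(str(n)) - 1


if __name__ == "__main__":
    # Hypothetical example: 12 PE levels and 12 buffer levels per layer.
    for layers in (5, 10, 20):
        size = design_space_size(layers, 12, 12)
        print(f"{layers:2d} layers: about 10^{order_of_magnitude(size)} assignments")
```

Each added layer multiplies the count by the number of per-layer choices, which is why exhaustive search stops being practical after a handful of layers.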

Coordinated Management of DVFS and Cache Partitioning under QoS Constraints to Save Energy in Multi-Core Systems

Reducing the energy expended to carry out a computational task is important. In this work, we explore the prospects of meeting the Quality-of-Service (QoS) requirements of tasks on a multi-core system while adjusting resources to expend a minimum of energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of every application in multi-programmed workloads running on multi-core systems. It does so by exploring the configuration space of LLC partition sizes and Dynamic Voltage and Frequency Scaling (DVFS) settings at runtime with negligible overhead. We show that the energy of 4-core and 8-core systems can be reduced by up to 18% and 14%, respectively, compared to a baseline with an even distribution of cache resources and a fixed mid-range core voltage-frequency setting. The energy savings can potentially reach 29% if the QoS targets are relaxed to 40% longer execution time.
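
The idea of searching LLC partitions and voltage-frequency settings jointly under a QoS bound can be sketched with a brute-force toy model, shown below. Everything numeric here is an assumption for illustration: the operating points, the execution-time and energy models, and the application parameters. The paper's runtime algorithm is not an exhaustive search.

```python
from itertools import product

# Hypothetical operating points: (frequency in GHz, voltage in V).
VF_LEVELS = [(1.0, 0.8), (1.5, 0.9), (2.0, 1.0)]
TOTAL_WAYS = 8  # assumed number of LLC ways to partition


def exec_time(app, ways, freq):
    """Toy model: compute time shrinks with frequency, memory stall time with more ways."""
    return app["compute"] / freq + app["mem"] / ways


def energy(app, ways, freq, volt):
    """Toy model: dynamic energy scales with V^2, plus a static term over the run time."""
    return app["compute"] * volt ** 2 + 0.2 * exec_time(app, ways, freq)


def best_config(apps, qos):
    """Search all way partitions and per-core VF levels; return the min-energy feasible one."""
    best = None
    n = len(apps)
    for ways in product(range(1, TOTAL_WAYS), repeat=n):
        if sum(ways) != TOTAL_WAYS:
            continue  # every way must be assigned to exactly one core
        for levels in product(VF_LEVELS, repeat=n):
            times = [exec_time(a, w, f) for a, w, (f, _) in zip(apps, ways, levels)]
            if any(t > q for t, q in zip(times, qos)):
                continue  # violates some application's QoS target
            total = sum(energy(a, w, f, v) for a, w, (f, v) in zip(apps, ways, levels))
            if best is None or total < best[0]:
                best = (total, ways, levels)
    return best


if __name__ == "__main__":
    apps = [{"compute": 4.0, "mem": 8.0}, {"compute": 6.0, "mem": 1.0}]
    # QoS target: no slower than an even cache split at the mid-range VF setting.
    qos = [exec_time(a, TOTAL_WAYS // 2, 1.5) for a in apps]
    print(best_config(apps, qos))
```

The QoS target here mirrors the baseline in the abstract (even cache split, fixed mid-range setting), so the baseline is always feasible and the search can only match or improve on its energy.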

Coordinated Management of Processor Configuration and Cache Partitioning to Optimize Energy under QoS Constraints

An effective way to improve energy efficiency is to throttle hardware resources to meet a certain performance target, specified as a QoS constraint, associated with all applications running on a multicore system. Prior art has proposed resource management (RM) frameworks in which the share of the last-level cache (LLC) assigned to each processor and the voltage-frequency (VF) setting for each processor are managed in a coordinated fashion to reduce energy. A drawback of such a scheme is that, when one core gives up LLC resources for another core, the performance drop must be compensated by a higher VF setting, which leads to a quadratic increase in energy consumption. By allowing each core to be adapted to exploit instruction-level and memory-level parallelism (ILP/MLP), substantially higher energy savings are enabled. This paper proposes a coordinated RM for LLC partitioning, processor adaptation, and per-core VF scaling. A first contribution is a systematic study of the resource trade-offs enabled when trading between the three classes of resources in a coordinated fashion. A second contribution is a new RM framework that utilizes these trade-offs to save more energy. Finally, a challenge in accurately modeling the impact of resource throttling on performance is predicting the amount of MLP with high accuracy. To this end, the paper contributes a mechanism that estimates the effect of MLP over different processor configurations and LLC allocations. Overall, we show that up to 18% of energy, and on average 10%, can be saved using the proposed scheme.

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads

Sparse matrices are key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been storing and transferring data efficiently, for which many sparse formats have been proposed that eliminate most zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful at speeding up processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization (the classic challenge of sparse problems), their decompression mechanism can create a computation bottleneck. Not only is this challenge unresolved, but it also becomes more serious with the advent of domain-specific architectures (DSAs), as they aim to improve performance more aggressively. The performance implications of using various formats along with DSAs, however, have not been extensively studied by prior work. To fill this knowledge gap, we characterize the performance impact of seven frequently used sparse formats, based on a DSA for sparse matrix-vector multiplication (SpMV) implemented on an FPGA using high-level synthesis (HLS) tools, an increasingly popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.
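
For readers unfamiliar with compressed sparse formats, the sketch below shows SpMV over compressed sparse row (CSR) storage in plain Python. CSR is used here only as a widely known example; the abstract does not list the seven formats studied, and the function names are hypothetical. The indirect lookup `x[col_idx[k]]` is the kind of decompression step whose cost the paper characterizes.

```python
def dense_to_csr(matrix):
    """Compress a dense row-major matrix into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row's entries start
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access decodes the column index
        y.append(acc)
    return y


if __name__ == "__main__":
    a = [[1, 0, 2],
         [0, 0, 0],
         [0, 3, 0]]
    values, col_idx, row_ptr = dense_to_csr(a)
    print(values, col_idx, row_ptr)
    print(csr_spmv(values, col_idx, row_ptr, [1, 2, 3]))
```

The format stores three nonzeros instead of nine entries, at the price of an extra index array and data-dependent addressing in the inner loop.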

Coprocessors: failures and successes

We examine the appearance of coprocessors, their disappearance through integration into the CPU, and their success or failure by summarizing their characteristics since the mainframes of the 1960s. The coprocessors reviewed in most detail are the IBM 360 and CDC-6600 I/O processors, the Intel 8087 math coprocessor, the Cell processor, the Intel Xeon Phi coprocessors, GPUs, FPGAs, and the coprocessors of the manycores SW26010 and Pezy SC-2 used in high-ranked supercomputers in the TOP500 or Green500. The conditions for a coprocessor to be viable in the medium or long term are defined.
