Featured Research

Hardware Architecture

CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism

Many engineering and scientific applications require high-precision arithmetic, for which IEEE 754-2008 compliant floating-point arithmetic is the de facto standard. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit data representation and its arithmetic offer several advantages over the floating-point format and arithmetic, including higher dynamic range, better accuracy, and superior performance-area trade-offs. In this paper, we present a consolidated, general-purpose, processor-based framework to support posit arithmetic empiricism. End users of the framework are free to seamlessly experiment with their applications using both posit and floating-point arithmetic, since the framework is designed for the two number systems to coexist. The framework consists of Melodica and Clarinet. Melodica is a posit arithmetic core that implements a parametric fused multiply-accumulate unit and, more importantly, supports the quire data type. Clarinet is a Melodica-enabled processor based on the RISC-V ISA. To the best of our knowledge, this is the first integration of a quire into a RISC-V core. To show the effectiveness of the Clarinet platform, we perform an extensive application study and benchmarking on common linear algebra and computer vision kernels. We perform ASIC synthesis of Clarinet and Melodica on a 90 nm Faraday CMOS process. Finally, based on our analysis and synthesis results, we define a quality metric for the different instances of Clarinet that yields initial recommendations on their relative merits. Clarinet-Melodica is a platform designed for easy experimentation and will be made available as open source to support posit arithmetic empiricism.
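
The framework above is hardware (the Melodica core plus a RISC-V processor); as a purely software illustration of the posit format it builds on, the sketch below decodes a standard posit&lt;nbits, es&gt; bit pattern into a float. It follows the published posit encoding rules (sign, regime, exponent, fraction); the function name and the posit&lt;16,1&gt; examples are our own illustrative choices, not part of the paper.

```python
def decode_posit(word: int, nbits: int = 16, es: int = 1) -> float:
    """Decode an nbits-wide posit<nbits, es> bit pattern into a float."""
    mask = (1 << nbits) - 1
    word &= mask
    if word == 0:
        return 0.0
    if word == 1 << (nbits - 1):
        return float("nan")                      # NaR ("not a real")
    sign = -1.0 if word >> (nbits - 1) else 1.0
    if sign < 0:
        word = (-word) & mask                    # decode the two's complement
    bits = format(word, f"0{nbits}b")[1:]        # drop the sign bit
    run = len(bits) - len(bits.lstrip(bits[0]))  # regime run length
    k = run - 1 if bits[0] == "1" else -run
    rest = bits[run + 1:]                        # skip the regime terminator
    ebits = rest[:es]
    exp = int(ebits, 2) << (es - len(ebits)) if ebits else 0
    fbits = rest[es:]
    frac = int(fbits, 2) / (1 << len(fbits)) if fbits else 0.0
    return sign * (1.0 + frac) * 2.0 ** (k * (1 << es) + exp)

print(decode_posit(0x7FFF))   # maxpos of posit<16,1> = 2**28
print(decode_posit(0x0001))   # minpos of posit<16,1> = 2**-28
```

The 2**28 dynamic range of a 16-bit posit, visible in the last two lines, illustrates the advantage over same-width floats that the abstract alludes to.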

Hardware Architecture

CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off

DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because this capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance under very different and dynamic workload demands. This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new DRAM architecture that enables a dynamic capacity-latency trade-off at low cost. CLR-DRAM allows any DRAM row to be dynamically reconfigured between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip, and 2) high-performance mode, where two adjacent DRAM cells in a row and their sense amplifiers are coupled to operate as a single low-latency logical cell driven by a single logical sense amplifier. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM improves system performance by 18.6% and reduces DRAM energy consumption by 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for systems that adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.
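
As a toy functional model of the reconfiguration idea (not the circuit-level design in the paper), the sketch below treats each row as switchable between the two modes: coupling pairs of cells halves the row's capacity and lowers its activation latency. The capacity and latency numbers are invented placeholders.

```python
from dataclasses import dataclass

@dataclass
class Row:
    cells: int = 1024               # physical cells in the row
    high_performance: bool = False  # the per-row CLR-DRAM mode

    @property
    def capacity_bits(self) -> int:
        # Coupling two adjacent cells into one logical cell halves capacity.
        return self.cells // 2 if self.high_performance else self.cells

    @property
    def activation_latency_ns(self) -> float:
        # Two coupled cells drive one logical sense amplifier with more
        # charge, so the row can be sensed faster (numbers are made up).
        return 9.0 if self.high_performance else 15.0

row = Row()
print(row.capacity_bits, row.activation_latency_ns)  # 1024 15.0
row.high_performance = True                          # reconfigured at run time
print(row.capacity_bits, row.activation_latency_ns)  # 512 9.0
```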

Hardware Architecture

CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit

Data-intensive applications are poised to benefit directly from processing-in-memory platforms, such as memristive Memory Processing Units, which allow leveraging data locality and performing stateful logic operations. Developing design automation flows for such platforms is a challenging and highly relevant research problem. In this work, we investigate the problem of minimizing delay under an arbitrary area constraint for MAGIC (Memristor-Aided loGIC)-based in-memory computing platforms. We propose CONTRA, an end-to-end area-constrained technology mapping framework. CONTRA uses Look-Up Table (LUT)-based mapping of the input function onto the crossbar array to maximize parallel operations, and uses a novel search technique to move data optimally inside the array. CONTRA accepts benchmarks in a variety of formats, along with the crossbar dimensions, as input to generate MAGIC instructions. CONTRA scales to large benchmarks, as demonstrated by our experiments. CONTRA maps benchmarks to smaller crossbar dimensions than any previous technique while allowing a wide variety of area-delay trade-offs, and improves the composite metric of area-delay product by 2.1x to 13.1x compared to seven existing technology mapping approaches.
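
The composite metric quoted at the end is the area-delay product (ADP). Below is a minimal helper showing how such an improvement factor would be computed against a baseline mapper; the crossbar sizes and cycle counts are invented for illustration, not results from the paper.

```python
def area_delay_product(area_cells: int, delay_cycles: int) -> int:
    """ADP: crossbar area (in cells) times latency (in MAGIC cycles)."""
    return area_cells * delay_cycles

baseline = area_delay_product(area_cells=64 * 64, delay_cycles=900)
mapped   = area_delay_product(area_cells=32 * 32, delay_cycles=1500)
print(f"ADP improvement: {baseline / mapped:.1f}x")  # 2.4x with these numbers
```

Note the trade-off the example encodes: the smaller crossbar costs extra cycles, yet the composite metric still improves.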

Hardware Architecture

CUTIE: Beyond PetaOp/s/W Ternary DNN Inference Acceleration with Better-than-Binary Energy Efficiency

We present a 3.1 POp/s/W fully digital hardware accelerator for ternary neural networks. CUTIE, the Completely Unrolled Ternary Inference Engine, focuses on minimizing non-computational energy and switching activity so that the dynamic power spent on storing (locally or globally) intermediate results is minimized. This is achieved by 1) a data path architecture completely unrolled in the feature map and filter dimensions, which reduces switching activity by favoring silencing over iterative computation and maximizes data re-use, 2) targeting ternary neural networks, which, in contrast to binary NNs, allow for sparse weights that reduce switching activity, and 3) an optimized training method that increases the sparsity of the filter weights, resulting in a further reduction of switching activity. Compared with state-of-the-art accelerators, CUTIE achieves greater or equal accuracy while decreasing the overall core inference energy cost by a factor of 4.8x to 21x.
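
A quick software analogue of the "silencing" idea above: with ternary weights in {-1, 0, +1}, zero weights contribute nothing, so the corresponding multiplications (and, in hardware, the switching activity they would cause) can be skipped entirely. This is only an illustration of the principle, not CUTIE's unrolled data path; the sparsity level is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.integers(-1, 2, size=128)                      # ternary inputs
weights = rng.choice([-1, 0, 1], size=128, p=[0.15, 0.7, 0.15])  # sparse ternary

dense = int(activations @ weights)                   # every product computed
active = weights != 0
sparse = int(activations[active] @ weights[active])  # zero taps "silenced"
assert dense == sparse                               # same result, fewer multiplies
print(f"skipped {np.mean(~active):.0%} of the multiplies")  # ~70% here
```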

Hardware Architecture

Cache Bypassing for Machine Learning Algorithms

Graphics Processing Units (GPUs) were once used solely for graphical computation tasks, but with the rise of machine learning applications, their use for general-purpose computing has increased in the last few years. GPUs employ a massive number of threads that, in turn, achieve a high degree of parallelism. Although GPUs have substantial computational power, they suffer from cache contention due to the SIMT execution model they use. One solution to this problem is called "cache bypassing". This paper presents a predictive model that analyzes the access patterns of various machine learning algorithms and determines whether certain data should be stored in the cache or not. It presents insights into how well each model performs on different datasets and also shows how shrinking each model affects its performance. The accuracy of most of the models was found to be around 90%, with KNN performing best, though not with the smallest model size. We further enrich the feature set by splitting the addresses into 4-byte chunks. We observe that this substantially improves the performance of the neural network, raising its accuracy to 99.9% with three neurons.
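
The address-splitting step described above is easy to make concrete. The sketch below is our reading of it: a 64-bit address is cut into 4-byte (32-bit) chunks, each used as a separate input feature for the bypass predictor. The function name, address width, and example address are our own assumptions.

```python
def address_features(addr: int, chunk_bytes: int = 4, addr_bytes: int = 8):
    """Split an address into chunk_bytes-wide integer features, low chunk first."""
    chunk_bits = chunk_bytes * 8
    chunk_mask = (1 << chunk_bits) - 1
    return [(addr >> (i * chunk_bits)) & chunk_mask
            for i in range(addr_bytes // chunk_bytes)]

# A hypothetical 64-bit address split into two 32-bit features:
print(address_features(0x00007F3A_1C2B0040))  # [0x1C2B0040, 0x00007F3A]
```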

Hardware Architecture

Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors

Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation, making them suitable for edge computing. Unfortunately, these devices' limited instruction sets and register counts make developing complex algorithms difficult. In this work, we present Cain, a compiler targeting SCAMP-5, a general-purpose FPSP, that generates code for multiple convolutional kernels simultaneously. As an example, given the convolutional kernels of an MNIST digit recognition neural network, Cain produces code that is half as long as that generated by the other available compilers for SCAMP-5.
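
We cannot reproduce Cain's search here, but the problem it targets can be sketched: on an FPSP, one instruction acts on the whole image plane at once, so applying a kernel decomposes into plane-wide neighbour shifts, scalings, and adds. The NumPy sketch below mimics that decomposition for a single 3x3 kernel (a Laplacian, as an arbitrary example); Cain's job is to find short sequences of this kind, for several kernels at once, within SCAMP-5's restricted instruction set.

```python
import numpy as np

def shift(img, dy, dx):
    """Move the whole image plane by (dy, dx), zero-filling the border."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = (
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)])
    return out

kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)   # Laplacian, one example kernel

img = np.arange(36.0).reshape(6, 6)
acc = np.zeros_like(img)
for dy in range(-1, 2):
    for dx in range(-1, 2):
        c = kernel[dy + 1, dx + 1]
        if c:                                  # zero taps cost nothing
            acc += c * shift(img, -dy, -dx)    # one plane-wide shift-and-add
```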

Hardware Architecture

Characteristics of Reversible Circuits for Error Detection

In this work, we consider error detection via simulation for reversible circuit architectures. We rigorously prove that reversibility considerably strengthens this simple error detection protocol: a single randomly generated input reveals a single error with a probability that depends only on the size of the error, not on the size of the circuit itself. Empirical studies confirm that this behavior typically extends to multiple errors as well. In conclusion, reversible circuits offer characteristics that reduce masking effects, a desirable feature that is in stark contrast to irreversible circuit architectures.
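
To make the claim concrete, here is a small, self-contained experiment in the spirit of the paper's simulation-based protocol (the gates and circuit are our own toy example, not the authors' benchmarks): model a reversible circuit on n wires as a bijection on its 2**n states, swap one gate for a faulty one, and count how often a random input exposes the difference. Because the circuit is reversible, a difference, once introduced, cannot be masked downstream; here it is exposed on about a quarter of inputs regardless of the circuit width.

```python
import random

def cnot(state, ctrl, tgt):
    """Flip bit tgt iff bit ctrl is set."""
    return state ^ (1 << tgt) if (state >> ctrl) & 1 else state

def toffoli(state, c1, c2, tgt):
    """Flip bit tgt iff bits c1 and c2 are both set."""
    if (state >> c1) & 1 and (state >> c2) & 1:
        return state ^ (1 << tgt)
    return state

def run(circuit, state):
    for gate in circuit:
        state = gate(state)
    return state

n = 8
good = [lambda s: cnot(s, 0, 3), lambda s: toffoli(s, 1, 2, 4), lambda s: cnot(s, 4, 7)]
bad  = [lambda s: cnot(s, 0, 3), lambda s: cnot(s, 1, 4),       lambda s: cnot(s, 4, 7)]

trials = 10_000
hits = sum(run(good, s) != run(bad, s)
           for s in (random.randrange(1 << n) for _ in range(trials)))
print(f"error detected on {hits / trials:.1%} of random inputs")  # ~25%
```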

Hardware Architecture

Chasing Carbon: The Elusive Environmental Footprint of Computing

Given recent algorithmic, software, and hardware innovation, computing has enabled a plethora of new applications. As computing becomes increasingly ubiquitous, however, so does its environmental impact. This paper brings the issue to the attention of computer-systems researchers. Our analysis, built on industry-reported characterization, quantifies the environmental effects of computing in terms of carbon emissions. Broadly, carbon emissions have two sources: operational energy consumption, and hardware manufacturing and infrastructure. Although carbon emissions from the former are decreasing thanks to algorithmic, software, and hardware innovations that boost performance and power efficiency, the overall carbon footprint of computer systems continues to grow. This work quantifies the carbon output of computer systems to show that most emissions related to modern mobile and data-center equipment come from hardware manufacturing and infrastructure. We therefore outline future directions for minimizing the environmental impact of computing systems.
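
The operational-versus-embodied split lends itself to a back-of-the-envelope calculation. The sketch below shows the arithmetic only; every number in it is an invented placeholder, not a figure from the paper.

```python
# Lifetime footprint = operational carbon + embodied carbon.
lifetime_energy_kwh = 3_000        # energy drawn over the device's lifetime
grid_kg_co2_per_kwh = 0.4          # carbon intensity of the electricity used
embodied_kg_co2 = 1_800            # manufacturing and infrastructure share

operational_kg_co2 = lifetime_energy_kwh * grid_kg_co2_per_kwh
total_kg_co2 = operational_kg_co2 + embodied_kg_co2
print(f"embodied share of footprint: {embodied_kg_co2 / total_kg_co2:.0%}")
```

With numbers of this shape, the embodied share dominates, which is the qualitative finding the abstract reports for modern mobile and data-center equipment.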

Hardware Architecture

Closed-Loop Neural Interfaces with Embedded Machine Learning

Neural interfaces capable of multi-site electrical recording, on-site signal classification, and closed-loop therapy are critical for the diagnosis and treatment of neurological disorders. However, deploying machine learning algorithms on low-power neural devices is challenging, given the tight constraints on computational and memory resources. In this paper, we review recent developments in embedding machine learning in neural interfaces, with a focus on design trade-offs and hardware efficiency. We also present our optimized tree-based model for low-power, memory-efficient classification of neural signals in brain implants. Using energy-aware learning and model compression, we show that the proposed oblique trees can outperform conventional machine learning models in applications such as seizure or tremor detection and motor decoding.
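
As a sketch of what inference with an oblique tree looks like (not the authors' trained model): unlike an axis-aligned tree, each internal node thresholds a linear combination of features, w . x > b, which needs only a handful of multiply-accumulates per decision. All weights, thresholds, and features below are made up.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    w: list                    # weights of the oblique split
    b: float                   # threshold
    left: Union["Node", int]   # subtree, or an int leaf label
    right: Union["Node", int]

def classify(node, x):
    while isinstance(node, Node):
        score = sum(wi * xi for wi, xi in zip(node.w, x))
        node = node.right if score > node.b else node.left
    return node                # leaf label, e.g. 0 = rest, 1 = seizure

tree = Node(w=[0.8, -0.2, 0.5], b=0.3,
            left=0,
            right=Node(w=[0.1, 0.9, -0.4], b=0.0, left=0, right=1))
features = [0.6, 0.7, 0.2]     # e.g. band-power features of one window
print(classify(tree, features))  # -> 1
```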

Hardware Architecture

Cognitive Computing in Data-centric Paradigm

Knowledge is the most precious asset of humankind. People extract experience from the data that our senses provide about reality. Broadly speaking, there is an analogy between the way humankind elaborates knowledge and the way an artificial system could. Digital data are the "feelings" of an artificial system, and such a system needs a method for extracting knowledge from the universe of data. The cognitive computing paradigm implies that a system should be able to extract knowledge from raw data without any human-made algorithm. The first step of the paradigm is the analysis of raw data streams through the discovery of repeatable data patterns. Knowledge of the relationships among the patterns provides a way to see structures and to generalize concepts, with the goal of synthesizing new statements. The cognitive computing paradigm is thus capable of mimicking the human ability to generalize notions. The generalization step provides the basis for discovering abstract notions, revealing abstract relations among patterns, and deriving general rules of structure synthesis. Continuing the process of structure generalization makes it possible to build a multi-level hierarchy of abstract notions. Moreover, discovering generalized classes of notions is the first step toward a paradigm of artificial analytical thinking. The most critical responsibilities of cognitive computing could be the classification of data and the recognition of the states of an input data stream. The synthesis of new statements creates the foundation for foreseeing possible data states and for elaborating knowledge about new data classes by synthesizing and testing hypotheses.

