Featured Research

Hardware Architecture

ALIGN: A System for Automating Analog Layout

ALIGN ("Analog Layout, Intelligently Generated from Netlists") is an open-source automatic layout generation flow for analog circuits. ALIGN translates an input SPICE netlist into an output GDSII layout for a given technology, as described by its set of design rules. The flow first automatically detects hierarchies in the circuit netlist and casts layout synthesis as a problem of hierarchical block assembly. At the lowest level, parameterized cells are generated using an abstraction of the design rules; these blocks are then assembled under geometric and electrical constraints to build the circuit layout. ALIGN has been applied to generate layouts for a diverse set of analog circuit families: low-frequency analog blocks, wireline circuits, wireless circuits, and power delivery circuits.
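
For intuition, here is a minimal, runnable Python sketch of the hierarchical flow the abstract describes (netlist in, hierarchy detection, rule-abstracted leaf cells, constrained assembly). All function and class names are hypothetical stand-ins, not ALIGN's actual API, and the placement step is deliberately simplistic.

# Hypothetical sketch of the hierarchical flow described above (not ALIGN's real API).
# It models the pipeline: netlist -> hierarchy detection -> parameterized leaf cells
# -> constrained assembly of blocks into a layout.

from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    devices: list
    children: list = field(default_factory=list)

def detect_hierarchy(netlist):
    # Toy "hierarchy detection": group devices sharing a name prefix into one block.
    groups = {}
    for dev in netlist:
        groups.setdefault(dev.split("_")[0], []).append(dev)
    leaves = [Block(name, devs) for name, devs in groups.items()]
    return Block("top", [], children=leaves)

def generate_cell(block, design_rules):
    # Toy parameterized cell: width grows with device count, snapped to the grid rule.
    grid = design_rules["grid"]
    return {"block": block.name, "width": len(block.devices) * grid, "height": grid}

def assemble(top, cells):
    # Toy placement: abut cells left-to-right (a stand-in for constrained assembly).
    x, placed = 0, []
    for child in top.children:
        cell = cells[child.name]
        placed.append({"cell": cell["block"], "x": x, "y": 0})
        x += cell["width"]
    return placed

netlist = ["diffpair_m1", "diffpair_m2", "mirror_m3", "mirror_m4"]
rules = {"grid": 4}
top = detect_hierarchy(netlist)
cells = {b.name: generate_cell(b, rules) for b in top.children}
print(assemble(top, cells))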

Read more
Hardware Architecture

AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU Scaling

Different GPU applications exhibit varying scalability patterns with respect to the network-on-chip (NoC), coalescing, memory and control divergence, and L1 cache behavior. A GPU consists of several Streaming Multiprocessors (SMs) that collectively determine how shared resources are partitioned and accessed. Recent years have seen divergent paths in SM scaling: scale-up (fewer, larger SMs) versus scale-out (more, smaller SMs). However, neither scaling up nor scaling out can meet the scalability requirements of all applications running on a given GPU system, which inevitably results in performance degradation and resource under-utilization for some applications. In this work, we investigate the major design parameters that influence GPU scaling. We then propose AMOEBA, a solution to GPU scaling through reconfigurable SM cores. AMOEBA monitors and predicts application scalability at run-time and adjusts the SM configuration to meet program requirements. AMOEBA also enables dynamic creation of heterogeneous SMs through independent fusing or splitting. AMOEBA is a microarchitecture-based solution and requires no additional programming effort or custom compiler support. Our experimental evaluations with application programs from various benchmark suites indicate that AMOEBA achieves a maximum performance gain of 4.3x and an average performance improvement of 47% across all benchmarks tested.
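
As a rough illustration of the run-time loop described above, the Python sketch below predicts whether a kernel prefers fused (scale-up) or split (scale-out) SMs from toy metrics and reconfigures accordingly. The predictor, thresholds, and metrics are invented for illustration and are not AMOEBA's actual mechanism.

# Illustrative sketch (not AMOEBA's implementation) of the run-time decision loop:
# predict whether a kernel scales better with fewer/larger ("fused") or more/smaller
# ("split") SM configurations and reconfigure accordingly.

def predict_scalability(metrics):
    # Toy predictor: memory-divergent, NoC-heavy kernels favour scale-up (fused SMs);
    # compute-dense kernels favour scale-out (split SMs).
    score = metrics["l1_miss_rate"] + metrics["noc_utilization"]
    return "fuse" if score > 1.0 else "split"

def reconfigure(base_sms, decision):
    # Toy reconfiguration: pair up SMs when fusing, subdivide them when splitting.
    return base_sms // 2 if decision == "fuse" else base_sms * 2

base_sms = 16
for kernel, metrics in [("bfs", {"l1_miss_rate": 0.7, "noc_utilization": 0.6}),
                        ("gemm", {"l1_miss_rate": 0.1, "noc_utilization": 0.2})]:
    decision = predict_scalability(metrics)
    print(kernel, decision, "->", reconfigure(base_sms, decision), "SMs")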

Read more
Hardware Architecture

ANDROMEDA: An FPGA Based RISC-V MPSoC Exploration Framework

With the growing demands of consumer electronic products, computational requirements are increasing exponentially. To meet these demands, computer architects are packing as many cores as possible onto a single die to accelerate the execution of application programs. In a multiprocessor system-on-chip (MPSoC), striking a balance among the number of cores, the memory subsystem, and the network-on-chip parameters is essential to attain the desired performance. In this paper, we present ANDROMEDA, a RISC-V based framework that allows us to explore different configurations of an MPSoC and observe the resulting performance penalties and gains. We emulate the various configurations of the MPSoC on the Synopsys HAPS-80D Dual FPGA platform. Using STREAM, matrix multiply, and N-body simulations as benchmarks, we demonstrate our framework's efficacy in quickly identifying the right parameters for efficient execution of these benchmarks.
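
The hedged sketch below shows the kind of design-space sweep such a framework enables: enumerate core count, cache size, and NoC width, evaluate each configuration, and keep the best. The toy cost model merely stands in for emulating a benchmark on the FPGA platform and is not part of ANDROMEDA.

# Hypothetical exploration loop: sweep MPSoC parameters and pick the configuration
# with the lowest modelled runtime. Parameter names and the cost model are invented.

from itertools import product

def emulate(cores, l2_kb, noc_width, benchmark):
    # Toy performance model: more cores help until the benchmark's parallelism saturates;
    # larger caches and wider NoCs reduce stall cycles.
    parallel_speedup = min(cores, benchmark["max_parallelism"])
    stall_penalty = benchmark["working_set_kb"] / l2_kb + benchmark["traffic"] / noc_width
    return benchmark["work"] / parallel_speedup + stall_penalty

benchmark = {"work": 100.0, "max_parallelism": 8, "working_set_kb": 512, "traffic": 64}
best = min(product([2, 4, 8, 16], [256, 512, 1024], [32, 64, 128]),
           key=lambda cfg: emulate(*cfg, benchmark))
print("best (cores, L2 KB, NoC width):", best)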

Read more
Hardware Architecture

ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient Iterative Compute

Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed- or floating-point, most of which operate least-significant digit first. This results in a fundamental problem: if, after some time, the result has not converged, is this because we have not run the algorithm for enough iterations or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users will often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: with the appropriate arithmetic able to generate results from most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and approximant index increasing in lockstep. Consequently, datapaths constructed following our principles demonstrate efficiency over their traditional arithmetic equivalents where the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. Use of most-significant-digit-first arithmetic additionally allows us to declare certain digits to be stable at runtime, avoiding their recalculation in subsequent iterations and thereby increasing performance and decreasing memory footprints. Versus arbitrary-precision iterative solvers without the optimisations we detail herein, we achieve up to 16x performance speedups and 1.9x memory savings for the evaluated benchmarks.
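
To make the most-significant-digit-first idea concrete, the toy Python sketch below runs a converging Newton iteration in software and counts how many leading characters of successive approximants agree, i.e. which digits could be declared stable and skipped in later iterations. It only mimics the digit-stability detection and says nothing about the online-arithmetic hardware itself.

# Toy, software-only illustration of digit stability across iterations of a converging
# algorithm (Newton's method for sqrt(2)). ARCHITECT realizes this idea in hardware with
# online (MSD-first) arithmetic; this sketch just shows leading digits stabilizing.

from decimal import Decimal, getcontext

getcontext().prec = 30

def leading_agreement(a, b):
    # Length of the common leading prefix of the two approximants' decimal expansions.
    sa, sb = f"{a:.25f}", f"{b:.25f}"
    return len(next((sa[:i] for i in range(len(sa)) if sa[i] != sb[i]), sa))

x = Decimal(1)
prev = None
for it in range(1, 8):
    x = (x + Decimal(2) / x) / 2            # Newton iteration for sqrt(2)
    if prev is not None:
        stable = leading_agreement(x, prev)
        print(f"iter {it}: {stable} leading characters stable, x = {x}")
    prev = x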

Read more
Hardware Architecture

AXES: Approximation Manager for Emerging Memory Architectures

Memory approximation techniques are commonly limited in scope, targeting individual levels of the memory hierarchy. Existing approximation techniques for a full memory hierarchy determine optimal configurations at design time, given a goal and an application. Such policies are rigid: they cannot adapt to unknown workloads and must be redesigned for different memory configurations and technologies. We propose AXES: the first self-optimizing runtime manager for coordinating configurable approximation knobs across all levels of the memory hierarchy. AXES continuously updates and optimizes its approximation-management policy throughout runtime for diverse workloads. AXES optimizes the approximate memory configuration to minimize power consumption without compromising the quality threshold specified by application developers. AXES can (1) learn a policy at runtime to manage variable application quality-of-service (QoS) constraints, (2) automatically optimize for a target metric within those constraints, and (3) coordinate runtime decisions for interdependent knobs and subsystems. We demonstrate AXES' ability to provide functions (1)-(3) efficiently on a RISC-V Linux platform with approximate memory segments in the on-chip cache and main memory, saving up to 37% of memory-subsystem energy without any design-time overhead and reducing QoS violations by 75% with less than 5% additional energy.
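
A minimal sketch of the kind of run-time knob coordination described above, assuming two invented knobs (dropped cache bits, skipped DRAM refreshes) and a toy energy/quality model. The real AXES policy is learned at runtime, whereas this is only a simple feedback rule for illustration.

# Illustrative feedback loop (not AXES' learned policy): relax approximation knobs across
# memory levels to save energy while measured quality stays above the QoS threshold, and
# tighten them when it does not.

import random

def run_epoch(knobs):
    # Toy model: more aggressive knobs save energy but degrade output quality.
    aggressiveness = knobs["cache_bits_dropped"] + knobs["dram_refresh_skip"]
    energy = 100 - 6 * aggressiveness
    quality = 1.0 - 0.02 * aggressiveness - random.uniform(0, 0.01)
    return energy, quality

knobs = {"cache_bits_dropped": 0, "dram_refresh_skip": 0}
qos_threshold = 0.92
random.seed(0)
for epoch in range(6):
    energy, quality = run_epoch(knobs)
    knob = "cache_bits_dropped" if epoch % 2 == 0 else "dram_refresh_skip"
    if quality > qos_threshold:
        knobs[knob] += 1          # room to approximate more: relax this knob
    elif knobs[knob] > 0:
        knobs[knob] -= 1          # QoS violated: back off
    print(f"epoch {epoch}: energy={energy}, quality={quality:.3f}, knobs={knobs}")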

Read more
Hardware Architecture

AZP: Automatic Specialization for Zero Values in Gaming Applications

Recent research has shown that dynamic zeros in the shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable's value, either the specialized fast path or the default slow path will execute. Prior work applied this transform manually and showed promising gains on gaming applications. In this paper, we present AZP, an automatic compiler approach to performing the above code-versioning transform. Our framework automatically determines which versioning variables, or combinations of them, are profitable, as well as the code region to duplicate and specialize (called the versioning scope). AZP takes operand zero-value probabilities as input and then uses classical techniques such as constant folding and dead-code elimination to determine the most profitable versioning variables and their versioning scopes. This information is then used to effect the final transform in a straightforward manner. We demonstrate that AZP achieves an average speedup of 16.4% for targeted shader programs, amounting to an average frame-rate speedup of 3.5% across a collection of modern gaming applications on an NVIDIA GeForce RTX 2080 GPU.
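
The transform itself is easy to show in miniature. The Python sketch below (the shader-like logic and variable names are invented) contrasts the original computation with its code-versioned form: the fast path assumes the versioning variable is zero, so the expensive term folds away, while the slow path keeps the original code.

# Illustrative sketch of the code-versioning transform itself (not AZP's compiler pass).

def shade_original(albedo, specular_weight, light):
    # specular_weight is the "versioning variable": profiling shows it is often zero.
    diffuse = albedo * light
    specular = specular_weight * (light ** 8)      # expensive term
    return diffuse + specular

def shade_versioned(albedo, specular_weight, light):
    if specular_weight == 0:
        # Fast path: with specular_weight == 0 the specular term constant-folds to 0
        # and its computation is dead-code eliminated.
        return albedo * light
    # Slow (default) path: unchanged original code.
    return albedo * light + specular_weight * (light ** 8)

print(shade_original(0.8, 0.0, 0.9), shade_versioned(0.8, 0.0, 0.9))
print(shade_original(0.8, 0.5, 0.9), shade_versioned(0.8, 0.5, 0.9))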

Read more
Hardware Architecture

AccSS3D: Accelerator for Spatially Sparse 3D DNNs

Semantic understanding and completion of real-world scenes is a foundational primitive of 3D visual perception, widely used in high-level applications such as robotics, medical imaging, autonomous driving, and navigation. Due to the curse of dimensionality, the compute and memory requirements for 3D scene understanding grow cubically with voxel resolution, posing a huge impediment to real-time, energy-efficient deployment. The inherent spatial sparsity of the 3D world, which arises from free space, is fundamentally different from the channel-wise sparsity that has been studied extensively. We present AccSS3D (Accelerator for Spatially Sparse 3D DNNs), the first end-to-end solution for accelerating 3D scene understanding by exploiting this ample spatial sparsity. As an algorithm-dataflow-architecture co-designed system specialized for spatially sparse 3D scene understanding, AccSS3D includes novel spatial-locality-aware metadata structures, a near-zero-latency, spatial-sparsity-aware dataflow optimizer, a surface-orientation-aware point-cloud reordering algorithm, and a co-designed hardware accelerator for spatial sparsity that exploits data reuse through systolic and multicast interconnects. The SSpNNA accelerator core, together with 64 KB of L1 memory, requires 0.92 mm2 of area in a 16 nm process at 1 GHz. Overall, AccSS3D achieves a 16.8x speedup and a 2232x energy-efficiency improvement for 3D sparse convolution compared to an Intel i7-8700K 4-core CPU, which translates to an 11.8x end-to-end 3D semantic segmentation speedup and a 24.8x energy-efficiency improvement (iso technology node).
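
The sketch below illustrates the opportunity being exploited, not AccSS3D's dataflow or hardware: a toy sparse 3D convolution visits only occupied voxels stored in a dictionary, so work scales with the number of surface voxels rather than with the dense voxel grid.

# Toy spatially sparse 3D convolution: only occupied voxels are stored and visited,
# so the work scales with surface voxels instead of resolution^3 dense positions.

def sparse_conv3d(voxels, kernel):
    # voxels: {(x, y, z): feature}; kernel: {(dx, dy, dz): weight} with small support.
    out = {}
    for (x, y, z), feat in voxels.items():
        for (dx, dy, dz), w in kernel.items():
            key = (x + dx, y + dy, z + dz)
            out[key] = out.get(key, 0.0) + w * feat
    return out

voxels = {(10, 10, 10): 1.0, (10, 10, 11): 2.0, (40, 7, 3): 0.5}     # sparse occupancy
kernel = {(dx, dy, dz): 1.0 / 27
          for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)}
print(len(sparse_conv3d(voxels, kernel)), "output voxels touched (vs 64^3 = 262144 dense)")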

Read more
Hardware Architecture

Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation

It has always been difficult to balance the accuracy and performance of instruction set simulators (ISSs). RTL simulators and systems such as gem5 execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU can run large benchmarks to completion in a reasonable time yet capture few performance metrics and fail to model the complex interactions between multiple cores. This paper presents a novel multi-purpose simulator that exploits binary translation to offer fast cycle-level full-system simulation. Its functional simulation mode outperforms QEMU and, if desired, it is possible to switch between functional and timing modes at run time. Cycle-level simulations of RISC-V multi-core processors are possible at more than 20 MIPS, a useful middle ground in terms of accuracy and performance, with simulation speeds nearly 100 times those of more detailed cycle-accurate models.
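
A conceptual sketch of mixing the two modes, with invented structure that is not this simulator's implementation: phases outside a region of interest run functionally, while the region of interest accumulates per-instruction timing.

# Hypothetical illustration of functional/timing mode switching at run time.

def simulate(program, roi_start, roi_end):
    cycles = 0
    for pc, (op, latency) in enumerate(program):
        mode = "timing" if roi_start <= pc < roi_end else "functional"
        if mode == "timing":
            cycles += latency    # detailed mode: account per-instruction latency
        # functional mode executes the instruction without accumulating timing detail
    return cycles

program = [("addi", 1), ("lw", 4), ("mul", 3), ("sw", 4), ("beq", 1)] * 3
print("ROI cycles:", simulate(program, roi_start=5, roi_end=10))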

Read more
Hardware Architecture

Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

Deep learning (DL) inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles is spent on processing them. Specifically, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference tasks are memory-bandwidth bound, contrary to common assumptions: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices from main memory, again requiring high bandwidth without offering reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory, which we call StepStone PIM. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU that would otherwise destroy locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, along with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates 12x better minimum latency than a CPU and 2.8x greater throughput under strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU (by up to 16x) and prior main-memory acceleration approaches (by up to 2.4x compared to the best prior approach).
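
A back-of-envelope calculation (the matrix sizes are illustrative, not from the paper) shows why small-batch MLP GEMMs are bandwidth bound: with batch size N and fp16 weights that see no reuse across queries, arithmetic intensity is only about N FLOPs per byte of weight traffic, far below what keeps compute units busy.

# Why small-batch GEMM is bandwidth bound: an MxK weight matrix is streamed from memory
# but amortized over only N output columns, so intensity ~ N FLOP per fp16 weight byte.

def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    flops = 2 * m * k * n                      # multiply-accumulate count
    weight_bytes = m * k * bytes_per_elem      # dominant traffic when weights see no reuse
    return flops / weight_bytes

for batch in (1, 4, 256):
    ai = arithmetic_intensity(m=4096, k=4096, n=batch)
    print(f"batch {batch:>3}: {ai:.0f} FLOP per byte of weight traffic")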

Read more
Hardware Architecture

Accelerating Bulk Bit-Wise X(N)OR Operation in Processing-in-DRAM Platform

With von Neumann computing architectures struggling to address today's computationally and memory-intensive big-data analytics tasks, Processing-in-Memory (PIM) platforms are gaining growing interest. Along these lines, processing-in-DRAM architectures have achieved remarkable success by dramatically reducing data-transfer energy and latency. However, the performance of such systems unavoidably diminishes when dealing with more complex applications that require bulk bit-wise X(N)OR or addition operations, despite utilizing the maximum internal DRAM bandwidth and in-memory parallelism. In this paper, we develop DRIM, a platform that harnesses DRAM as computational memory and transforms it into a fundamental processing unit. DRIM uses the analog operation of DRAM sub-arrays and elevates it to implement bit-wise X(N)OR operations between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such as sense amplifiers. Simulation results show that DRIM achieves on average 71x and 8.4x higher throughput for bulk bit-wise X(N)OR-based operations compared with a CPU and a GPU, respectively. In addition, DRIM outperforms recent processing-in-DRAM platforms with up to 3.7x better performance.
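
As a purely functional sketch (it models the bulk operation, not the analog dual-row activation that realizes it), the snippet below applies X(N)OR column-wise across two rows that share the same bit-lines, producing one result bit per bit-line in parallel.

# Functional model of a bulk bit-wise X(N)OR between two DRAM rows: every column
# (bit-line) produces one result bit, so the whole row is processed in one operation.

import numpy as np

rng = np.random.default_rng(0)
row_a = rng.integers(0, 2, size=8192, dtype=np.uint8)   # one DRAM row (8 Kb of cells)
row_b = rng.integers(0, 2, size=8192, dtype=np.uint8)   # second row on the same bit-lines

xor_row = row_a ^ row_b
xnor_row = xor_row ^ 1            # X(N)OR of the two operands, column-parallel

print("XOR ones:", int(xor_row.sum()), "| XNOR ones:", int(xnor_row.sum()))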

Read more
