Featured Research

Hardware Architecture

Appearances of the Birthday Paradox in High Performance Computing

We give an elementary statistical analysis of two High Performance Computing issues, processor cache mapping and network port mapping. In both cases we find that, as in the birthday paradox, random assignment leads to more frequent coincidences than one expects a priori. Since these correspond to contention for limited resources, this phenomenon has important consequences for performance.
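
As a rough illustration of the effect (the numbers below are ours, not the paper's), the following sketch computes the birthday-paradox probability that at least two of k items assigned independently and uniformly at random end up in the same one of n slots, whether those slots are cache sets or network ports.

```python
# Birthday-paradox bound: probability that k items assigned uniformly at random
# to n slots (cache sets, network ports, ...) produce at least one collision.
def collision_probability(k: int, n: int) -> float:
    p_no_collision = 1.0
    for i in range(k):
        p_no_collision *= (n - i) / n
    return 1.0 - p_no_collision

# Illustrative numbers only: 64 memory streams mapped onto 4096 cache sets
# already collide with probability of roughly 0.39.
print(collision_probability(64, 4096))
```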

Read more
Hardware Architecture

Applicability of Partial Ternary Full Adder in Ternary Arithmetic Units

This paper explores whether a complete ternary full adder, whose input variables can independently be '0', '1', or '2', is indispensable in arithmetic blocks such as adders, subtractors, and multipliers. Our investigations show that none of these arithmetic units requires a complete ternary full adder. Instead, they can be designed using a partial ternary full adder, whose input carry never becomes '2'. Furthermore, some new ternary compressors that do not require a complete ternary full adder are proposed in this paper. Using the partial ternary full adder can help circuit designers simplify their designs, especially at the transistor level.
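
A quick exhaustive check (our own sketch, independent of the paper's circuit designs) makes the key observation concrete: when two ternary digits are added with a carry-in restricted to {0, 1}, the carry-out also stays in {0, 1}, so a ripple chain never feeds a '2' into the next stage and a partial ternary full adder suffices.

```python
# Exhaustive check: adding two base-3 digits plus a carry-in of 0 or 1 never
# produces a carry-out of 2, so the next stage only needs a *partial* ternary
# full adder whose input carry is restricted to {0, 1}.
def ternary_full_adder(a: int, b: int, cin: int):
    total = a + b + cin
    return total % 3, total // 3        # (sum digit, carry-out)

carries_seen = {ternary_full_adder(a, b, cin)[1]
                for a in range(3) for b in range(3) for cin in range(2)}
print(carries_seen)                     # {0, 1}: the input carry never becomes '2'
```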

Read more
Hardware Architecture

ApproxFPGAs: Embracing ASIC-Based Approximate Arithmetic Components for FPGA-Based Systems

There has been abundant research on the development of Approximate Circuits (ACs) for ASICs. However, previous studies have shown that ASIC-based ACs offer asymmetrical gains when used in FPGA-based accelerators. Therefore, an AC that is Pareto-optimal for ASICs might not be Pareto-optimal for FPGAs. In this work, we present the ApproxFPGAs methodology, which uses machine learning models to reduce the exploration time required to analyze state-of-the-art ASIC-based ACs and determine the set of Pareto-optimal FPGA-based ACs. We also perform a case study to illustrate the benefits of deploying these Pareto-optimal FPGA-based ACs in a state-of-the-art automation framework to systematically generate Pareto-optimal approximate accelerators that can be deployed in FPGA-based systems to achieve high performance or low power consumption.
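
The Pareto-filtering step at the heart of such an exploration can be sketched as follows (a generic non-dominated filter with illustrative candidate names and numbers, not the paper's machine learning models):

```python
# Minimal Pareto-front filter: keep only approximate-circuit candidates for which
# no other candidate is at least as good in both FPGA cost and error, and strictly
# better in at least one. Candidate names and numbers are illustrative.
def pareto_front(candidates):
    front = []
    for name, cost, err in candidates:
        dominated = any(
            (c2 <= cost and e2 <= err) and (c2 < cost or e2 < err)
            for _, c2, e2 in candidates
        )
        if not dominated:
            front.append((name, cost, err))
    return front

candidates = [("add_exact", 100, 0.0), ("add_ax1", 70, 0.8), ("add_ax2", 90, 0.9)]
print(pareto_front(candidates))   # add_ax2 is dominated by add_ax1 and dropped
```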

Read more
Hardware Architecture

Approximate Logic Synthesis: A Reinforcement Learning-Based Technology Mapping Approach

Approximate Logic Synthesis (ALS) is the process of synthesizing and mapping a given Boolean network to a library of logic cells so that the magnitude/rate of error between the outputs of the approximate and the initial (exact) Boolean netlists is bounded from above by a predetermined total error threshold. In this paper, we present Q-ALS, a novel framework for ALS with a focus on the technology mapping phase. Q-ALS incorporates reinforcement learning and uses Boolean difference calculus to estimate the maximum error rate that each node of the given network can tolerate such that the total error rate at none of the outputs of the mapped netlist exceeds a predetermined maximum error rate, while the worst-case delay and the total area are minimized. The Maximum Hamming Distance (MHD) between the exact and approximate truth tables of the cuts of each node is used as the error metric. In Q-ALS, a Q-learning agent is trained over a sufficient number of iterations to select the fittest MHD value for each node, and, in a cut-based technology mapping approach, the best supergates (in terms of delay and area, bounded further by the fittest MHD) are selected to implement each node. Experimental results show that, with a required accuracy of 95% at the primary outputs, Q-ALS reduces the total cost in terms of area and delay by up to 70% and 36%, respectively, and also reduces the run-time by 2.21 times on average compared to the best state-of-the-art academic ALS tools.
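
To make the error metric concrete, the sketch below (our own, with illustrative truth tables) computes the Hamming distance between the exact and approximate truth tables of a cut, together with a textbook Q-learning update of the kind an agent could use to learn per-node MHD budgets:

```python
# Hamming distance between exact and approximate truth tables (as bit strings),
# plus a standard Q-learning update. Hyper-parameters are illustrative.
def hamming_distance(tt_exact: str, tt_approx: str) -> int:
    assert len(tt_exact) == len(tt_approx)
    return sum(a != b for a, b in zip(tt_exact, tt_approx))

# 3-input cut: 8-entry truth tables; one differing row gives distance 1.
print(hamming_distance("00010111", "00010110"))

def q_update(q, state, action, reward, next_q_max, alpha=0.1, gamma=0.9):
    """Standard Q-learning update rule applied to a (state, action) table."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * next_q_max - old)
    return q
```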

Read more
Hardware Architecture

ArSMART: An Improved SMART NoC Design Supporting Arbitrary-Turn Transmission

SMART NoC, which transmits unconflicted flits to distant processing elements (PEs) in one cycle through an express bypass, is a recently proposed high-performance NoC design. However, if contention occurs, flits with low priority are not only buffered but also cannot fully utilize the bypass. Although several routing algorithms exist that decrease contention by routing around busy routers and links, they are not directly applicable to SMART, since it lacks support for arbitrary-turn routing (i.e., routing in which the number and direction of turns are unconstrained). Thus, in this article, to minimize contention and make fuller use of the bypass, we propose an improved SMART NoC, called ArSMART, that enables arbitrary-turn transmission. Specifically, ArSMART divides the whole NoC into multiple clusters in which route computation is performed by a cluster controller and data forwarding is performed by bufferless reconfigurable routers. Since long-range transmission in SMART NoC needs to bypass intermediate arbitration, we enable this feature by directly configuring the connections between input and output ports rather than applying hop-by-hop table-based arbitration. To further exploit the communication capability, we propose effective adaptive routing algorithms that are compatible with ArSMART. The route computation overhead, one of the main concerns for adaptive routing algorithms, is hidden by our carefully designed control mechanism. Compared with the state-of-the-art SMART NoC, the experimental results demonstrate an average reduction of 40.7% in application schedule length and 29.7% in energy consumption.
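
The idea of replacing hop-by-hop table-based arbitration with direct port configuration can be illustrated as follows (our own sketch on a 2D mesh, with hypothetical port names, not the paper's hardware): given a precomputed arbitrary-turn path, each intermediate router is configured with a static input-to-output port connection.

```python
# Given a precomputed path of router coordinates, derive the static
# (input port -> output port) connection each intermediate router is configured
# with, so flits bypass per-hop arbitration. Port naming is illustrative.
def port_between(src, dst):
    """Direction of travel from router src to its mesh neighbour dst."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    return {(1, 0): "E", (-1, 0): "W", (0, 1): "N", (0, -1): "S"}[(dx, dy)]

def router_configs(path):
    configs = {}
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        in_port = port_between(cur, prev)    # port facing the upstream router
        out_port = port_between(cur, nxt)    # port facing the downstream router
        configs[cur] = (in_port, out_port)
    return configs

# An arbitrary-turn route with two turns: (0,0) -> (2,0) -> (2,2) -> (3,2).
path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2)]
print(router_configs(path))
```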

Read more
Hardware Architecture

Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI

In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x 256 double-precision matrix multiplication on sixteen lanes. Ara runs at more than 1 GHz in the typical corner (TT, 0.80 V, 25 °C), achieving a performance of up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP-GFLOPS/W under the same conditions, which is slightly better than similar vector processors found in the literature. An analysis of several vectorizable linear algebra kernels for a range of matrix and vector sizes gives insight into the performance limitations and bottlenecks of vector processors, and outlines directions for maintaining high energy efficiency even for small matrix sizes, where the vector architecture achieves suboptimal utilization of the available FPUs.
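
A back-of-the-envelope check relates the headline figure to the configuration: assuming one double-precision FMA (counted as two FLOPs) per lane per cycle at roughly 1.04 GHz, an assumption of ours rather than a figure from the paper, sixteen lanes yield about 33 DP-GFLOPS.

```python
# Peak throughput estimate: lanes * FMAs-per-lane * 2 FLOPs-per-FMA * clock (GHz).
# The one-FMA-per-lane-per-cycle figure and the 1.04 GHz clock are illustrative.
def peak_dp_gflops(lanes: int, fmas_per_lane: int, freq_ghz: float) -> float:
    return lanes * fmas_per_lane * 2 * freq_ghz

print(peak_dp_gflops(lanes=16, fmas_per_lane=1, freq_ghz=1.04))   # ~33 DP-GFLOPS
```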

Read more
Hardware Architecture

Architectural Analysis of FPGA Technology Impact

The use of high-level languages for designing hardware is gaining popularity since they increase design productivity by providing higher abstractions. However, one drawback of such an abstraction level has been the difficulty of relating low-level implementation problems back to the original high-level design, which is paramount for architectural optimization. In this work (developed between April 2013 and April 2014), we propose a methodology to analyze the effects of technology on the architecture and to generate architectural-level area, delay, and power metrics. Such feedback allows the designer to quickly gauge the impact of architectural decisions on the quality of the generated hardware and opens the door to automatic architectural analysis. We demonstrate the use of our technique on three FPGA platforms using two designs: a Reed-Solomon error-correction decoder and a 32-bit pipelined processor implementation.

Read more
Hardware Architecture

Architectural Implications of Graph Neural Networks

Graph neural networks (GNNs) represent an emerging line of deep learning models that operate on graph structures. They are becoming increasingly popular due to the high accuracy they achieve in many graph-related tasks. However, GNNs are not as well understood in the systems and architecture community as their counterparts such as multi-layer perceptrons and convolutional neural networks. This work introduces GNNs to our community. In contrast to prior work that only characterizes GCNs, our work covers a large portion of the variety of GNN workloads based on a general GNN description framework. By constructing the models on top of two widely used libraries, we characterize GNN computation at the inference stage with respect to general-purpose and application-specific architectures, and we hope our work can foster more systems and architecture research on GNNs.
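
The computation pattern that such characterizations focus on can be sketched as a single aggregate-and-combine layer (a generic sketch with illustrative shapes, not the paper's description framework):

```python
import numpy as np

# One message-passing layer in the common aggregate/combine form: each node
# averages its own and its neighbours' features (aggregation), then applies a
# shared linear transform followed by ReLU (combination).
def gnn_layer(h, neighbors, w):
    new_h = np.zeros((h.shape[0], w.shape[0]))
    for v, nbrs in enumerate(neighbors):
        agg = h[[v] + nbrs].mean(axis=0)       # aggregate over {v} and N(v)
        new_h[v] = np.maximum(w @ agg, 0.0)    # combine + ReLU
    return new_h

# Tiny 3-node path graph with 4-dim features and a 4 -> 2 transform (illustrative).
h = np.random.rand(3, 4)
neighbors = [[1], [0, 2], [1]]
w = np.random.rand(2, 4)
print(gnn_layer(h, neighbors, w).shape)        # (3, 2)
```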

Read more
Hardware Architecture

Architecture Support for FPGA Multi-tenancy in the Cloud

Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and a low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex UltraScale+ device demonstrated a near-specification maximum frequency for on-chip data movement and high throughput for virtual-instance access to hardware accelerators. We demonstrate performance similar to a single-tenant deployment while increasing FPGA utilization (6x higher FPGA utilization in our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about 2x higher maximum frequency than the state of the art and a bandwidth of 25.6 Gbps.

Read more
Hardware Architecture

Architecture, Dataflow and Physical Design Implications of 3D-ICs for DNN-Accelerators

The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel computing architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. Therefore, we analyze the dataflows, performance, area, power, and temperature of such 3D DNN accelerators. Monolithic and TSV-based stacked 3D-ICs are compared against 2D-ICs. We identify workload properties and architectural parameters for efficient 3D-ICs and achieve a speedup of up to 9.14x for 3D over 2D. We discuss area-performance trade-offs and demonstrate applicability, as the 3D-IC draws power similar to 2D-ICs and is not thermally limited.

Read more
