Featured Researches

Hardware Architecture

3D IC optimal layout design. A parallel and distributed topological approach

The task of 3D ICs layout design involves the assembly of millions of components taking into account many different requirements and constraints such as topological, wiring or manufacturability ones. It is a NP-hard problem that requires new non-deterministic and heuristic algorithms. Considering the time complexity, the commonly applied Fiduccia-Mattheyses partitioning algorithm is superior to any other local search method. Nevertheless, it can often miss to reach a quasi-optimal solution in 3D spaces. The presented approach uses an original 3D layout graph partitioning heuristics implemented with use of the extremal optimization method. The goal is to minimize the total wire-length in the chip. In order to improve the time complexity a parallel and distributed Java implementation is applied. Inside one Java Virtual Machine separate optimization algorithms are executed by independent threads. The work may also be shared among different machines by means of The Java Remote Method Invocation system.

Read more
Hardware Architecture

4-Bit High-Speed Binary Ling Adder

Binary addition is one of the most primitive and most commonly used applications in computer arithmetic. A large variety of algorithms and implementations have been proposed for binary addition. Huey Ling proposed a simpler form of CLA equations which rely on adjacent pair bits. Along with bit generate and bit propagate, we introduce another prefix bit, the half sum bit. Ling adder increases the speed of n-bit binary addition, which is an upgrade from the existing Carry-Look-Ahead adder. Several variants of the carry look-ahead equations, like Ling carries, have been presented that simplify carry computation and can lead to faster structures. Ling adders, make use of Ling carry and propagate bits, in order to calculate the sum bit. As a result, dependency on the previous bit addition is reduced; that is, ripple effect is lowered. This paper provides a comparative study on the implementation of the above mentioned high-speed adders.

Read more
Hardware Architecture

A 68 uW 31 kS/s Fully-Capacitive Noise-Shaping SAR ADC with 102 dB SNDR

This paper presents a 17 bit analogue-to-digital converter that incorporates mismatch and quantisation noise-shaping techniques into an energy-saving 10 bit successive approximation quantiser to increase the dynamic range by another 42 dB. We propose a novel fully-capacitive topology which allows for high-speed asynchronous conversion together with a background calibration scheme to reduce the oversampling requirement by 10x compared to prior-art. A 0.18 um CMOS technology is used to demonstrate preliminary simulation results together with analytic measures that optimise parameter and topology selection. The proposed system is able to achieve a FoMS of 183 dB for a maximum signal bandwidth of 15.6 kHz while dissipating 68 uW from a 1.8 V supply. A peak SNDR of 102 dB is demonstrated for this rate with a 0.201 mm^2 area requirement.

Read more
Hardware Architecture

A Case for Reversible Coherence Protocol

We propose the first Reversible Coherence Protocol (RCP), a new protocol designed from ground up that enables invisible speculative load. RCP takes a bold approach by including the speculative loads and merge/purge operation in the interface between processor and cache coherence, and allowing them to participate in the coherence protocol. It means, speculative load, ordinary load/store, and merge/purge can all affect the state of a given cache line. RCP is the first coherence protocol that enables the commit and squash of the speculative load among distributed cache components in a general memory hierarchy. RCP incurs an average slowdown of (3.0%, 7.4%) on (SPEC,PARSEC), compared to (26.5%,18.3%) in InvisiSpec and (3.2%,24.2%) in CleanupSpec. The coherence traffic overhead is on average 46%, compared to 40% and 27% of InvisiSpec and CleanupSpec, respectively. Even with higher traffic overhead (~ 46%), the performance overhead of RCP is lower than InvisiSpec and comparable to CleanupSpec. It reveals a key advantage of RCP: the coherence actions triggered by the merge and purge operations are not in the critical path of the execution and can be performed in the cache hierarchy concurrently with processor execution.

Read more
Hardware Architecture

A Case for Superconducting Accelerators

As the scaling of conventional CMOS-based technologies slows down, there is growing interest in alternative technologies that can improve performance or energy-efficiency. Superconducting circuits based on Josephson Junction (JJ) is an emerging technology that can provide devices which can be switched with pico-second latencies and consuming two orders of magnitude lower switching energy compared to CMOS. While JJ-based circuits can provide high operating frequency and energy-efficiency, this technology faces three critical challenges: limited device density and lack of area-efficient technology for memory structures, reduced gate fanout compared to CMOS, and new failure modes of Flux-Traps that occurs due to the operating environment. The lack of dense memory technology restricts the use of superconducting technology in the near term to application domains that have high compute intensity but require negligible amount of memory. In this paper, we study the use of superconducting technology to build an accelerator for SHA-256 engines commonly used in Bitcoin mining applications. We show that merely porting existing CMOS-based accelerator to superconducting technology provides 10.6X improvement in energy efficiency. Redesigning the accelerator to suit the unique constraints of superconducting technology (such as low fanout) improves the energy efficiency to 12.2X. We also investigate solutions to make the accelerator tolerant of new fault modes and show how this fault-tolerant design can be leveraged to reduce the operating current, thereby increasing the overall energy-efficiency to 46X compared to CMOS. Our paper also develops a workflow for evaluating area, performance, and power for accelerators built in superconducting technology, and this workflow can help other researchers explore designs using this technology.

Read more
Hardware Architecture

A Comparative Study between HLS and HDL on SoC for Image Processing Applications

The increasing complexity in today's systems and the limited market times demand new development tools for FPGA. Currently, in addition to traditional hardware description languages (HDLs), there are high-level synthesis (HLS) tools that increase the abstraction level in system development. Despite the greater simplicity of design and testing, HLS has some drawbacks in describing harware. This paper presents a comparative study between HLS and HDL for FPGA, using a Sobel filter as a case study in the image processing field. The results show that the HDL implementation is slightly better than the HLS version considering resource usage and response time. However, the programming effort required in the HDL solution is significantly larger than in the HLS counterpart.

Read more
Hardware Architecture

A Compiler Infrastructure for FPGA and ASIC Development

This whitepaper proposes a unified framework for hardware design tools to ease the development and inter-operability of said tools. By creating a large ecosystem of hardware development tools across vendors, academia, and the open source community, we hope to significantly increase much need productivity in hardware design.

Read more
Hardware Architecture

A Custom 7nm CMOS Standard Cell Library for Implementing TNN-based Neuromorphic Processors

A set of highly-optimized custom macro extensions is developed for a 7nm CMOS cell library for implementing Temporal Neural Networks (TNNs) that can mimic brain-like sensory processing with extreme energy efficiency. A TNN prototype (13,750 neurons and 315,000 synapses) for MNIST requires only 1.56mm2 die area and consumes only 1.69mW.

Read more
Hardware Architecture

A Fast Finite Field Multiplier for SIKE

Various post-quantum cryptography algorithms have been recently proposed. Supersingluar isogeny Diffie-Hellman key exchange (SIKE) is one of the most promising candidates due to its small key size. However, the SIKE scheme requires numerous finite field multiplications for its isogeny computation, and hence suffers from slow encryption and decryption process. In this paper, we propose a fast finite field multiplier design that performs multiplications in GF(p) with high throughput and low latency. The design accelerates the computation by adopting deep pipelining, and achieves high hardware utilization through data interleaving. The proposed finite field multiplier demonstrates 4.48 times higher throughput than prior work based on the identical fast multiplication algorithm and 1.43 times higher throughput than the state-of-the-art fast finite field multiplier design aimed at SIKE.

Read more
Hardware Architecture

A GPU Register File using Static Data Compression

GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek for new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands designed to leverage on advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6%, on average, at a modest output-quality degradation.

Read more

Ready to get started?

Join us today