Featured Research

Hardware Architecture

An Investigation on Inherent Robustness of Posit Data Representation

As the dimensions and operating voltages of computer electronics shrink to meet consumer demand for higher performance and lower power consumption, circuit sensitivity to soft errors increases dramatically. Recently, a new data type called posit has been proposed in the literature. Posit arithmetic offers advantages over IEEE 754-2008 compliant arithmetic, such as higher numerical accuracy, higher speed, and simpler hardware design. In this paper, we present a comparative robustness study of 32-bit posit and 32-bit IEEE 754-2008 compliant representations. First, we present a theoretical analysis of IEEE 754 compliant numbers and posit numbers under single and double bit flips. Then, we conduct exhaustive fault injection experiments that show considerable inherent resilience in the posit format compared to the classical IEEE 754 compliant representation. To show a relevant use case for fault-tolerant applications, we perform experiments on a set of machine-learning applications. In more than 95% of the exhaustive fault injection exploration, the posit representation is less impacted by faults than the IEEE 754 compliant floating-point representation. Moreover, in 100% of the tested machine-learning applications, the accuracy of posit-implemented systems is higher than that of the classical floating-point-based ones.
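To make the single-bit-flip fault model concrete on the IEEE 754 side, the following is a minimal Python sketch, not the authors' fault injection framework. The function names `flip_bit_float32` and `single_bit_flip_errors` are hypothetical, and the sketch covers only the 32-bit IEEE 754 encoding; a posit counterpart would need a posit codec, which is not shown here.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value` (rounded to float32) with one bit of its encoding flipped.

    Bit 0 is the least significant fraction bit; bit 31 is the sign bit.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit index must be in 0..31")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the selected bit and reinterpret the result as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


def single_bit_flip_errors(value: float) -> list[tuple[int, float, float]]:
    """List (bit, faulty value, relative error) for all 32 single-bit flips.

    Assumes `value` is nonzero so that the relative error is defined.
    """
    if value == 0.0:
        raise ValueError("value must be nonzero")
    results = []
    for bit in range(32):
        faulty = flip_bit_float32(value, bit)
        results.append((bit, faulty, abs(faulty - value) / abs(value)))
    return results


if __name__ == "__main__":
    for bit, faulty, rel_err in single_bit_flip_errors(1.0):
        print(f"bit {bit:2d}: {faulty!r:>24} relative error {rel_err:.3e}")
```

For the value 1.0, a flip in the low fraction bits changes the result only slightly, while a flip of the top exponent bit (bit 30) turns it into infinity. This position-dependent sensitivity is the kind of behavior a robustness comparison between number formats has to quantify.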

An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication

On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.

An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications

The conventional von Neumann architecture has proven to be a major performance and energy bottleneck for rising data-intensive applications. The decade-old idea of leveraging in-memory processing to eliminate substantial data movement has returned and led to extensive research activity. The effectiveness of in-memory processing relies heavily on memory scalability, which traditional memory technologies cannot provide. Emerging non-volatile memories (eNVMs), on the other hand, possess appealing qualities such as excellent scaling and low energy consumption, and have been heavily investigated for realizing in-memory processing architectures. In this paper, we summarize recent research progress in eNVM-based in-memory processing from various aspects, including the adopted memory technologies, the location of in-memory processing in the system, the supported arithmetic operations, and the target applications.

An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdhva-Tiryagbhyam algorithm

Floating point multiplication is a crucial operation in high power computing applications such as image processing and signal processing. Multiplication is also the most time- and power-consuming operation. This paper proposes an efficient method for IEEE 754 floating point multiplication that gives a better implementation in terms of delay and power. A combination of the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm (Vedic Mathematics) is used to implement an unsigned binary multiplier for mantissa multiplication. The multiplier is implemented in Verilog HDL and targeted at Spartan-3E and Virtex-4 FPGAs.
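The Karatsuba step mentioned above can be sketched in software as follows. This is an illustrative Python version, not the paper's Verilog design: the function name `karatsuba`, the `threshold_bits` parameter, and the use of plain multiplication in the base case are assumptions made for the example. How the paper combines Karatsuba with Urdhva-Tiryagbhyam is not specified in the abstract.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 8) -> int:
    """Multiply two non-negative integers using Karatsuba's recursion.

    Each level replaces one wide multiplication with three multiplications
    of roughly half the width, plus shifts and additions.
    """
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")
    if threshold_bits < 4:
        raise ValueError("threshold_bits must be at least 4")

    n = max(x.bit_length(), y.bit_length())
    if n <= threshold_bits:
        # Base case (assumption): plain multiplication stands in for
        # whatever small multiplier a hardware design would use here.
        return x * y

    half = n // 2
    mask = (1 << half) - 1
    # Split each operand into a high and a low part: x = x_hi * 2**half + x_lo.
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    hi = karatsuba(x_hi, y_hi, threshold_bits)
    lo = karatsuba(x_lo, y_lo, threshold_bits)
    # The cross term comes from one extra multiplication instead of two.
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, threshold_bits) - hi - lo

    return (hi << (2 * half)) + (mid << half) + lo


if __name__ == "__main__":
    # Two 24-bit significands, as in IEEE 754 single precision
    # (23 stored fraction bits plus the implicit leading 1).
    a, b = 0xC00000, 0xA00000
    print(hex(karatsuba(a, b)))
```

The product of two 24-bit significands is up to 48 bits wide. A floating point multiplier would then normalize and round that product, which this sketch leaves out.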

Analysis and Design of a 32nm FinFET Dynamic Latch Comparator

Comparators have many applications in various fields and are used especially in analog-to-digital converters. Over the years, many different designs of single-stage, dynamic latch, and double-tail comparators based on CMOS technology have appeared, and all of them have had to trade off power consumption against delay time. Meanwhile, to mitigate the short-channel effects of conventional CMOS-based designs, FinFET has emerged as the most promising alternative owing to its strong gate control over the channel region. In this paper, we analyze the performance of some recent dynamic latch comparators and propose a new dynamic latch comparator structure; 32nm FinFET technology is used as the common platform for all of the comparator circuit designs. In LTspice simulations, the proposed comparator shows impressive performance in terms of power consumption, time delay, power-delay product, and offset voltage when compared with the other recent comparators.

Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device

Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integration between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have unintended effects on the achieved bandwidth depending on the application and its data access patterns. To provide the most efficient communication between CPUs and accelerators, it is essential to understand the data transaction behavior and select the right I/O cache coherence method. In this paper, we use the Xilinx Zynq UltraScale+ as the SoC platform to show how a given I/O cache coherence method can perform better or worse in different situations, ultimately affecting overall accelerator performance as well. Based on our analysis, we further explore possible software and hardware modifications to improve I/O performance with different I/O cache coherence options. With our proposed modifications, the overall performance of the SoC design can be improved by 20% on average.

Analysis of Energy Consumption in a Precision Beekeeping System

Honey bees have been domesticated by humans for several thousand years and mainly provide honey and pollination, which is fundamental for plant reproduction. Nowadays, the work of beekeepers is constrained by external factors that stress their production (parasites and pesticides, among others). Taking care of large numbers of beehives is time-consuming, so integrating sensors to track their status can drastically simplify the work of beekeepers. Precision beekeeping complements beekeepers' work thanks to Internet of Things (IoT) technology. If used correctly, data can help to make the right diagnosis for honey bee colonies, increase honey production, and decrease bee mortality. Providing enough energy for on-hive and in-hive sensors is a challenge. Some solutions rely on energy harvesting; others rely on large batteries. Either way, it is mandatory to analyze the energy usage of embedded equipment in order to design an energy-efficient and autonomous bee monitoring system. This paper relies on a fully autonomous IoT framework that collects environmental and image data from a beehive. It consists of a data-collecting node (environmental data sensors, camera, Raspberry Pi, and Arduino) and a solar energy supply node. Supported services are analyzed task by task from an energy profiling and efficiency standpoint, in order to identify the most heavily loaded parts of the framework. This first step will guide our goal of designing a sustainable precision beekeeping system, both technically and energy-wise.

Analytical Model of Memory-Bound Applications Compiled with High Level Synthesis

The increasing demand for dedicated accelerators to improve energy efficiency and performance has highlighted FPGAs as a promising option to deliver both. However, programming FPGAs in hardware description languages requires considerable time and effort to achieve optimal results, which discourages many programmers from adopting this technology. High Level Synthesis (HLS) tools make FPGAs more accessible, but the optimization process is still time-consuming because of the long compilation time, between minutes and days, required to generate a single bitstream. Whereas placing and routing take most of this time, the RTL pipeline and memory organization are known within seconds. This early information about the organization of the upcoming bitstream is enough to provide an accurate and fast performance model. This paper presents an analytical performance model for HLS designs focused on memory-bound applications. With a careful analysis of the generated memory architecture and DRAM organization, the model predicts the execution time with a maximum error of 9.2% for a set of representative applications. Compared with previous works, our predictions reduce the estimation error by at least 2× on average.

Analytical Modeling the Multi-Core Shared Cache Behavior with Considerations of Data-Sharing and Coherence

To mitigate the ever-worsening "Power wall" and "Memory wall" problems, multi-core architectures with multilevel cache hierarchies have been widely adopted in modern processors. However, the complexity of these architectures makes modeling shared caches extremely difficult. In this paper, we propose a data-sharing-aware analytical model for estimating the miss rates of the downstream shared cache under multi-core scenarios. Moreover, the proposed model can be integrated with upstream cache analytical models that account for multi-core private cache coherence effects. This integration avoids the time-consuming full simulations of the cache architecture that are required by conventional approaches. We validate our analytical model against gem5 simulation results using 13 applications from the PARSEC 2.1 benchmark suite. Compared to the results from gem5 simulations under 8 hardware configurations, including dual-core and quad-core architectures, the average absolute error of the predicted shared L2 cache miss rates is less than 2% for all configurations. After integration with the refined upstream model with coherence misses, the overall average absolute error in 4 hardware configurations rises to 8.03% due to error accumulation. The proposed coherence model achieves accuracy similar to that of the state-of-the-art approach with only one tenth of the time overhead. As an application case of the integrated model, we also evaluate the miss rates of 57 different multi-core and multi-level cache configurations.

Analytical models of Energy and Throughput for Caches in MPSoCs

General trends in computer architecture are shifting towards parallelism. Multicore architectures have proven to be a major step in processor evolution. With the advancement of multicore architectures, researchers are focusing on finding ways to fully utilize the power of multiple cores. With an ever-increasing number of cores on a chip, the role of cache memory has become pivotal. An ideal memory configuration would be both large and fast; in practice, however, system architects have to strike a balance between the size and access time of the memory hierarchy. It is important to know the impact of a particular cache configuration on the throughput and energy consumption of the system at design time. This paper presents an enhanced version of previously proposed cache energy and throughput models for multicore systems. These models use a significantly smaller number of input parameters than other models. This paper also validates the proposed models using a cycle-accurate simulator and a well-known processor power estimator. The results show that the proposed energy models are accurate to within a maximum error of 10% for single-core processors and around 5% for MPSoCs, and the throughput models have a maximum error of up to 11.5% for both single-core and multicore architectures.
