
Publication


Featured research published by Michael Gautschi.


IEEE Transactions on Circuits and Systems I: Regular Papers | 2017

An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

Francesco Conti; Robert Schilling; Pasquale Davide Schiavone; Antonio Pullini; Davide Rossi; Frank K. Gürkaynak; Michael Muehlberghuber; Michael Gautschi; Igor Loi; Germain Haugou; Stefan Mangard; Luca Benini

Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load, but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, while supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65-nm technology, consumes less than 20mW on average at 0.8V, achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16pJ per equivalent reduced-instruction-set-computer operation, local CNN-based face detection with secured remote recognition at 5.74pJ/op, and seizure detection with encrypted data collection from an electroencephalogram at 12.7pJ/op.
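The per-operation energy figures above translate directly into the throughput sustainable inside the stated power budget. A minimal sketch of that conversion (the figures are taken from the abstract; the function name is illustrative):

```python
# Throughput achievable at a given power budget and energy per operation:
# P = E_per_op * ops_per_second, so ops_per_second = P / E_per_op.

def ops_per_second(power_w: float, energy_per_op_j: float) -> float:
    """Operations per second sustainable at a given power and energy/op."""
    return power_w / energy_per_op_j

# Figures from the abstract: 20 mW average power, 3.16 pJ per
# RISC-equivalent operation in the CNN surveillance use case.
power = 20e-3      # W
e_cnn = 3.16e-12   # J/op
print(f"{ops_per_second(power, e_cnn) / 1e9:.2f} GOPS")  # ≈ 6.33 GOPS
```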


IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC) | 2015

Tailoring instruction-set extensions for an ultra-low power tightly-coupled cluster of OpenRISC cores

Michael Gautschi; Andreas Traber; Antonio Pullini; Luca Benini; Michele Scandale; Alessandro Di Federico; Michele Beretta; Giovanni Agosta

Baseline RISC instruction sets for ultra-low power processors are constantly being tuned to reduce cycle count when executing computation-intensive applications. Performance improvements often come at a non-negligible price in terms of area and critical path length and imply deeper pipelines and complex memory interfaces. This penalizes control-intensive code execution and significantly increases cost and complexity of building multi-core clusters. In addition, some extensions are not easily exploited by compilers and may increase code development effort, especially when considering parallel applications. In this paper we describe our efforts in enhancing a baseline open ISA (OpenRISC) and its LLVM compiler back-end to significantly reduce execution cycles while minimizing the impact on core micro-architecture complexity, number of pipeline stages, area and power. In addition, we improved the core micro-architecture to streamline its integration in a tightly-coupled cluster, sharing instruction cache and data memory, thereby further enhancing parallel execution efficiency. The combined effect of ISA, compiler and micro-architecture evolution gives an average energy efficiency boost of 59% on vector intensive code and 41% otherwise, at an area and power increase of 2.3% and 18% on a four-core processor cluster.


IEEE International Solid-State Circuits Conference (ISSCC) | 2016

4.6 A 65nm CMOS 6.4-to-29.2pJ/FLOP@0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster

Michael Gautschi; Michael Schaffner; Frank K. Gürkaynak; Luca Benini

Energy-efficient computing and ultra-low-power operation are requirements for many application areas, such as IoT and wearables. While for some applications, integer and fixed-point processor instructions suffice, others (e.g. simultaneous localization and mapping - SLAM, stereo vision, nonlinear regression and classification) require a larger dynamic range, typically obtained using single/double-precision floating point (FP) instructions. Logarithmic number systems (LNS) have been proposed [1,2] as an energy-efficient alternative to conventional FP, as several complex operations such as MUL, DIV, and EXP translate into simpler arithmetic operations in the logarithmic space and can be efficiently calculated using integer arithmetic units. However, ADD and SUB become nonlinear and have to be approximated by look-up tables (LUTs) and interpolation, which is typically implemented in a dedicated LNS unit (LNU) [1,2]. The area of LNUs grows exponentially with the desired precision, and an LNU with accuracy comparable to IEEE single-precision format is larger than a traditional floating-point unit (FPU). However, we show that in multi-core systems optimized for ultra-low-power operation such as the PULP system [3], one LNU can be efficiently shared in a cluster as indicated in Fig. 4.6.1. This arrangement not only reduces the per-core area overhead, but more importantly, allows several costly operations such as FP MUL/DIV to be processed without contention within the integer cores without additional overhead. We show that for typical nonlinear processing tasks, our LNU design can be up to 4.2× more energy efficient than a private-FP design.
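The trade-off the abstract describes, with MUL/DIV becoming cheap and ADD/SUB becoming nonlinear, can be illustrated in a few lines. This is an illustrative sketch of logarithmic-number-system arithmetic in floating point, not the LNU's actual fixed-point LUT-based implementation:

```python
import math

# In an LNS, a positive value x is stored as log2(x). Multiplication and
# division then reduce to addition and subtraction of the stored logs.

def lns_mul(a_log: float, b_log: float) -> float:
    return a_log + b_log   # log2(a*b) = log2(a) + log2(b)

def lns_div(a_log: float, b_log: float) -> float:
    return a_log - b_log   # log2(a/b) = log2(a) - log2(b)

# Addition is nonlinear: log2(a+b) = a_log + log2(1 + 2**(b_log - a_log)).
# Hardware LNUs approximate the correction term with LUTs plus
# interpolation; here we simply evaluate it directly.
def lns_add(a_log: float, b_log: float) -> float:
    a_log, b_log = max(a_log, b_log), min(a_log, b_log)
    return a_log + math.log2(1.0 + 2.0 ** (b_log - a_log))

x, y = math.log2(3.0), math.log2(5.0)
print(round(2.0 ** lns_mul(x, y), 6))  # 15.0  (3 * 5)
print(round(2.0 ** lns_add(x, y), 6))  # 8.0   (3 + 5)
```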


Great Lakes Symposium on VLSI (GLSVLSI) | 2014

Customizing an open source processor to fit in an ultra-low power cluster with a shared L1 memory

Michael Gautschi; Davide Rossi; Luca Benini

The OpenRISC processor core, featuring a flat pipeline and a low area footprint, has been integrated in a multi-core ultra-low power (ULP) cluster with a shared multi-banked memory to exploit parallelism in the near-threshold regime. The micro-architecture has been optimized to support a shared L1 memory and to achieve a high value of instructions per cycle (IPC) per core. The proposed architecture achieves IPC results in the range of 0.88 to 1 across a set of benchmark applications, an improvement of up to 83% with respect to the original OpenRISC implementation. Implemented in 28nm FDSOI technology, the proposed design achieves 177 MOPS when supplied at a near-threshold voltage of 0.6V. The energy efficiency at this workload is 90.07 MOPS/mW, a 50% improvement over what can be achieved with an OpenRISC cluster based on the original micro-architecture.


IEEE Symposium on Low-Power and High-Speed Chips (COOL CHIPS XIX) | 2016

193 MOPS/mW @ 162 MOPS, 0.32V to 1.15V voltage range multi-core accelerator for energy efficient parallel and sequential digital processing

Davide Rossi; Antonio Pullini; Igor Loi; Michael Gautschi; Frank K. Gürkaynak; Adam Teman; Jeremy Constantin; Andreas Burg; Ivan Miro-Panades; Edith Beigne; Fabien Clermidy; Philippe Flatresse; Luca Benini

Low power (mW) and high performance (GOPS) are strong requirements for compute-intensive signal processing in E-health, Internet-of-Things, and wearable applications. This work presents a building block for programmable Ultra-Low Power accelerators, namely a tightly-coupled computing cluster that supports parallel and sequential execution at high energy efficiency over a wide range of workload requirements. The cluster, implemented in 28nm UTBB FD-SOI technology, achieves peak energy efficiency in the near-threshold (NVT) operating region: 193 MOPS/mW at 162 MOPS for parallel workloads, and 90 MOPS/mW at 68 MOPS for sequential workloads at 0.46V and 0.5V, respectively. The energy efficient operating range is wide (0.32V to 1.15V), also meeting the design goal of 1 GOPS within a 10 mW power envelope (at 0.66V).


European Solid-State Circuits Conference (ESSCIRC) | 2016

Approximate 32-bit floating-point unit design with 53% power-area product reduction

Vincent Camus; Jeremy Schlachter; Christian Enz; Michael Gautschi; Frank K. Gürkaynak

The floating-point unit is one of the most common building blocks in any computing system and is used in a wide range of applications. By combining two state-of-the-art techniques of imprecise hardware, namely Gate-Level Pruning and the Inexact Speculative Adder, and by introducing a novel Inexact Speculative Multiplier architecture, three different approximate FPUs and one reference IEEE-754 compliant FPU have been integrated in a 65 nm CMOS process within a low-power multi-core processor. Silicon measurements show up to 27% power, 36% area, and 53% power-area product savings compared to the IEEE-754 single-precision FPU. Accuracy loss has been evaluated with a high-dynamic-range image tone-mapping algorithm, resulting in errors that are small and visually imperceptible, with an image PSNR of 90 dB.
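PSNR, the metric used above to quantify the accuracy loss, is straightforward to compute. A minimal sketch (the 90 dB figure comes from the paper's tone-mapping experiment, not from this toy example; the function name and data are illustrative):

```python
import math

def psnr(reference, approximate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical signals: infinite PSNR
    return 10.0 * math.log10(peak ** 2 / mse)

# Tiny synthetic example: an approximate FPU perturbing pixels slightly.
ref = [10.0, 128.0, 200.0, 255.0]
approx = [10.01, 127.99, 200.02, 255.0]
print(f"{psnr(ref, approx):.1f} dB")  # ≈ 86.4 dB
```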


IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S) | 2015

A −1.8V to 0.9V body bias, 60 GOPS/W 4-core cluster in low-power 28nm UTBB FD-SOI technology

Davide Rossi; Antonio Pullini; Michael Gautschi; Igor Loi; Frank K. Gürkaynak; Philippe Flatresse; Luca Benini

A 4-core cluster fabricated in low-power 28nm UTBB FD-SOI conventional-well technology is presented. The SoC architecture enables the processors to operate “on-demand” over a wide supply-voltage range of 0.44V (1.8MHz) to 1.2V (475MHz) and a wide body-bias range of −1.2V to 0.9V, achieving a peak energy efficiency of 60 GOPS/W (419μW, 6.4MHz) at 0.5V with 0.5V forward body bias. The proposed SoC's energy efficiency is 1.4× to 3.7× greater than that of other low-power processors with comparable performance.


ACM International Conference on Computing Frontiers | 2015

Exploring multi-banked shared-L1 program cache on ultra-low power, tightly coupled processor clusters

Igor Loi; Davide Rossi; Germain Haugou; Michael Gautschi; Luca Benini

L1 instruction caches in many-core systems represent a sizable fraction of the total power consumption. Although large instruction caches can significantly improve performance, they also tend to increase power consumption. Private caches usually achieve higher speed thanks to their simpler design, but the smaller L1 memory space seen by each core leads to a high miss ratio. A shared instruction cache is therefore an attractive solution to improve performance and energy efficiency while reducing area. In this paper we propose a multi-banked, shared instruction cache architecture suitable for ultra-low-power multicore systems, where parallelism and near-threshold operation are used to achieve minimum energy. We implemented the cluster architecture with different cache-sharing configurations, using 28nm UTBB FD-SOI from STMicroelectronics as the reference technology. Experimental results, based on several real-life applications, demonstrate that the sharing mechanisms have no impact on the system operating frequency and reduce the energy consumption of the cache subsystem by up to 10% at the same area footprint, or halve the overall cache area at the same performance and energy efficiency, with respect to a cluster of processing elements with private program caches.


IEEE International Symposium on Circuits and Systems (ISCAS) | 2016

A heterogeneous multi-core system-on-chip for energy efficient brain inspired vision

Antonio Pullini; Francesco Conti; Davide Rossi; Igor Loi; Michael Gautschi; Luca Benini

Computer vision (CV) based on Convolutional Neural Networks (CNNs) is a rapidly developing field thanks to CNNs' flexibility, strong generalization capability, and classification accuracy (matching and sometimes exceeding human performance). CNN-based classifiers are typically deployed on servers or high-end embedded platforms. However, their ability to “compress” low-information-density data such as images into highly informative classification tags makes them extremely interesting for wearable and IoT scenarios, should it be possible to fit their computational requirements within deeply embedded devices such as visual sensor nodes. We propose a 65nm system-on-chip implementing a hybrid HW/SW CNN accelerator that fits this tight energy budget. The SoC integrates a near-threshold parallel processor cluster [1] and a hardware accelerator for convolution-accumulation operations [2], which constitute the basic kernel of CNNs: it achieves a peak performance of 11.2 GMAC/s @ 1.2V and a peak energy efficiency of 261 GMAC/s/W @ 0.65V.


IEEE Transactions on Circuits and Systems II: Express Briefs | 2018

A Heterogeneous Multicore System on Chip for Energy Efficient Brain Inspired Computing

Antonio Pullini; Francesco Conti; Davide Rossi; Igor Loi; Michael Gautschi; Luca Benini

Convolutional neural networks (CNNs) have revolutionized computer vision, speech recognition, and other fields requiring strong classification capabilities. These strengths make CNNs appealing in edge node Internet-of-Things (IoT) applications requiring near-sensors processing. Specialized CNN accelerators deliver significant performance per watt and satisfy the tight constraints of deeply embedded devices, but they cannot be used to implement arbitrary CNN topologies or nonconventional sensory algorithms where CNNs are only a part of the processing stack. A higher level of flexibility is desirable for next generation IoT nodes. Here, we present Mia Wallace, a 65-nm system-on-chip integrating a near-threshold parallel processor cluster tightly coupled with a CNN accelerator: it achieves peak energy efficiency of 108 GMAC/s/W at 0.72 V and peak performance of 14 GMAC/s at 1.2 V, leaving 1.2 GMAC/s available for general-purpose parallel processing.

Collaboration


Dive into Michael Gautschi's collaboration.

Top Co-Authors

Igor Loi

University of Bologna
