Michael Schaffner | Researchain

Publication

Featured research published by Michael Schaffner.

international solid-state circuits conference | 2016

4.6 A 65nm CMOS 6.4-to-29.2pJ/FLOP@0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster

Michael Gautschi; Michael Schaffner; Frank K. Gürkaynak; Luca Benini

Energy-efficient computing and ultra-low-power operation are requirements for many application areas, such as IoT and wearables. While for some applications, integer and fixed-point processor instructions suffice, others (e.g. simultaneous localization and mapping - SLAM, stereo vision, nonlinear regression and classification) require a larger dynamic range, typically obtained using single/double-precision floating point (FP) instructions. Logarithmic number systems (LNS) have been proposed [1,2] as an energy-efficient alternative to conventional FP, as several complex operations such as MUL, DIV, and EXP translate into simpler arithmetic operations in the logarithmic space and can be efficiently calculated using integer arithmetic units. However, ADD and SUB become nonlinear and have to be approximated by look-up tables (LUTs) and interpolation, which is typically implemented in a dedicated LNS unit (LNU) [1,2]. The area of LNUs grows exponentially with the desired precision, and an LNU with accuracy comparable to IEEE single-precision format is larger than a traditional floating-point unit (FPU). However, we show that in multi-core systems optimized for ultra-low-power operation such as the PULP system [3], one LNU can be efficiently shared in a cluster as indicated in Fig. 4.6.1. This arrangement not only reduces the per-core area overhead, but more importantly, allows several costly operations such as FP MUL/DIV to be processed without contention within the integer cores without additional overhead. We show that for typical nonlinear processing tasks, our LNU design can be up to 4.2× more energy efficient than a private-FP design.
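The LNS trade-off described in this abstract can be illustrated with a minimal Python sketch. It is not the LNU design from the paper: the function names, the table step `STEP`, and the table range `MAX_D` are all illustrative assumptions, and sign and zero handling are omitted. Values are stored as base-2 logarithms, so MUL and DIV reduce to an addition or subtraction of the stored exponents, while ADD needs the nonlinear correction term log2(1 + 2^-d), approximated here with a small look-up table and linear interpolation.

```python
import math

def to_lns(x):
    """Encode a positive real as its base-2 logarithm (sign/zero handling omitted)."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive values only")
    return math.log2(x)

def from_lns(e):
    """Decode an LNS value back to a real number."""
    return 2.0 ** e

def lns_mul(a, b):
    """Multiplication is a plain addition in the log domain."""
    return a + b

def lns_div(a, b):
    """Division is a plain subtraction in the log domain."""
    return a - b

# Hypothetical table parameters, chosen only for illustration.
STEP = 0.125
MAX_D = 16.0
# Samples of f(d) = log2(1 + 2^-d) for d = 0, STEP, 2*STEP, ...
ADD_LUT = [math.log2(1.0 + 2.0 ** (-i * STEP)) for i in range(int(MAX_D / STEP) + 2)]

def lns_add(a, b):
    """log2(2^a + 2^b) = max(a, b) + f(|a - b|), with f taken from the LUT."""
    hi, lo = max(a, b), min(a, b)
    pos = (hi - lo) / STEP
    idx = int(pos)
    if idx >= len(ADD_LUT) - 1:
        return hi  # operands differ so much that the correction is negligible
    frac = pos - idx
    # Linear interpolation between two neighbouring table entries.
    return hi + ADD_LUT[idx] + frac * (ADD_LUT[idx + 1] - ADD_LUT[idx])
```

For example, `from_lns(lns_mul(to_lns(3.0), to_lns(4.0)))` reproduces the product up to floating-point rounding, whereas `from_lns(lns_add(to_lns(3.0), to_lns(4.0)))` is only approximate, with an error set by the table step. A finer step improves accuracy at the cost of a larger table, which mirrors the area growth with precision noted in the abstract.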

IEEE Transactions on Circuits and Systems for Video Technology | 2012

Analysis and VLSI Implementation of EWA Rendering for Real-Time HD Video Applications

Pierre Greisen; Michael Schaffner; Simon Heinzle; Marian Runo; Aljosa Smolic; Andreas Burg; Hubert Kaeslin; Markus H. Gross

Nonlinear image warping or image resampling is a necessary step in many current and upcoming video applications, such as video retargeting, stereoscopic 3-D mapping, and multiview synthesis. The challenges for real-time resampling include not only image quality but also available energy and computational power of the employed device. In this paper, we employ an elliptical-weighted average (EWA) rendering approach to 2-D image resampling. We extend the classical EWA framework for increased visual quality and provide a very large scale integration architecture for efficient view rendering. The resulting architecture is able to render high-quality video sequences in real time targeted for low-power applications in end-user display devices.

design automation conference | 2014

An Approximate Computing Technique for Reducing the Complexity of a Direct-Solver for Sparse Linear Systems in Real-Time Video Processing

Michael Schaffner; Frank K. Gürkaynak; Aljoscha Smolic; Hubert Kaeslin; Luca Benini

Many video processing algorithms are formulated as least-squares problems that result in large, sparse linear systems. Solving such systems in real time is very demanding. This paper focuses on reducing the computational complexity of a direct Cholesky-decomposition-based solver. Our approximation scheme builds on the observation that, in well-conditioned problems, many elements in the decomposition nearly vanish. Such elements may be pruned from the dependency graph with mild accuracy degradation. Using an example from image-domain warping, we show that pruning reduces the number of operations per solve by over 75 %, resulting in significant savings in computing time, area or energy.
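The idea of dropping nearly vanishing factor entries can be sketched as follows. This is a simple threshold-based variant written for illustration, not the pruning scheme from the paper: the names `pruned_cholesky` and `max_abs_error` and the `threshold` parameter are assumptions. The sketch uses dense lists for clarity and still computes each entry before discarding it. In a real solver the savings would come from never scheduling the pruned entries or the operations that depend on them.

```python
import math

def pruned_cholesky(A, threshold=0.0):
    """Return (L, pruned) with A ~= L * L^T for a symmetric positive definite A.

    Off-diagonal entries of L with magnitude below `threshold` are set to
    zero, so later columns no longer depend on them.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    pruned = 0
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("breakdown: pivot is not positive")
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            v = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
            if v != 0.0 and abs(v) < threshold:
                pruned += 1  # count only entries that were actually dropped
                v = 0.0
            L[i][j] = v
    return L, pruned

def max_abs_error(A, L):
    """Largest entry-wise difference between A and L * L^T."""
    n = len(A)
    return max(abs(A[i][j] - sum(L[i][k] * L[j][k] for k in range(n)))
               for i in range(n) for j in range(n))
```

With `threshold=0.0` the routine is an ordinary Cholesky factorization. Raising the threshold trades reconstruction error for fewer nonzero entries, and `max_abs_error` gives a quick check of how much accuracy a given threshold costs on a particular matrix.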

design, automation, and test in europe | 2015

DRAM or no-DRAM?: exploring linear solver architectures for image domain warping in 28 nm CMOS

Michael Schaffner; Frank K. Gürkaynak; Aljoscha Smolic; Luca Benini

Solving large optimization problems within the energy and cost budget of mobile SoCs in real-time is a challenging task and motivates the development of specialized hardware accelerators. We present an evaluation of different linear solvers suitable for least-squares problems emanating from image processing applications such as image domain warping. In particular, we estimate implementation costs in 28 nm CMOS technology, with focus on trading on-chip memory vs. off-chip (DRAM) bandwidth. Our assessment shows large differences in circuit area, throughput and energy consumption and aims at providing a recommendation for selecting a suitable architecture. Our results emphasize that DRAM-free accelerators are an attractive choice in terms of power consumption and overall system complexity, even though they require more logic silicon area when compared to accelerators that make use of external DRAM.

IEEE Transactions on Circuits and Systems for Video Technology | 2016

Hybrid ASIC/FPGA System for Fully Automatic Stereo-to-Multiview Conversion Using IDW

Michael Schaffner; Frank K. Gürkaynak; Pierre Greisen; Hubert Kaeslin; Luca Benini; Aljosa Smolic

Recently, multiview autostereoscopic displays (MADs), which enable a limited glasses-free 3D experience, have become commercially available. The main problem of MADs is that they require several (typically eight or nine) views, while most of the 3D video content is in stereoscopic 3D today. In order to bridge this gap, the research community started to devise automatic multiview synthesis (MVS) methods. These algorithms require real-time processing and should be portable to end-user devices to develop their full potential. To this end, we revisit an algorithmic solution based on image domain warping (IDW) and devise a hardware architecture of a complete synthesis pipeline, provide insights into where the computationally challenging parts are, and present implementation results of a hybrid field programmable gate array/application-specific integrated circuit prototype, which is the first hardware implementation of a complete IDW-based MVS system. Based on these results, we also estimate the complexity and energy efficiency of a fully integrated solution in 65- and 28-nm CMOS technology and show that a full-high-definition real-time solution on a single chip is within reach. The proposed architecture could be used as a coprocessor in a system-on-chip targeting 3D TV sets, thereby enabling efficient content generation with limited user interaction (e.g., depth range adjustment) in real time.

design, automation, and test in europe | 2016

High-efficiency logarithmic number unit design based on an improved cotransformation scheme

Youri Popoff; Florian Scheidegger; Michael Schaffner; Michael Gautschi; Frank K. Gürkaynak; Luca Benini

The logarithmic number system (LNS) has always been an interesting alternative for floating point calculations since the implementation of several arithmetic operations such as divisions, exponentiations and square-roots, which are required for computationally intensive nonlinear functions, is greatly simplified in the logarithmic space. However, additions and subtractions become nonlinear operations that have to be approximated using polynomials for area-efficient realizations. A particular challenge is the accuracy within the so-called critical region, which is encountered for subtractions where the difference between the operands is close to zero. In the literature, several arithmetic cotransformations that reduce the overhead of approximating these operations have been presented. Even so, the main problem with practical LNS realizations is the area overhead when compared to standard FPUs with comparable accuracy. In this paper, we propose a highly hardware-efficient novel cotransformation concept that not only reduces the area requirements by up to 35% when compared to the state-of-the-art, but also allows the LNU to calculate single-cycle logarithms and exponentiations within the same datapath. We present comprehensive results for a complete processing system that includes the LNU and an OpenRISC-based core in 65nm and 28nm technologies. We compare this implementation with a system using a standard IEEE-compliant FPU and show that the LNS-based system can outperform its FP counterpart by up to 4.35× in speed. The final, pipelined LNU system, when implemented in 65nm, occupies an area of 54.3 kGE, allows 89 MFLOP per second and consumes 15.9-136.7 pJ per operation at 1.2V under typical conditions and 25°C.

european solid-state circuits conference | 2013

MADmax: A 1080p stereo-to-multiview rendering ASIC in 65 nm CMOS based on image domain warping

Michael Schaffner; Pierre Greisen; Simon Heinzle; Frank K. Gürkaynak; Hubert Kaeslin; Aljoscha Smolic

In this paper, a video rendering ASIC for multiview automultiscopic displays using an image domain warping approach is presented. The video rendering core is able to synthesize up to nine interleaved views from full-HD (1080p) stereoscopic 3D input footage. The design employs elliptical weighted average (EWA) splatting to perform the image resampling. We use the mathematical properties of the Gaussian filters of EWA splatting to analytically integrate display anti-aliasing into the resampling step. The use of realistic assumptions on the image transformation enables a hardware architecture that operates on a video stream in scan-line fashion and that does not require an off-chip memory. The ASIC, fabricated in a 65nm CMOS technology, runs at 260MHz and is able to deliver 28.7 interleaved full-HD (1080p) frames per second with eight views enabled. It has a core power dissipation of 550mW and its complexity is 6.8 MGE, including 4.36 MBit SRAM macros.
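The Gaussian property referred to in this abstract is that convolving two Gaussians yields another Gaussian whose covariance is the sum of the two. In the standard EWA formulation, a reconstruction kernel warped by the local Jacobian J and an anti-aliasing prefilter therefore combine into a single kernel with covariance J V_r J^T + V_h. The Python sketch below illustrates only this general principle. The function names and the isotropic kernel widths `sigma_recon` and `sigma_aa` are illustrative assumptions, not parameters of the ASIC.

```python
import math

def ewa_covariance(J, sigma_recon=0.5, sigma_aa=0.5):
    """Covariance of the combined EWA kernel for a 2x2 warp Jacobian J.

    V = sigma_recon^2 * J * J^T + sigma_aa^2 * I
    (isotropic reconstruction and anti-aliasing kernels are assumed).
    """
    (a, b), (c, d) = J
    r2, h2 = sigma_recon ** 2, sigma_aa ** 2
    off = r2 * (a * c + b * d)
    return ((r2 * (a * a + b * b) + h2, off),
            (off, r2 * (c * c + d * d) + h2))

def gaussian_weight(V, dx, dy):
    """Evaluate the 2-D Gaussian with covariance V at offset (dx, dy)."""
    (p, q), (_, s) = V
    det = p * s - q * q
    # Quadratic form x^T V^-1 x using the closed-form 2x2 inverse.
    m = (s * dx * dx - 2.0 * q * dx * dy + p * dy * dy) / det
    return math.exp(-0.5 * m) / (2.0 * math.pi * math.sqrt(det))
```

A splatting loop would evaluate `gaussian_weight` for the output pixels around each warped input pixel, accumulate the weighted colors and the weights, and normalize at the end. Because the anti-aliasing term is folded into `V`, no separate filtering pass is needed in such a scheme.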

2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC) | 2012

A general-transformation EWA view rendering engine for 1080p video in 130 nm CMOS

Pierre Greisen; Richard Emler; Michael Schaffner; Simon Heinzle; Frank K. Gürkaynak

Current digital video pipelines are progressing towards ever higher image resolutions and frame rates, a trend which increases computational requirements of mobile end-user devices. Moreover, due to the wide variety of devices with display sub-systems, video streams often need to be adapted to the capabilities of the respective platform. In this work, we present a rendering core that is able to perform spatially-varying geometrical transforms with implicit anti-aliasing in real-time on high-definition video. The rendering is realized with a high-quality elliptical weighted average (EWA) splatting algorithm. The ASIC implementation is fabricated in a 130nm CMOS technology, and is equipped with a standard display interface and a QDRII RAM interface. The ASIC achieves at least 1080p30 (full HD) video I/O, and is able to perform per-pixel transformation on the video stream in real-time and at low latency.

sensors applications symposium | 2017

StoneNode: A low-power sensor device for induced rockfall experiments

Pascal A. Niklaus; Thomas Birchler; Tim Aebi; Michael Schaffner; Lukas Cavigelli; Andrin Caviezel; Michele Magno; Luca Benini

Spontaneously occurring rockfalls are a serious danger, especially nowadays as global warming leads to a retrogression of the permafrost, which stabilized terrain in mountainous regions. In order to perform risk assessments and develop mitigation strategies, advanced simulation tools and models have been developed in recent years. These models come with many parameters and need to be calibrated and validated with real-world data to produce reliable estimates. To this end, we developed StoneNode, a rugged, small, low-power sensor device which can be embedded into boulders to measure accelerations and angular velocities. The node employs low-power MEMS sensors with high dynamic range and has a maximum operating time of more than 56 h. First field experiments confirm that the StoneNode is a reliable, easy-to-use device, which greatly facilitates the data acquisition process.

IEEE Journal of Solid-state Circuits | 2017

An Extended Shared Logarithmic Unit for Nonlinear Function Kernel Acceleration in a 65-nm CMOS Multicore Cluster

Michael Gautschi; Michael Schaffner; Frank K. Gürkaynak; Luca Benini

Energy-efficient computing and ultralow-power computing are strong requirements for various application areas, such as internet of things and wearables. While for some applications integer and fixed-point arithmetic suffice, others require a larger dynamic range, typically obtained using floating-point (FP) numbers. Logarithmic number systems (LNSs) have been proposed as an energy-efficient alternative, since several complex FP operations translate into simple integer operations. However, additions and subtractions become nonlinear operations, which have to be approximated via interpolation. Even efficient LNS units (LNUs) are still larger than standard FP units (FPUs), rendering them impractical for most general-purpose processors. We show that, when shared among several cores, LNUs become a very attractive solution. A series of compact LNUs is developed, which provide significantly more functionality (such as transcendental functions) than other state-of-the-art designs. This allows, for example, the atan2 function to be evaluated with three instructions for only 183.2 pJ/op at 0.8 V. We present the first shared-LNU architecture where these LNUs have been integrated into a multicore system with four 32-b OpenRISC cores and show measurement results demonstrating that the shared-LNU design can be up to 4.1× more energy-efficient in common nonlinear processing kernels, compared with a similar-area design with four private FPUs.
