Magnus Själander | Researchain

Publication

Featured research published by Magnus Själander.

International Symposium on Circuits and Systems | 2006

Multiplier reduction tree with logarithmic logic depth and regular connectivity

Henrik Eriksson; Per Larsson-Edefors; Mary Sheeran; Magnus Själander; Daniel Johansson; Martin Schölin

A novel partial-product reduction circuit for use in integer multiplication is presented. The high-performance multiplier (HPM) reduction tree has the ease of layout of a simple carry-save reduction array, but is in fact a high-speed low-power Dadda-style tree having a worst-case delay which depends on the logarithm (O(log N)) of the word length N.

IEEE Transactions on Very Large Scale Integration Systems | 2009

Multiplication Acceleration Through Twin Precision

Magnus Själander; Per Larsson-Edefors

We present the twin-precision technique for integer multipliers. The twin-precision technique can reduce the power dissipation by adapting a multiplier to the bitwidth of the operands being computed. The technique also enables an increased computational throughput, by allowing several narrow-width operations to be computed in parallel. We describe how to apply the twin-precision technique also to signed multiplier schemes, such as Baugh-Wooley and modified-Booth multipliers. It is shown that the twin-precision delay penalty is small (5%-10%) and that a significant reduction in power dissipation (40%-70%) can be achieved, when operating on narrow-width operands. In an application case study, we show that by extending the multiplier of a general-purpose processor with the twin-precision scheme, the execution time of a Fast Fourier Transform is reduced by 15% at a 14% reduction in datapath energy dissipation. All our evaluations are based on layout-extracted data from multipliers implemented in 130-nm and 65-nm commercial process technologies.

Digital Systems Design | 2008

A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures

Magnus Själander; Andrei Terechko; Marc Duranton

Efficient utilization of multi-core architectures relies on the partitioning of applications into tasks and mapping the tasks to cores. In some applications (e.g. H.264 video decoding parallelized at macro-block level) these tasks have dependencies among each other. Task scheduling, consisting of selecting a task with satisfied dependencies and mapping it to a core, is typically a functionality delegated to the operating system. In this paper we present a hardware Task Management Unit (TMU) that looks ahead in time to find tasks to be executed by a multi-core architecture. The look-ahead functionality is shown to reduce the task management overhead by 40-50% when executing a parallelized version of an H.264 video decoder on an architecture with up to 16 cores. Overall, the TMU-based multi-core architecture reaches a speedup of more than 14 times on 16 cores running H.264 video decoding, assuming CABAC is implemented in a dedicated coprocessor.
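The scheduling step described in this abstract (select a task whose dependencies are satisfied, then map it to a core) can be illustrated in software. The following is a minimal sketch, not the TMU hardware and not its look-ahead mechanism: the function names, the unit task duration, and the dependency pattern (each macroblock waits for its left and upper-right neighbours) are assumptions made for illustration only.

```python
from collections import deque


def wavefront_dependencies(width, height):
    """Hypothetical macroblock dependency graph (an assumption, not from the paper).

    Macroblock (x, y) waits for its left neighbour and its upper-right
    neighbour; in the last column it waits for the upper neighbour instead.
    """
    deps = {}
    for y in range(height):
        for x in range(width):
            d = []
            if x > 0:
                d.append((x - 1, y))
            if y > 0:
                d.append((x + 1, y - 1) if x + 1 < width else (x, y - 1))
            deps[(x, y)] = d
    return deps


def schedule(deps, cores):
    """Greedy list scheduling with unit-time tasks; returns the number of time steps."""
    remaining = {task: len(d) for task, d in deps.items()}
    successors = {task: [] for task in deps}
    for task, d in deps.items():
        for pred in d:
            successors[pred].append(task)

    # Tasks with no unsatisfied dependencies are ready to run.
    ready = deque(task for task, n in remaining.items() if n == 0)
    steps = 0
    finished = 0
    while ready:
        # Map up to `cores` ready tasks to cores for this time step.
        running = [ready.popleft() for _ in range(min(cores, len(ready)))]
        steps += 1
        for task in running:
            finished += 1
            # Completing a task may release its successors.
            for succ in successors[task]:
                remaining[succ] -= 1
                if remaining[succ] == 0:
                    ready.append(succ)
    if finished != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return steps
```

For a 4 x 3 grid under these assumptions, `schedule(wavefront_dependencies(4, 3), cores=16)` takes 8 steps, while a single core takes 12, which shows how the dependencies, rather than the core count, bound the achievable parallelism.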

International Conference on Electronics, Circuits, and Systems | 2008

High-speed and low-power multipliers using the Baugh-Wooley algorithm and HPM reduction tree

Magnus Själander; Per Larsson-Edefors

The modified-Booth algorithm is extensively used for high-speed multiplier circuits. Once, when array multipliers were used, the reduced number of generated partial products significantly improved multiplier performance. In designs based on reduction trees with logarithmic logic depth, however, the reduced number of partial products has a limited impact on overall performance. The Baugh-Wooley algorithm is a different scheme for signed multiplication, but is not so widely adopted because it may be complicated to deploy on irregular reduction trees. We use the Baugh-Wooley algorithm in our High Performance Multiplier (HPM) tree, which combines a regular layout with a logarithmic logic depth. We show for a range of operator bit-widths that, when implemented in 130-nm and 65-nm process technologies, the Baugh-Wooley multipliers exhibit comparable delay, less power dissipation and smaller area footprint than modified-Booth multipliers.
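To make the Baugh-Wooley scheme concrete, here is a bit-level sketch of the commonly taught (modified) Baugh-Wooley formulation for two's-complement operands. It is a behavioural model only, assuming the textbook arrangement: partial products that involve exactly one sign bit are inverted, and constant ones are added in columns n and 2n-1. The function name and loop structure are illustrative and do not reflect the paper's circuit or the HPM tree.

```python
def baugh_wooley_multiply(a, b, n=8):
    """Multiply two n-bit two's-complement integers with Baugh-Wooley partial products."""
    mask = (1 << n) - 1
    a &= mask  # two's-complement bit pattern of a
    b &= mask  # two's-complement bit pattern of b

    # Constant correction bits in columns n and 2n-1.
    total = (1 << n) + (1 << (2 * n - 1))

    for i in range(n):          # bit index into b
        for j in range(n):      # bit index into a
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            # Invert partial products that involve exactly one sign bit.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)

    # Keep 2n bits and reinterpret them as a signed value.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

Because every partial product is a single AND or NAND gate output, the array keeps the same regular shape as an unsigned multiplier, which is the property the abstract relies on when pairing the algorithm with a regular reduction tree.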

International Symposium on Circuits and Systems | 2005

A low-leakage twin-precision multiplier using reconfigurable power gating

Magnus Själander; Mindaugas Drazdziulis; Per Larsson-Edefors; Henrik Eriksson

A twin-precision multiplier that uses reconfigurable power gating is presented. Employing power cut-off techniques in independently controlled power-gating regions yields significant static leakage reductions when half-precision multiplications are carried out. In comparison to a conventional 8-bit tree multiplier, the power overhead of a 16-bit twin-precision multiplier operating at 8-bit precision has been reduced by 53% when reconfigurable power gating based on the SCCMOS power cut-off technique was applied.

International Conference on Computer Design | 2004

An efficient twin-precision multiplier

Magnus Själander; Henrik Eriksson; Per Larsson-Edefors

We present a twin-precision multiplier that in normal operation mode efficiently performs N-b multiplications. For applications where the demand on precision is relaxed, the multiplier can perform N/2-b multiplications while expending only a fraction of the energy of a conventional N-b multiplier. For applications with high demands on throughput, the multiplier is capable of performing two independent N/2-b multiplications in parallel. A comparison between two signed 16-b multipliers, where both perform single 8-b multiplications, shows that the twin-precision multiplier has 72% lower power dissipation and 15% higher speed than the conventional one, while only requiring 8% more transistors.
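The three operating modes described in this abstract can be modelled at the partial-product level. The sketch below is a simplified illustration for unsigned operands only: it assumes that narrow-width modes are obtained by forcing selected partial products to zero, and the function name and the mode strings `"full"`, `"low"`, and `"twin"` are hypothetical. Signed operation needs additional handling that is not shown here.

```python
def twin_precision_multiply(a, b, n=16, mode="full"):
    """Unsigned behavioural model of an n-bit multiplier with narrow-width modes.

    mode="full": one n-bit multiplication.
    mode="low":  one n/2-bit multiplication on the low halves of a and b.
    mode="twin": two independent n/2-bit multiplications; the low product
                 occupies result bits [0, n) and the high product bits [n, 2n).
    """
    half = n // 2
    result = 0
    for i in range(n):          # bit index into b
        for j in range(n):      # bit index into a
            i_low, j_low = i < half, j < half
            if mode == "full":
                keep = True
            elif mode == "low":
                keep = i_low and j_low
            elif mode == "twin":
                # Drop partial products that mix a low half with a high half.
                keep = i_low == j_low
            else:
                raise ValueError(f"unknown mode: {mode}")
            if keep:
                result += (((a >> j) & 1) & ((b >> i) & 1)) << (i + j)
    return result
```

In `"twin"` mode with `n=16`, the low product is below 2^16 and the high product is shifted up by 16 bits, so the two results never overlap in the 32-bit output.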

International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2007

FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

Martin Thuresson; Magnus Själander; Magnus Björk; Lars Svensson; Per Larsson-Edefors; Per Stenström

We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect to allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding as proposed in the FlexSoC

Application Specific Systems Architectures and Processors | 2010

Design space exploration for an embedded processor with flexible datapath interconnect

Tung Thanh Hoang; Ulf Jälmbrant; Erik der Hagopian; Kasyab Parmesh Subramaniyan; Magnus Själander; Per Larsson-Edefors

The design of an embedded processor is dependent on the application domain. Traditionally, design solutions specific to an application domain have been available in three forms: VLIW-based DSP processors, ASICs and FPGAs; each respectively offering generality of application domain, energy efficiency and flexibility. However, while matching the application domain to the resources needed, the design space becomes huge. We present FlexTools, a tool framework built around the FlexCore architecture to evaluate performance and energy efficiency for different applications. Here we demonstrate FlexTools for design space exploration with a focus on the data-routing flexibility of the FlexCore processor, in search of energy-efficient interconnect configurations that are both cycle-count and hardware efficient. Evaluation results suggest that a well-optimized instance of a 65-nm multiplier-extended FlexCore processor datapath, obtained using FlexTools, executes nine integer EEMBC benchmarks with a 15% cycle count reduction and dissipates 17% less energy than a reference MIPS datapath.

IEEE Computer Society Annual Symposium on VLSI | 2007

A Flexible Datapath Interconnect for Embedded Applications

Magnus Själander; Per Larsson-Edefors; Magnus Björk

We investigate the effects of introducing a flexible interconnect into an exposed datapath. We define an exposed datapath as a traditional GPP datapath that has its normal control removed, leading to the exposure of a wide control word. For an FFT benchmark, the introduction of a flexible interconnect reduces the total execution time by 16%. Compared to a traditional GPP, the execution time for an exposed datapath using a flexible interconnect is 32% shorter whereas the energy dissipation is 29% lower. Our investigation is based on a cycle-accurate architectural simulator and figures on delay, power, and area are obtained from placed-and-routed layouts in a commercial 0.13-µm technology. The results from our case studies indicate that by utilizing a flexible interconnect, significant performance gains can be achieved for generic applications.

International Symposium on Performance Analysis of Systems and Software | 2012

An LTE Uplink Receiver PHY benchmark and subframe-based power management

Magnus Själander; Sally A. McKee; Peter Brauer; David Engdal; András Vajda

With the proliferation of mobile phones and other mobile internet appliances, the application area of baseband processing continues to grow in importance. Much academic research addresses the underlying mathematics, but little has been published on the design of systems to execute baseband workloads. Most systems research is conducted within companies who go to great lengths to protect their intellectual property. We present an open-source LTE Uplink Receiver PHY benchmark with a realistic representation of the baseband processing of an LTE base station, and we demonstrate its usefulness in investigating resource management strategies to conserve power on a TILEPro64. By estimating the workload of each subframe and using these estimates to control power-gating, we reduce power consumption by more than 24% (11% on average) compared to executing the benchmark with no estimation-guided resource management. By making available a benchmark containing no proprietary algorithms, we enable a broader community to conduct research both in baseband processing and on the systems that are used to execute such workloads.