Per Larsson-Edefors | Researchain

Publication

Featured research published by Per Larsson-Edefors.

international symposium on circuits and systems | 2006

Multiplier reduction tree with logarithmic logic depth and regular connectivity

Henrik Eriksson; Per Larsson-Edefors; Mary Sheeran; Magnus Själander; Daniel Johansson; Martin Schölin

A novel partial-product reduction circuit for use in integer multiplication is presented. The high-performance multiplier (HPM) reduction tree has the ease of layout of a simple carry-save reduction array, but is in fact a high-speed low-power Dadda-style tree having a worst-case delay which depends on the logarithm (O(log N)) of the word length N
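As a rough illustration of the logarithmic dependence on the word length N, the sketch below counts the reduction stages of a textbook Dadda tree, using the standard height sequence 2, 3, 4, 6, 9, 13, ... in which each stage can shrink the column height by a factor of about 3/2. This is an assumption-laden model of a generic Dadda reduction, not the HPM wiring described in the paper, and the function name `dadda_stages` is hypothetical.

```python
def dadda_stages(n):
    """Number of carry-save reduction stages a textbook Dadda tree
    needs to bring n partial-product rows down to 2 rows.

    Illustrative only: models the generic Dadda height sequence,
    not the specific HPM reduction tree.
    """
    heights = [2]                      # final height before the adder
    while heights[-1] < n:
        # each earlier stage may be up to floor(1.5 * next) rows tall
        heights.append(heights[-1] * 3 // 2)
    return len(heights) - 1


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, dadda_stages(n))
```

Doubling the word length adds only a couple of stages, whereas a simple carry-save array grows linearly with N.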

IEEE Transactions on Very Large Scale Integration Systems | 2009

Multiplication Acceleration Through Twin Precision

Magnus Själander; Per Larsson-Edefors

We present the twin-precision technique for integer multipliers. The twin-precision technique can reduce the power dissipation by adapting a multiplier to the bitwidth of the operands being computed. The technique also enables an increased computational throughput, by allowing several narrow-width operations to be computed in parallel. We describe how to apply the twin-precision technique also to signed multiplier schemes, such as Baugh-Wooley and modified-Booth multipliers. It is shown that the twin-precision delay penalty is small (5%-10%) and that a significant reduction in power dissipation (40%-70%) can be achieved, when operating on narrow-width operands. In an application case study, we show that by extending the multiplier of a general-purpose processor with the twin-precision scheme, the execution time of a Fast Fourier Transform is reduced by 15% at a 14% reduction in datapath energy dissipation. All our evaluations are based on layout-extracted data from multipliers implemented in 130-nm and 65-nm commercial process technologies.

international conference on electronics circuits and systems | 1999

A leakage-tolerant multi-phase keeper for wide domino circuits

Atila Alvandpour; Per Larsson-Edefors; Christer Svensson

The usefulness and the performance of high fan-in domino gates, having transistor stacks with few transistors, are seriously affected by the increasing subthreshold leakage current. In this paper we present a novel static keeper, which has variable driving strength in the evaluation phase. The keeper has a low gain during the output transition and, conditionally, a high gain during the rest of the evaluation phase. Hence, a larger amount of leakage and noise can be tolerated at no expense in delay.

system on chip conference | 2010

A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

Tung Thanh Hoang; Magnus Själander; Per Larsson-Edefors

We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers, and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-extension solution, implements all other functionality. Place-and-route evaluations using a 65-nm 1.1-V cell library show that the proposed architecture offers a 31% improvement in speed and a 32% reduction in energy per operation, averaged across operand sizes of 16, 32, 48, and 64 bits, over a reference two-cycle MAC architecture that employs a multiplier in the first stage and an accumulator in the second. When operating the proposed architecture at the lower frequency of the reference architecture, the available timing slack can be used to downsize gates, resulting in a 52% reduction in energy compared to the reference. We extend the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands. In comparison to a fixed-function 32-bit MAC unit, 16-bit multiply-accumulate operations can be executed with 67% higher energy efficiency on a 32-bit DTMAC unit.
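To make the roles of guard bits and saturation concrete, here is a minimal behavioral sketch of a signed multiply-accumulate step. It assumes an accumulator of 2N + G bits (N-bit operands, G guard bits) that clamps to the representable range on overflow. The widths, the default values, and the names `saturate` and `mac` are illustrative assumptions; the sketch says nothing about the two-stage pipeline or the sign-extension scheme of the proposed architecture.

```python
def saturate(value, width):
    """Clamp a signed integer to the range of a two's complement
    word that is `width` bits wide."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, value))


def mac(acc, a, b, n=16, guard=8):
    """One multiply-accumulate step: acc + a * b, saturated to an
    accumulator of 2*n + guard bits (assumed widths, for illustration)."""
    return saturate(acc + a * b, 2 * n + guard)


if __name__ == "__main__":
    print(mac(0, 3, 4))                    # small values pass through
    print(mac(500, 7, 7, n=4, guard=2))    # clamps at the 10-bit maximum
```

The guard bits let many products accumulate before the clamp is ever reached; saturation only decides what happens once that headroom runs out.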

international conference on electronics, circuits, and systems | 2008

High-speed and low-power multipliers using the Baugh-Wooley algorithm and HPM reduction tree

Magnus Själander; Per Larsson-Edefors

The modified-Booth algorithm is extensively used for high-speed multiplier circuits. Once, when array multipliers were used, the reduced number of generated partial products significantly improved multiplier performance. In designs based on reduction trees with logarithmic logic depth, however, the reduced number of partial products has a limited impact on overall performance. The Baugh-Wooley algorithm is a different scheme for signed multiplication, but is not so widely adopted because it may be complicated to deploy on irregular reduction trees. We use the Baugh-Wooley algorithm in our High Performance Multiplier (HPM) tree, which combines a regular layout with a logarithmic logic depth. We show for a range of operator bit-widths that, when implemented in 130-nm and 65-nm process technologies, the Baugh-Wooley multipliers exhibit comparable delay, less power dissipation and smaller area footprint than modified-Booth multipliers.
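For readers unfamiliar with the scheme, the following is a bit-level sketch of the commonly taught (modified) Baugh-Wooley formulation for N-bit two's complement operands: the partial products that involve exactly one sign bit are inverted, and the constants 2^N and 2^(2N-1) are added, so that every term in the array is non-negative. This is a software model under those textbook assumptions, not the paper's circuit, and the function name is hypothetical.

```python
def baugh_wooley_multiply(a, b, n):
    """Signed n-bit x n-bit multiplication via the Baugh-Wooley
    partial-product array (textbook formulation, illustrative only)."""
    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask        # two's complement bit patterns

    def bit(x, k):
        return (x >> k) & 1

    total = 0
    for i in range(n):
        for j in range(n):
            pp = bit(ua, i) & bit(ub, j)
            # invert partial products that involve exactly one sign bit
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # correction constants of the Baugh-Wooley scheme
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1        # keep the 2n-bit result

    # reinterpret the 2n-bit pattern as a signed value
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


if __name__ == "__main__":
    print(baugh_wooley_multiply(-3, 5, 4))
```

Because the array keeps the same rectangular shape as an unsigned multiplier, it maps naturally onto a regular reduction structure.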

international symposium on quality electronic design | 2006

Parameterizable Architecture-Level SRAM Power Model Using Circuit-Simulation Backend for Leakage Calibration

Minh Quang Do; Mindaugas Drazdziulis; Per Larsson-Edefors; Lars Bengtsson

We propose an accurate architecture-level power estimation method for SRAM memories. This hybrid method is composed of an analytical part for dynamic power estimation and a circuit-simulation backend used to obtain static leakage power values of all basic memory components. The method is flexible in that memory size is an arbitrary parameter. In a comparison to circuit-level simulations (Hspice) of complete 2-KByte and 8-KByte 6T-SRAM memories implemented both in 0.13-µm and 65-nm (BPTM) bulk CMOS processes, the proposed method shows a high accuracy in estimating leakage power

international symposium on circuits and systems | 2005

A low-leakage twin-precision multiplier using reconfigurable power gating

Magnus Själander; Mindaugas Drazdziulis; Per Larsson-Edefors; Henrik Eriksson

A twin-precision multiplier that uses reconfigurable power gating is presented. Employing power cut-off techniques in independently controlled power-gating regions yields significant static leakage reductions when half-precision multiplications are carried out. In comparison to a conventional 8-bit tree multiplier, the power overhead of a 16-bit twin-precision multiplier operating at 8-bit precision has been reduced by 53% when reconfigurable power gating based on the SCCMOS power cut-off technique was applied.

european solid-state circuits conference | 2003

A gate leakage reduction strategy for future CMOS circuits

Mindaugas Drazdziulis; Per Larsson-Edefors

We show that a technique previously introduced for sub-threshold leakage reduction can be effectively used to reduce gate leakage dissipation in future CMOS circuits operating in stand-by mode. The technique gave one order of magnitude gate leakage savings with a certain input pattern for the evaluated two-input NAND gate. Also, we make a detailed analysis of mechanisms causing different direct oxide tunnelling currents that contribute to gate leakage power dissipation in future CMOS circuits.

international conference on computer design | 2004

An efficient twin-precision multiplier

Magnus Själander; Henrik Eriksson; Per Larsson-Edefors

We present a twin-precision multiplier that in normal operation mode efficiently performs N-b multiplications. For applications where the demand on precision is relaxed, the multiplier can perform N/2-b multiplications while expending only a fraction of the energy of a conventional N-b multiplier. For applications with high demands on throughput, the multiplier is capable of performing two independent N/2-b multiplications in parallel. A comparison between two signed 16-b multipliers, where both perform single 8-b multiplications, shows that the twin-precision multiplier has 72% lower power dissipation and 15% higher speed than the conventional one, while only requiring 8% more transistors.
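The three operating modes can be modeled at the partial-product level. The sketch below assumes unsigned operands and assumes that narrow-width modes are obtained by forcing unwanted partial products to zero: keeping only the low quadrant of the array gives a single N/2-bit product, and keeping the two diagonal quadrants gives two independent N/2-bit products packed into the low and high halves of the result. The function name and mode labels are hypothetical, and signed operation is not covered.

```python
def twin_precision_multiply(a, b, n, mode):
    """Behavioral model of an unsigned n-bit twin-precision multiplier.

    mode = "full": one n-bit multiplication
    mode = "low" : one n/2-bit multiplication on the low operand halves
    mode = "dual": two n/2-bit multiplications in parallel
    (mode names are illustrative assumptions)
    """
    h = n // 2
    if mode == "full":
        keep = lambda i, j: True
    elif mode == "low":
        keep = lambda i, j: i < h and j < h
    elif mode == "dual":
        # keep only partial products whose bits come from the same half
        keep = lambda i, j: (i < h) == (j < h)
    else:
        raise ValueError("unknown mode: " + mode)

    total = 0
    for i in range(n):
        for j in range(n):
            if keep(i, j):
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return total


if __name__ == "__main__":
    r = twin_precision_multiply(0xABCD, 0x1234, 16, "dual")
    print(hex(r & 0xFFFF), hex(r >> 16))   # low product, high product
```

In "dual" mode the low N bits of the result hold the product of the low operand halves and the high N bits hold the product of the high halves, since the two retained quadrants never overlap in weight.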

international conference on embedded computer systems: architectures, modeling, and simulation | 2007

FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

Martin Thuresson; Magnus Själander; Magnus Björk; Lars Svensson; Per Larsson-Edefors; Per Stenström

We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect to allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding as proposed in the FlexSoC