Martin Thuresson | Researchain

Publication

Featured research published by Martin Thuresson.

IEEE Transactions on Computers | 2008

Memory-Link Compression Schemes: A Value Locality Perspective

Martin Thuresson; Lawrence Spracklen; Per Stenström

As the speed of processors increases, the on-chip memory hierarchy will continue to be crucial for performance. Unfortunately, simply increasing the size of the on-chip caches yields diminishing returns, and memory-bound applications may suffer from the limited off-chip bandwidth. This paper focuses on memory-link compression schemes. The first contribution is a framework for identifying the nature of the value locality exploited by published schemes. This framework is then used to quantitatively establish what type of value locality is exploited by each compression scheme. We find that as much as 40 percent of the values transferred in integer, media, and commercial applications are small integers and can be coded using fewer than 8 bits. By leveraging small-value locality, 35 percent of the bandwidth can be freed up. Another significant chunk of the values either forms clusters in the value space or belongs to a fairly small group of frequent isolated values. By leveraging this category, one can free up 70 percent of the bandwidth. Finally, we contribute a new compression scheme that exploits multiple value-locality categories and is shown to free up 75 percent of the bandwidth.
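
The small-value idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's scheme: the 1-bit flag, the 8-bit short payload, the 32-bit word size, and the sample trace are all assumptions chosen only to show how sending small integers in a short form frees link bandwidth.

```python
def encoded_bits(value: int, small_bits: int = 8, word_bits: int = 32) -> int:
    """Bits sent for one word under a hypothetical flag-plus-payload code."""
    if 0 <= value < (1 << small_bits):
        return 1 + small_bits   # flag bit + short payload
    return 1 + word_bits        # flag bit + full, uncompressed word


def bandwidth_freed(values, small_bits: int = 8, word_bits: int = 32) -> float:
    """Fraction of link bits saved versus sending every word in full."""
    raw = len(values) * word_bits
    sent = sum(encoded_bits(v, small_bits, word_bits) for v in values)
    return 1.0 - sent / raw


# Made-up trace: five small values and three large ones.
trace = [0, 1, 7, 255, 0x12345678, 0xDEADBEEF, 3, 100000]
print(bandwidth_freed(trace))
```

Note that the flag bit is pure overhead on large values, so a trace with few small values would see negative savings under this toy code.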

International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2007

FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

Martin Thuresson; Magnus Själander; Magnus Björk; Lars Svensson; Per Larsson-Edefors; Per Stenström

We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect to allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding as proposed in the FlexSoC

High Performance Embedded Architectures and Compilers | 2008

A Flexible Code Compression Scheme Using Partitioned Look-Up Tables

Martin Thuresson; Magnus Själander; Per Stenström

Wide instruction formats let the compiler control microarchitecture resources more precisely, either by enabling more parallelism (VLIW) or by saving power. Unfortunately, wide instructions put high pressure on the memory system due to increased instruction-fetch bandwidth and a larger code working set/footprint. This paper presents a code compression scheme that allows the compiler to select, based on a profiling methodology, which subset of a wide instruction set to use in each program phase at the granularity of basic blocks. The decompression engine comprises a set of tables that dynamically convert a narrow instruction into a wide instruction. The paper also presents a method for configuring and dimensioning the decompression engine and for generating a compressed program with embedded instructions that dynamically manage the tables in the decompression engine. We find that the 77 control bits in the original FlexCore instruction format can be reduced to 32 bits, offering a compression of 58% and a modest performance overhead of less than 1% for management of the decompression tables.
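
A table-based decompression step of the kind described above might look like the following sketch. The abstract does not give the actual partitioning, so the two tables, their contents, and the field widths here are hypothetical; only the 77-bit and 32-bit figures come from the text.

```python
from typing import List, Sequence


def decompress(narrow_fields: Sequence[int],
               tables: Sequence[List[int]],
               widths: Sequence[int]) -> int:
    """Expand a narrow instruction into a wide control word.

    Each narrow field indexes its own look-up table; the table entries
    are concatenated, most significant partition first.
    """
    wide = 0
    for index, table, width in zip(narrow_fields, tables, widths):
        wide = (wide << width) | table[index]
    return wide


def compression(original_bits: int, compressed_bits: int) -> float:
    """Fraction of instruction bits removed."""
    return 1.0 - compressed_bits / original_bits


# Hypothetical tables: a 3-bit partition and a 4-bit partition.
tables = [[0b000, 0b101], [0b0000, 0b1111, 0b1010]]
widths = [3, 4]
wide_word = decompress((1, 2), tables, widths)   # 0b101 followed by 0b1010

print(bin(wide_word))
print(round(compression(77, 32), 2))   # figures quoted in the abstract
```

In this toy version the tables are fixed; the scheme in the abstract additionally embeds instructions in the program that reload the tables between program phases.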

Computing Frontiers | 2005

Evaluation of extended dictionary-based static code compression schemes

Martin Thuresson; Per Stenström

This paper evaluates how much extended dictionary-based code compression techniques can reduce the static code size. In their simplest form, such methods statically identify identical instruction sequences in the code and replace them by a codeword if they yield a smaller code size based on a heuristic. At run-time, the codeword is replaced by a dictionary entry storing the corresponding instruction sequence. Two previously proposed schemes are evaluated. The first scheme, as used in DISE, provides operand parameters to catch a larger number of identical instruction sequences. The second scheme replaces different instruction sequences with the same dictionary entry if they can be derived from it using a bit mask that can cancel out individual instructions. Additionally, this paper offers a third scheme, namely, to combine the two previously proposed schemes along with an off-line algorithm to compress the program. Our data shows that all schemes in isolation improve the compressibility. However, the most important finding is that the number of operand parameters has a significant effect on the compressibility. In addition, our proposed combined scheme can reduce the size of the dictionary and the number of codewords significantly, which can enable efficient implementations of extended dictionary-based code compression techniques.
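
The simplest form of dictionary-based compression described above can be sketched as follows. This is an illustration only: the fixed sequence length, the "occurs at least twice" rule (a crude stand-in for the size heuristic), the `("CW", k)` codeword representation, and the sample instructions are all assumptions, and neither operand parameters nor bit masks are modeled.

```python
from collections import Counter


def compress_program(program, seq_len=2):
    """Replace repeated instruction sequences with dictionary codewords."""
    windows = Counter(
        tuple(program[i:i + seq_len])
        for i in range(len(program) - seq_len + 1)
    )
    # Keep every sequence seen at least twice (stand-in heuristic).
    dictionary = [seq for seq, count in windows.most_common() if count >= 2]
    index = {seq: k for k, seq in enumerate(dictionary)}

    compressed, i = [], 0
    while i < len(program):
        window = tuple(program[i:i + seq_len])
        if window in index:
            compressed.append(("CW", index[window]))  # codeword
            i += seq_len
        else:
            compressed.append(program[i])             # plain instruction
            i += 1
    return compressed, dictionary


def expand_program(compressed, dictionary):
    """Run-time view: each codeword is replaced by its dictionary entry."""
    program = []
    for item in compressed:
        if isinstance(item, tuple) and item[0] == "CW":
            program.extend(dictionary[item[1]])
        else:
            program.append(item)
    return program


code = ["ld r1", "add r1,r2", "st r1", "ld r1", "add r1,r2", "br"]
packed, table = compress_program(code)
print(packed, table)
```

A real size heuristic would also charge for the dictionary entry itself: in this example the code shrinks from six items to four, but the two-instruction dictionary entry cancels that gain.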

Symposium on Computer Architecture and High Performance Computing | 2006

Scalable Value-Cache Based Compression Schemes for Multiprocessors

Martin Thuresson; Per Stenström

Data link compression can efficiently compress the data stream between main memory and the processor chip in single processor systems. By dynamically updating a value cache on each side of the link with the most frequently transmitted values, frequent value encoding can compress the data stream by up to 70%. Unfortunately, the number of value caches needed grows quadratically with the number of nodes in multiprocessors, which causes a scalability problem. This paper shows that by sharing the caches between different pairs of communicating nodes, the frequent values stored at each node can be utilized more efficiently. For interconnects with point-to-point links, it is shown, however, that sharing of caches introduces overhead traffic for keeping the value caches consistent. If all misses in the shared cache are broadcast to all other nodes, the generated traffic becomes so large that it is better to transmit the values uncompressed. We propose and evaluate three techniques that aim at reducing this overhead and find that it is possible to remove most of this traffic, but at the cost of less efficient compression, so the final result is comparable to using dedicated value caches.
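
The mirrored value caches described above can be sketched for a single link as follows. This is a simplified illustration rather than the evaluated design: it uses least-recently-used replacement instead of tracking the most frequent values, the message tuples are hypothetical, and the cache count assumes one dedicated cache at each end of every node pair.

```python
class ValueCache:
    """Small cache of recent values, most recently used first."""

    def __init__(self, size):
        self.size = size
        self.values = []

    def lookup(self, value):
        return self.values.index(value) if value in self.values else None

    def touch(self, value):
        if value in self.values:
            self.values.remove(value)
        self.values.insert(0, value)
        del self.values[self.size:]   # evict the least recently used


def send(cache, value):
    """Sender side: emit a short index on a hit, the full value on a miss."""
    index = cache.lookup(value)
    cache.touch(value)
    return ("hit", index) if index is not None else ("miss", value)


def receive(cache, message):
    """Receiver side: mirror the sender's update so both caches stay equal."""
    kind, payload = message
    value = cache.values[payload] if kind == "hit" else payload
    cache.touch(value)
    return value


def dedicated_caches(nodes):
    """Caches needed if every node keeps one per peer (assumed model)."""
    return nodes * (nodes - 1)


tx, rx = ValueCache(2), ValueCache(2)
data = [5, 5, 9, 5, 7, 9]
messages = [send(tx, v) for v in data]
decoded = [receive(rx, m) for m in messages]
print(messages, decoded, dedicated_caches(16))
```

Because both ends apply the same update after every message, no extra traffic is needed to keep a dedicated pair consistent; the overhead discussed in the abstract arises only once a cache is shared among several communicating pairs.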

International Conference on Parallel Processing | 2008

Accommodation of the Bandwidth of Large Cache Blocks Using Cache/Memory Link Compression

Martin Thuresson; Per Stenström

The mismatch between processor and memory speed continues to make design issues for memory hierarchies important. While larger cache blocks can exploit more spatial locality, they increase the off-chip memory bandwidth, a scarce resource in future microprocessor designs. We show that it is possible to use larger block sizes without increasing the off-chip memory bandwidth by applying compression techniques to cache/memory block transfers. Since bandwidth is reduced by up to a factor of three, we propose to use larger blocks. While compression/decompression ends up on the critical memory access path, we find that its negative impact on the memory access latency is often dwarfed by the performance gains from larger block sizes. Our proposed scheme combines compression with a previous mechanism that dynamically chooses a larger cache block when the spatial locality makes it advantageous. This combined scheme consistently improves performance, by 19% on average.

High Performance Embedded Architectures and Compilers | 2006

Exposed Datapath for Efficient Computing

Magnus Björk; Magnus Själander; Lars Svensson; Martin Thuresson; John Hughes; Kjell Jeppson; Jonas Karlsson; Per Larsson-Edefors; Mary Sheeran; Per Stenström

Archive | 2006