Publication


Featured research published by Erik Hansson.


International Journal of Parallel Programming | 2013

Extensible Recognition of Algorithmic Patterns in DSP Programs for Automatic Parallelization

Amin Shafiee Sarvestani; Erik Hansson; Christoph W. Kessler

We introduce an extensible knowledge-based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool uses functionality provided by the Cetus compiler infrastructure to detect certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and the statements in their bodies, as these are often the performance-critical constructs in DSP applications for which replacement by highly optimized, target-specific parallel algorithms will be most profitable. For better structuring and efficiency of pattern recognition, we classify patterns by levels of complexity such that patterns at higher levels are defined in terms of lower-level patterns. The tool works statically on the intermediate representation. For better extensibility and abstraction, most of the structural part of the recognition rules is specified in XML form, separating the tool implementation from the pattern specifications. Information about detected patterns will later be used for optimized code generation by local algorithm replacement, e.g., for the low-power, high-throughput multicore DSP architecture ePUMA.
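
The paper ships no code here; as a rough illustration of the idea, the Python sketch below matches XML-specified pattern rules against a toy loop IR. The rule format, the `loop_ir` structure and the pattern names are invented for illustration only; the actual tool operates on the Java-based Cetus IR.

```python
# Hypothetical sketch of XML-driven idiom recognition on a toy loop IR.
# The real tool works on the Java-based Cetus IR; rule format and names are invented.
import xml.etree.ElementTree as ET

# Toy IR: each loop-body statement has been abstracted to a "kind" string,
# as a front end would derive it from the AST.
loop_ir = {
    "loop_var": "i",
    "body": [
        {"kind": "multiply-accumulate"},   # e.g., s += a[i] * b[i]
    ],
}

# A level-2 pattern ("dot-product") defined in terms of a level-1 pattern
# ("multiply-accumulate"), mirroring the hierarchical rule structure.
RULES_XML = """
<patterns>
  <pattern name="dot-product" level="2">
    <requires kind="multiply-accumulate"/>
  </pattern>
</patterns>
"""

def recognize(loop, rules_xml):
    """Return the names of all patterns whose required lower-level
    patterns all occur in the loop body."""
    kinds = {stmt["kind"] for stmt in loop["body"]}
    found = []
    for pattern in ET.fromstring(rules_xml).iter("pattern"):
        required = {req.get("kind") for req in pattern.iter("requires")}
        if required <= kinds:
            found.append(pattern.get("name"))
    return found

print(recognize(loop_ir, RULES_XML))   # ['dot-product']
```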


Automation, Robotics and Control Systems | 2012

Flexible scheduling and thread allocation for synchronous parallel tasks

Christoph W. Kessler; Erik Hansson

We describe a task model and a dynamic scheduling and resource allocation mechanism for synchronous parallel tasks to be executed on SPMD-programmed synchronous shared-memory MIMD parallel architectures with uniform, unit-time memory access and strict memory consistency, also known in the literature as PRAMs (Parallel Random Access Machines). Our task model provides a two-tier programming model for PRAMs that flexibly combines SPMD and fork-join parallelism within the same application. It offers flexibility through dynamic scheduling and late resource binding while preserving the PRAM execution properties within each task; the only limitation is that the number of threads that can be assigned to a task is bounded by what the underlying architecture provides. In particular, our approach opens up automatic performance tuning at run time by controlling the thread allocation for tasks based on run-time predictions. With a prototype implementation of a synchronous parallel task API in the SPMD-based PRAM language Fork and an experimental evaluation with example programs on the SBPRAM simulator, we show that a realization of the task model on an SPMD-programmable PRAM machine is feasible with moderate runtime overhead per task.
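
As a rough, hypothetical illustration of the late resource binding described above, the sketch below grants each ready task the minimum of its preferred thread count and the threads currently free on an assumed 64-thread PRAM machine. All names and numbers are invented; the real mechanism is a Fork-language API evaluated on the SBPRAM simulator.

```python
# Hypothetical sketch of late resource binding for synchronous parallel tasks:
# a ready task asks for a preferred number of threads and is granted at most
# what is currently free on an assumed 64-thread PRAM machine.
P = 64                 # assumed number of hardware threads
free_threads = P

def allocate(task_name, preferred):
    """Grant min(preferred, currently free) threads to a ready task."""
    global free_threads
    granted = min(preferred, free_threads)
    free_threads -= granted
    print(f"{task_name}: requested {preferred}, granted {granted}")
    return granted

def release(granted):
    """Return a finished task's threads to the pool."""
    global free_threads
    free_threads += granted

g1 = allocate("fft_task", 48)      # granted 48
g2 = allocate("filter_task", 32)   # only 16 free -> granted 16
release(g1)
g3 = allocate("matmul_task", 128)  # capped by the machine: granted 48
```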


International Symposium on Parallel and Distributed Processing and Applications | 2012

Design of the Language Replica for Hybrid PRAM-NUMA Many-core Architectures

Jari-Matti Mäkelä; Erik Hansson; Daniel Åkesson; Martti Forsell; Christoph W. Kessler; Ville Leppänen

Parallel programming is widely considered very demanding for the average programmer due to the inherent asynchrony of underlying parallel architectures. In this paper we describe the main design principles and core features of Replica -- a parallel language aimed at high-level programming of a new paradigm of reconfigurable, scalable and powerful synchronous shared memory architectures that promise to make parallel programming radically easier with the help of strict memory consistency and deterministic synchronous execution of hardware threads and multi-operations.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Hardware and Software Support for NUMA Computing on Configurable Emulated Shared Memory Architectures

Martti Forsell; Erik Hansson; Christoph W. Kessler; Jari-Matti Mäkelä; Ville Leppänen

Emulated shared memory (ESM) architectures are good candidates for future general-purpose parallel computers due to their ability to provide an easy-to-use, explicitly parallel, synchronous model of computation to programmers, as well as to avoid most of the performance bottlenecks present in current multicore architectures. In order to achieve full performance, however, applications must have enough thread-level parallelism (TLP). To address this problem, in earlier work we introduced a class of configurable emulated shared memory (CESM) machines that provides a special non-uniform memory access (NUMA) mode for situations where TLP is limited, or for direct compatibility with sequential legacy code or existing NUMA mechanisms. Unfortunately, the earlier proposed CESM architecture does not integrate the different modes well, e.g., the memories of the different modes are left isolated, and therefore the programming interface is not integrated either. In this paper we propose a number of hardware and software techniques to support NUMA computing in CESM architectures in a seamless way. The hardware techniques include three different NUMA shared memory access mechanisms, and the software techniques provide a mechanism to integrate NUMA computation into the standard parallel random access machine (PRAM) operation of the CESM. The hardware techniques are evaluated on our REPLICA CESM architecture and compared to an ideal CESM machine making use of the proposed software techniques.


Complex, Intelligent and Software Intensive Systems | 2011

Case Study of Efficient Parallel Memory Access Programming for the Embedded Heterogeneous Multicore DSP Architecture ePUMA

Erik Hansson; Joar Sohl; Christoph W. Kessler; Dake Liu

We consider the challenges of writing efficient code for ePUMA, a novel domain-specific heterogeneous multicore architecture with SIMD DSP slave cores, multi-banked on-chip vector register files for parallel access, and configurable permutation hardware that decouples memory access from computation. Suitable data layout in memory and in the vector registers, combined with the use of ePUMA's powerful addressing modes, is key to exploiting the SIMD units efficiently and achieving the throughput required for prospective applications in 4G mobile telecommunication and multimedia.
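
To make the data-layout point concrete, here is a small, hypothetical sketch of a classic skewed bank assignment: with B banks and bank(i, j) = (i + j) mod B, both a row and a column of a B x B matrix can be fetched without bank conflicts. B = 8 is an assumption, and ePUMA's actual LVM banking and permutation hardware are more general than this.

```python
# Hypothetical sketch of conflict-free parallel access via a skewed data layout:
# with B memory banks and bank(i, j) = (i + j) % B, every element of a row and
# every element of a column of a B x B matrix lands in a distinct bank, so either
# can be fetched in a single parallel access.
B = 8  # assumed number of banks

def bank(i, j):
    """Bank holding matrix element (i, j) under the skewed layout."""
    return (i + j) % B

row0_banks = [bank(0, j) for j in range(B)]   # banks touched by row 0
col0_banks = [bank(i, 0) for i in range(B)]   # banks touched by column 0
assert len(set(row0_banks)) == B              # no two row elements share a bank
assert len(set(col0_banks)) == B              # no two column elements share a bank
print(row0_banks, col0_banks)
```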


International Symposium on Computing and Networking | 2014

Global Optimization of Execution Mode Selection for the Reconfigurable PRAM-NUMA Multicore Architecture REPLICA

Erik Hansson; Christoph W. Kessler

The REPLICA architecture is a massively hardware-threaded very long instruction word (VLIW) architecture. REPLICA has two execution modes, PRAM and NUMA, supported by the underlying on-chip memory, which can be switched between at runtime. PRAM mode is considered the standard execution mode and mainly targets applications with very high thread-level parallelism (TLP). In contrast, NUMA mode is intended for sequential legacy applications and applications with a low amount of TLP, although in some cases very regular applications suit NUMA mode as well. However, there is a switching cost between the modes which is not negligible. We combine machine learning (symbolic regression) with a shortest-path formulation to optimize the software composition of parameterized stencil-like algorithms which have regular control flow and memory access patterns. Using the symbolic-regression tool Eureqa Pro and training data, we create execution-time predictors for parameterized software components. We use these predictors to formulate a shortest-path optimization problem that maps component executions onto the available modes (PRAM or NUMA). When composing three randomly selected components from an evaluation set, we obtain speedups of up to 2.9 times and an average speedup of 1.4, both including overhead. The overhead costs, which include running the predictors, solving the shortest-path problem, and switching to the selected runtime modes, amount to just a few percent.
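
The following is a minimal sketch, under assumed numbers, of the shortest-path formulation: for a chain of components, pick PRAM or NUMA per component so that predicted execution time plus mode-switching cost is minimized. The predicted times, the switching cost and the dynamic-programming code are invented stand-ins for the Eureqa Pro predictors and the solver used in the paper.

```python
# Hypothetical sketch of the shortest-path formulation: for a chain of
# components, choose PRAM or NUMA per component so that predicted execution
# time plus mode-switching cost is minimal. The times and switch cost below
# are invented; in the paper they come from Eureqa Pro predictors.
SWITCH_COST = 5.0
# predicted_time[k] = (time of component k in PRAM mode, in NUMA mode)
predicted_time = [(10.0, 18.0), (30.0, 12.0), (9.0, 16.0)]

def best_composition(times, switch_cost):
    """Dynamic program over the mode-selection graph (a shortest path)."""
    # dp[mode] = (best total cost of the prefix ending in `mode`, mode sequence)
    dp = {m: (times[0][m], [m]) for m in (0, 1)}        # 0 = PRAM, 1 = NUMA
    for step in times[1:]:
        new_dp = {}
        for m in (0, 1):
            candidates = [
                (dp[p][0] + step[m] + (switch_cost if p != m else 0.0),
                 dp[p][1] + [m])
                for p in (0, 1)
            ]
            new_dp[m] = min(candidates)
        dp = new_dp
    return min(dp.values())

cost, modes = best_composition(predicted_time, SWITCH_COST)
print(cost, ["PRAM" if m == 0 else "NUMA" for m in modes])   # 41.0, PRAM/NUMA/PRAM
```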


European Conference on Parallel Processing | 2014

Optimized Selection of Runtime Mode for the Reconfigurable PRAM-NUMA Architecture REPLICA Using Machine-Learning

Erik Hansson; Christoph W. Kessler

The massively hardware-multithreaded VLIW emulated shared memory (ESM) architecture REPLICA has a dynamically reconfigurable on-chip network that offers two execution modes: PRAM and NUMA. PRAM mode is mainly suitable for applications with a high amount of thread-level parallelism (TLP), while NUMA mode is mainly intended for accelerating the execution of sequential programs or programs with low TLP. Also, some types of regular data-parallel algorithms execute faster in NUMA mode. It is not obvious in which mode a given program region shows the best performance. In this study we focus on generic stencil-like computations exhibiting regular control flow and memory access patterns. We use two state-of-the-art machine-learning methods, C5.0 (decision trees) and Eureqa Pro (symbolic regression), to select which mode to use. We use these methods to derive different predictors based on the same training data and compare their results. The best derived predictors reach an accuracy of 95% and are generated by both C5.0 and Eureqa Pro, although the latter can in some cases be more sensitive to the training data. The average speedup gained by mode switching ranges from 1.92 to 2.23 for all generated predictors on the evaluation test cases, and by using a majority-voting algorithm based on the three best predictors we can eliminate all misclassifications.
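
A minimal sketch of the majority-voting step follows, with three invented stand-in predictors in place of the trained C5.0 and Eureqa Pro models: each votes PRAM or NUMA for a given configuration and the majority decides (with three voters over two classes there is always a strict majority).

```python
# Minimal sketch of majority voting over three mode predictors.
# The predictor functions are invented stand-ins for the trained models.
from collections import Counter

def predictor_a(size, tlp):            # hypothetical decision-tree-style rule
    return "PRAM" if tlp > 512 else "NUMA"

def predictor_b(size, tlp):            # hypothetical regression-style rule
    return "PRAM" if 0.9 * tlp > 450 else "NUMA"

def predictor_c(size, tlp):            # hypothetical combined-feature rule
    return "PRAM" if size * tlp > 1_000_000 else "NUMA"

def select_mode(size, tlp):
    """Return the mode chosen by the majority of the three predictors."""
    votes = [p(size, tlp) for p in (predictor_a, predictor_b, predictor_c)]
    return Counter(votes).most_common(1)[0][0]

print(select_mode(size=4096, tlp=600))   # -> 'PRAM' (all three agree here)
```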


ARCS Workshops | 2014

A Quantitative Comparison of PRAM based Emulated Shared Memory Architectures to Current Multicore CPUs and GPUs

Erik Hansson; Erik Alnervik; Christoph W. Kessler; Martti Forsell


International Journal of Networking and Computing | 2014

NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures

Martti Forsell; Erik Hansson; Christoph W. Kessler; Jari-Matti Mäkelä; Ville Leppänen


Parallel and Distributed Processing Techniques and Applications | 2012

Exploiting Instruction Level Parallelism for REPLICA - A Configurable VLIW Architecture With Chained Functional Units

Martin Kessler; Erik Hansson; Daniel Åkesson; Christoph W. Kessler

Collaboration


Dive into Erik Hansson's collaborations.

Top Co-Authors

Martti Forsell

Royal Institute of Technology

Dake Liu

Beijing Institute of Technology
