Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Magnus Jahre is active.

Publications


Featured research published by Magnus Jahre.


Field Programmable Gate Arrays | 2017

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Yaman Umuroglu; Nicholas J. Fraser; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers

Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.
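The arithmetic trick that makes binarized networks so hardware-friendly is that a dot product over {-1, +1} values reduces to XNOR and popcount. The sketch below illustrates that equivalence in plain Python; it is not FINN code, and the function names are ours.

```python
# Illustrative sketch: binary dot product via XNOR-popcount,
# the core operation a BNN accelerator maps to hardware.

def binarize(x):
    """Map real values to {-1, +1} by sign."""
    return [1 if v >= 0 else -1 for v in x]

def encode(bits):
    """Pack {-1, +1} values into 0/1 bits: +1 -> 1, -1 -> 0."""
    return [(b + 1) // 2 for b in bits]

def xnor_popcount_dot(w, a):
    """Dot product of two {-1, +1} vectors using XNOR and popcount."""
    n = len(w)
    wb, ab = encode(w), encode(a)
    # XNOR is 1 exactly where the bits agree, i.e. where the
    # product w_i * a_i would be +1.
    agree = sum(1 for x, y in zip(wb, ab) if x == y)
    # dot = (#agreements) - (#disagreements) = 2 * agree - n
    return 2 * agree - n

w = binarize([0.3, -1.2, 0.7, -0.1])
a = binarize([-0.5, -0.9, 0.4, 0.2])
print(xnor_popcount_dot(w, a))  # equals sum(wi * ai)
```

On an FPGA the XNOR and popcount become a handful of LUTs, which is why per-layer compute resources can be tailored so finely.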


IEEE Transactions on Education | 2009

Experimental Validation of the Learning Effect for a Pedagogical Game on Computer Fundamentals

Guttorm Sindre; Lasse Natvig; Magnus Jahre

The question/answer-based computer game Age of Computers was introduced to replace traditional weekly paper exercises in a course in computer fundamentals in 2003. Questionnaire evaluations and observation of student behavior have indicated that the students found the game more motivating than paper exercises and that a majority of the students also perceived the game to have a higher learning effect than paper exercises or textbook reading. This paper reports on a controlled experiment to compare the learning effectiveness of game play with traditional paper exercises, as well as with textbook reading. The results indicated that with equal time being spent on the various learning activities, the effect of game play was only equal to that of the other activities, not better. Yet this result is promising enough, as the increased motivation means that students work harder in the course. Also, the results indicate that the game has potential for improvement, in particular with respect to its feedback on the more complicated questions.


Computing Frontiers | 2009

A light-weight fairness mechanism for chip multiprocessor memory systems

Magnus Jahre; Lasse Natvig

Chip Multiprocessor (CMP) memory systems suffer from the effects of destructive thread interference. This interference reduces performance predictability because it depends heavily on the memory access pattern and intensity of the co-scheduled threads. In this work, we confirm that all shared units must be thread-aware in order to provide memory system fairness. However, the current proposals for fair memory systems are complex as they require an interference measurement mechanism and a fairness enforcement policy for all hardware-controlled shared units. Furthermore, they often sacrifice system throughput to reach their fairness goals which is not desirable in all systems. In this work, we show that our novel fairness mechanism, called the Dynamic Miss Handling Architecture (DMHA), is able to reduce implementation complexity by using a single fairness enforcement policy for the complete hardware-managed shared memory system. Specifically, it controls the total miss bandwidth available to each thread by dynamically manipulating the number of Miss Status Holding Registers (MSHRs) available in each private data cache. When fairness is chosen as the metric of interest and we compare to a state-of-the-art fairness-aware memory system, DMHA improves fairness by 26% on average with the single program baseline. With a different configuration, DMHA improves throughput by 13% on average compared to a conventional memory system.
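The lever DMHA pulls is the MSHR count: a cache can only have as many outstanding misses as it has Miss Status Holding Registers, so shrinking a thread's MSHR pool throttles its miss bandwidth. The following is a toy software model of that idea, assuming an invented rebalancing rule; it is not the paper's actual enforcement policy.

```python
# Toy model of MSHR-based miss-bandwidth throttling. The class and the
# rebalance() heuristic are illustrative, not the DMHA policy itself.

class PrivateCache:
    def __init__(self, max_mshrs):
        self.max_mshrs = max_mshrs   # outstanding-miss limit for this thread
        self.outstanding = 0

    def can_issue_miss(self):
        return self.outstanding < self.max_mshrs

    def issue_miss(self):
        if not self.can_issue_miss():
            return False             # miss stalls until an MSHR frees up
        self.outstanding += 1
        return True

    def miss_complete(self):
        self.outstanding -= 1

def rebalance(caches, interference):
    """Invented fairness heuristic: halve the MSHRs of the most
    interfering thread, grant one more to the least interfering one."""
    worst = max(range(len(caches)), key=lambda i: interference[i])
    best = min(range(len(caches)), key=lambda i: interference[i])
    caches[worst].max_mshrs = max(1, caches[worst].max_mshrs // 2)
    caches[best].max_mshrs += 1
```

The appeal of this single knob is that it sits entirely in each private cache, so no shared unit needs its own interference measurement and enforcement machinery.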


High Performance Embedded Architectures and Compilers | 2017

Scaling Binarized Neural Networks on Reconfigurable Logic

Nicholas J. Fraser; Yaman Umuroglu; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers

Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in higher classification rates. Towards this end, the FINN framework was recently proposed for building fast and flexible field programmable gate array (FPGA) accelerators for BNNs. FINN utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. However, FINN was not evaluated on larger topologies due to the size of the chosen FPGA, and exhibited decreased accuracy due to lack of padding. In this paper, we improve upon FINN to show how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy. Based on this technique, we demonstrate numerous experiments to illustrate flexibility and scalability of the approach. In particular, we show that a large BNN requiring 1.2 billion operations per frame running on an ADM-PCIE-8K5 platform can classify images at 12 kFPS with 671 μs latency while drawing less than 41 W board power and classifying CIFAR-10 images at 88.7% accuracy. Our implementation of this network achieves 14.8 trillion operations per second. We believe this is the fastest classification rate reported to date on this benchmark at this level of accuracy.
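The padding difficulty the abstract alludes to is that zero is not representable in a {-1, +1} datapath. One way to keep the datapath 1-bit is to pad with one of the two binary values instead, e.g. -1; the sketch below shows that, under the assumption that this simplified scheme captures the spirit (not necessarily the exact details) of the paper's technique.

```python
def pad_binary_image(img, pad=1, value=-1):
    """Pad a 2D {-1, +1} feature map with a value that is itself
    representable in one bit (here -1), keeping the datapath binary.
    Zero-padding would require a third symbol and a wider datapath."""
    h = len(img)
    w = len(img[0])
    out = [[value] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            out[r + pad][c + pad] = img[r][c]
    return out
```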


Field Programmable Logic and Applications | 2015

Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform

Yaman Umuroglu; Donn Morrison; Magnus Jahre

Large and sparse small-world graphs are ubiquitous across many scientific domains from bioinformatics to computer science. As these graphs grow in scale, traversal algorithms such as breadth-first search (BFS), fundamental to many graph processing applications and metrics, become more costly to compute. The cause is attributed to poor temporal and spatial locality due to the inherently irregular memory access patterns of these algorithms. A large body of research has targeted accelerating and parallelizing BFS on a variety of computing platforms, including hybrid CPU-GPU approaches for exploiting the small-world property. In the same spirit, we show how a single-die FPGA-CPU heterogeneous device can be used to leverage the varying degree of parallelism in small-world graphs. Additionally, we demonstrate how dense rather than sparse treatment of the BFS frontier vector yields simpler memory access patterns for BFS, trading redundant computation for DRAM bandwidth utilization and faster graph exploration. On a range of synthetic small-world graphs, our hybrid approach performs 7.8× better than software-only and 2× better than accelerator-only implementations. We achieve an average traversal speed of 172 MTEPS (millions of traversed edges per second) on the ZedBoard platform, which is more than twice as effective as the best previously published FPGA BFS implementation in terms of traversals per bandwidth.
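The dense-frontier idea can be seen in a few lines of software: the frontier is a full bit-vector scanned linearly every level, trading redundant work on inactive vertices for sequential, DRAM-friendly access. This is an illustrative sketch, not the paper's FPGA implementation.

```python
# Dense-frontier BFS sketch: adj is an adjacency list, and the frontier
# is a boolean vector over all vertices rather than a sparse queue.

def bfs_dense(adj, source):
    n = len(adj)
    dist = [-1] * n              # -1 marks unvisited vertices
    frontier = [False] * n
    dist[source] = 0
    frontier[source] = True
    level = 0
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):       # linear scan: simple, streaming accesses
            if not frontier[v]:
                continue         # redundant work a sparse queue would skip
            for u in adj[v]:
                if dist[u] == -1:
                    dist[u] = level + 1
                    next_frontier[u] = True
        frontier = next_frontier
        level += 1
    return dist
```

A sparse queue touches only active vertices but produces scattered accesses; the dense scan wastes iterations yet streams through memory, which is the bandwidth-for-computation trade the abstract describes.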


Computing | 2017

The READEX formalism for automatic tuning for energy efficiency

Joseph Schuchart; Michael Gerndt; Per Gunnar Kjeldsberg; Michael Lysaght; David Horák; Lubomír Říha; Andreas Gocht; Mohammed Sourouri; Madhura Kumaraswamy; Anamika Chowdhury; Magnus Jahre; Kai Diethelm; Othman Bouizi; Umbreen Sabir Mian; Jakub Kružík; Radim Sojka; Martin Beseda; Venkatesh Kannan; Zakaria Bendifallah; Daniel Hackenberg; Wolfgang E. Nagel

Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy cost. Although High performance computing (HPC) applications are compute centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today’s clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union’s Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach for improved energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the compute spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving the energy efficiency in embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of this approach.
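The dynamism READEX exploits can be pictured as choosing a CPU frequency per program region from its measured characteristics. The toy classifier below is only an illustration of that idea: the thresholds, frequencies, and the use of arithmetic intensity as the sole signal are all invented here, not taken from the project.

```python
# Toy per-region DVFS selection. All numbers are illustrative.

FREQS_GHZ = [1.2, 2.0, 2.8]

def pick_frequency(ops_per_byte):
    """Memory-bound regions gain little from high frequency (they wait on
    DRAM), while compute-bound regions need it. Classify each region by
    its arithmetic intensity (operations per byte of memory traffic)."""
    if ops_per_byte < 1.0:       # memory-bound region
        return FREQS_GHZ[0]
    if ops_per_byte < 10.0:      # mixed region
        return FREQS_GHZ[1]
    return FREQS_GHZ[2]          # compute-bound region
```

In the actual READEX design, such decisions are derived at design time with the Periscope Tuning Framework and applied at runtime per region, rather than hard-coded as above.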


High Performance Computing and Communications | 2009

A Quantitative Study of Memory System Interference in Chip Multiprocessor Architectures

Magnus Jahre; Marius Grannæs; Lasse Natvig

The potential for destructive interference between running processes is increased as Chip Multiprocessors (CMPs) share more on-chip resources. We believe that understanding the nature of memory system interference is vital to achieve good fairness/complexity/performance trade-offs in CMPs. Our goal in this work is to quantify the latency penalties due to interference in all hardware-controlled, shared units (i.e. the on-chip interconnect, shared cache and memory bus). To achieve this, we simulate a wide variety of realistic CMP architectures. In particular, we vary the number of cores, interconnect topology, shared cache size and off-chip memory bandwidth. We observe that interference in the off-chip memory bus accounts for between 63% and 87% of the total interference impact while the impact of cache capacity interference can be lower than indicated by previous studies (between 5% and 32% of the total impact). In addition, as much as 11% of the total impact can be due to uncontrolled allocation of shared cache Miss Status Holding Registers (MSHRs).


Computing | 2015

ParVec: vectorizing the PARSEC benchmark suite

Juan M. Cebrian; Magnus Jahre; Lasse Natvig

Energy efficiency has recently replaced performance as the main design goal for microprocessors across all market segments. Vectorization, parallelization, specialization and heterogeneity are the key approaches that both academia and industry embrace to make energy efficiency a reality. New architectural proposals are validated against real applications in order to ensure correctness and perform performance and energy evaluations. However, keeping up with architectural changes while maintaining similar workloads and algorithms (for comparative purposes) becomes a real challenge. If benchmarks are optimized for certain features and not for others, architects may end up overestimating the impact of certain techniques and underestimating others. The main contribution of this work is a detailed description and evaluation of ParVec, a vectorized version of the PARSEC benchmark suite (as a case study of a commonly used application set). ParVec can target SSE, AVX and NEON™ SIMD architectures by means of custom vectorization and math libraries. The performance and energy efficiency improvements from vectorization depend greatly on the fraction of code that can be vectorized. Vectorization-friendly benchmarks obtain up to 10× improvement.
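Why the vectorizable fraction dominates the outcome follows from an Amdahl-style bound: the scalar remainder limits overall speedup no matter how wide the SIMD unit. The helper below states that bound; the function name and the sample numbers are ours, not ParVec's.

```python
def vector_speedup(f_vec, lanes):
    """Amdahl-style bound on whole-program speedup when a fraction
    f_vec of the runtime is vectorizable across `lanes` SIMD lanes;
    the scalar remainder (1 - f_vec) runs at the original speed."""
    return 1.0 / ((1.0 - f_vec) + f_vec / lanes)

# Even with 8-wide SIMD, 90% vectorizable code caps out well below 8x:
print(round(vector_speedup(0.9, 8), 2))   # 4.71
```

This is why benchmarks optimized for some features but not others can mislead architects: a low vectorizable fraction in the workload, not the hardware, may be what bounds the measured gain.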


High Performance Embedded Architectures and Compilers | 2011

A high performance adaptive miss handling architecture for chip multiprocessors

Magnus Jahre; Lasse Natvig


International Conference on Computer Design | 2008

Low-cost open-page prefetch scheduling in chip multiprocessors

Marius Grannæs; Magnus Jahre; Lasse Natvig


Collaboration


Dive into Magnus Jahre's collaborations.

Top Co-Authors

Lasse Natvig
Norwegian University of Science and Technology

Yaman Umuroglu
Norwegian University of Science and Technology

Marius Grannæs
Norwegian University of Science and Technology

Alexandru C. Iordan
Norwegian University of Science and Technology

Ananya Muddukrishna
Norwegian University of Science and Technology

Lester Kalms
Dresden University of Technology

Jan Christian Meyer
Norwegian University of Science and Technology

Antonio Paolillo
Université catholique de Louvain

Diana Göhringer
Dresden University of Technology

Tobias Kalb
Ruhr University Bochum