
Publication


Featured research published by Mads Ruben Burgdorff Kristensen.


Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model | 2010

Numerical Python for scalable architectures

Mads Ruben Burgdorff Kristensen; Brian Vinter

In this paper, we introduce DistNumPy, a library for doing numerical computation in Python that targets scalable distributed memory architectures. DistNumPy extends the NumPy module [15], which is popular for scientific programming. Replacing NumPy with DistNumPy enables the user to write sequential Python programs that seamlessly utilize distributed memory architectures. This feature is obtained by introducing a new backend for NumPy arrays, which distributes data amongst the nodes of a distributed memory multi-processor. All operations on this new array seek to utilize all available processors, and the array itself is distributed between multiple processors in order to support larger arrays than a single node can hold in memory. We perform three experiments with sequential Python programs running on an Ethernet-based cluster of SMP nodes with a total of 64 CPU cores. The results show an 88% CPU utilization for a Monte Carlo simulation, 63% for an N-body simulation, and a more modest 50% for a Jacobi solver. The primary limitation on CPU utilization is identified as SMP effects within each node rather than the distribution aspect. Based on the experiments, we find that it is possible to obtain significant speedup with our new array backend without changing the original Python code.
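The Monte Carlo benchmark illustrates the programming model well. The code below is an ordinary sequential NumPy estimate of pi (a plain-NumPy sketch, not DistNumPy's own code), written purely as array operations so that a distributed array backend like DistNumPy could execute it unchanged:

```python
import numpy as np

def monte_carlo_pi(n, seed=0):
    """Estimate pi from n random points in the unit square.

    Written as plain array operations -- the style DistNumPy targets:
    swapping the array backend would distribute x, y, and the
    reduction across nodes without changing this code.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    y = rng.random(n)
    inside = (x * x + y * y) <= 1.0   # element-wise, trivially parallel
    return 4.0 * inside.mean()        # global reduction

estimate = monte_carlo_pi(1_000_000)  # close to 3.1416
```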


International Parallel and Distributed Processing Symposium | 2014

Bohrium: A Virtual Machine Approach to Portable Parallelism

Mads Ruben Burgdorff Kristensen; Simon Andreas Frimann Lund; Troels Blum; Kenneth Skovhede; Brian Vinter

In this paper we introduce Bohrium, a runtime system for mapping vector operations onto a number of different hardware platforms, from simple multi-core systems to clusters and GPU-enabled systems. In order to make efficient choices, Bohrium is implemented as a virtual machine that makes runtime decisions, rather than as a statically compiled library, which is the more common approach. In principle, Bohrium can be used with any programming language, but for now the supported languages are limited to Python, C++, and the .NET framework (e.g., C# and F#). The primary success criteria are to maintain a complete abstraction from low-level details and to provide efficient code execution across different current and future processors. We evaluate the presented design through a setup that targets a multi-core CPU, an eight-node cluster, and a GPU, all preliminary prototypes. The evaluation includes three well-known benchmark applications, Black-Scholes, Shallow Water, and N-body, implemented in C++, Python, and C#, respectively.
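The virtual-machine idea can be sketched in a few lines. The bytecode format and names below are illustrative, not Bohrium's actual interfaces: the front end records array operations as instructions, and a backend chosen at runtime executes them (plain NumPy stands in here for the CPU, GPU, and cluster backends):

```python
import numpy as np

# Illustrative sketch, not Bohrium's real API: a backend is a table of
# vector operations; the "virtual machine" dispatches bytecode to it
# at runtime instead of compiling against one backend statically.
OPS = {"add": np.add, "mul": np.multiply}

def run(bytecode, inputs, backend=OPS):
    """Execute a list of (op, out, in1, in2) instructions."""
    env = dict(inputs)
    for op, out, a, b in bytecode:
        env[out] = backend[op](env[a], env[b])   # runtime dispatch
    return env

prog = [("mul", "t0", "x", "x"),    # t0 = x * x
        ("add", "y", "t0", "x")]    # y  = x * x + x
env = run(prog, {"x": np.arange(4.0)})           # y == [0, 2, 6, 12]
```

Swapping `backend` for a table of GPU or cluster operations would change where the program runs without touching the bytecode, which is the portability argument the paper makes.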


International Conference on Parallel Architectures and Compilation Techniques | 2016

Fusion of Parallel Array Operations

Mads Ruben Burgdorff Kristensen; Simon Andreas Frimann Lund; Troels Blum; James Emil Avery

We address the problem of fusing array operations based on criteria such as shape compatibility, data reuse, and minimizing communication. We formulate the problem as a partitioning problem (WSP) that is general enough to handle loop fusion, combinator fusion, and other types of fusion analysis. Traditionally, when optimizing for data reuse, the fusion problem has been formulated as a static weighted graph partitioning problem (known as the Weighted Loop Fusion problem). We show that this scheme cannot accurately track data reuse between multiple independent loops, since it overestimates total data reuse in certain cases. Our formulation in terms of partitions allows the use of realistic cost functions that track resource usage accurately. We give correctness proofs and prove that WSP can maximize data reuse in programs exactly, in contrast to prior work. For the exact optimal solution, which is NP-hard to find, we present a branch-and-bound algorithm together with a polynomial-time preconditioner that reduces the problem size significantly in practice. We further present a polynomial-time greedy approximation that is fast enough to use for JIT compilation and gives near-optimal results in practice. All algorithms have been implemented in the automatic parallelization platform Bohrium, run on a set of benchmarks, and compared to existing methods from the literature.
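What fusion of array operations buys can be illustrated as follows (a conceptual sketch, not the WSP algorithm itself): fusing two element-wise operations into one blocked sweep removes a full-size temporary and reuses data while it is still in cache:

```python
import numpy as np

# Unfused: two full sweeps over the arrays and a full-size temporary.
def unfused(a, b):
    t = a + b          # sweep 1 materializes t
    return t * a       # sweep 2 re-reads a and t from memory

# Fused (sketch): one blocked sweep; the combined kernel reuses a's
# values while they are still in cache and never builds a big temp.
def fused(a, b, block=1024):
    out = np.empty_like(a)
    for i in range(0, a.size, block):
        s = slice(i, i + block)
        out[s] = (a[s] + b[s]) * a[s]
    return out

a, b = np.arange(5.0), np.ones(5)
assert np.allclose(unfused(a, b), fused(a, b))
```

Deciding which operations to group this way, under realistic cost functions, is exactly the partitioning problem the paper formalizes.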


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

Mads Ruben Burgdorff Kristensen; Brian Vinter

This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer-science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks, allowing maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for six benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as a factor of 27, from a maximum of 54% to only 2% of the total execution time, in a stencil application.
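The core scheduling idea can be sketched as a toy dependency-aware scheduler (illustrative only, not DistNumPy's implementation): communication operations are issued as early as their dependencies allow, and independent compute work fills the time while the transfer is in flight:

```python
def schedule(ops, deps):
    """Order operations so communication starts as early as possible.

    ops:  {name: "comm" or "comp"} -- the kind of each operation.
    deps: {name: set of prerequisite names} -- data dependencies.
    Among the operations whose dependencies are satisfied, 'comm'
    ops are issued first, so 'comp' ops overlap the transfer.
    """
    done, order, pending = set(), [], dict(deps)
    while pending:
        ready = [n for n, d in pending.items() if d <= done]
        ready.sort(key=lambda n: ops[n] != "comm")  # comm first
        n = ready[0]
        order.append(n)
        done.add(n)
        del pending[n]
    return order

# A stencil step: exchange halo cells, update the interior (which
# needs no remote data), then update the border once the halo arrives.
ops = {"halo": "comm", "inner": "comp", "border": "comp"}
deps = {"halo": set(), "inner": set(), "border": {"halo"}}
order = schedule(ops, deps)   # ['halo', 'inner', 'border']
```

The lazy evaluation in the paper's model supplies exactly the `deps` information above: because operations are queued rather than executed eagerly, the runtime can see which work is independent of a pending transfer.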


International Parallel and Distributed Processing Symposium | 2009

GPAW optimized for Blue Gene/P using hybrid programming

Mads Ruben Burgdorff Kristensen; Hans Henrik Happe; Brian Vinter

In this work we present optimizations of a grid-based projector-augmented wave method software, GPAW [1], for the Blue Gene/P architecture. The improvements are achieved by exploiting the advantages of combining shared and distributed memory programming, also known as hybrid programming. The work focuses on optimizing a very time-consuming operation in GPAW, the finite-difference stencil operation, and evaluates different hybrid programming approaches. The work succeeds in demonstrating a hybrid programming model that is clearly beneficial compared to the original flat programming model. In total, a speedup of 1.94 over the original implementation is obtained. The results we demonstrate here are reasonably general and may be applied to other finite-difference codes.
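The optimized hot spot is a finite-difference stencil; a minimal NumPy sketch of one 2-D Jacobi-style sweep is shown below. In the hybrid model, MPI ranks would each own a sub-grid and threads would share the sweep within a rank; the decomposition itself is omitted here:

```python
import numpy as np

def stencil_sweep(u):
    """One 2-D finite-difference sweep: each interior point becomes
    the average of its four neighbors.  In the hybrid version this
    sweep is split across MPI ranks and shared by threads per rank."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                            + u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((5, 5))
u[2, 2] = 4.0
v = stencil_sweep(u)   # v[1, 2] == 1.0: the spike spreads to neighbors
```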


International Parallel and Distributed Processing Symposium | 2014

Transparent GPU Execution of NumPy Applications

Troels Blum; Mads Ruben Burgdorff Kristensen; Brian Vinter

In this work, we present a back-end for the Python library NumPy that utilizes the GPU seamlessly. We use dynamic code generation to generate kernels, and data is moved transparently to and from the GPU. For the integration into NumPy, we use the Bohrium runtime system. Bohrium hooks into NumPy through the implicit data parallelization of array operations; this approach requires no annotations or other code modifications. The key motivation for our GPU computation back-end is to transform high-level Python/NumPy applications into low-level GPU-executable kernels, with the goal of obtaining high performance, high productivity, and high portability (HP3). We provide a performance study of the GPU back-end that includes four well-known benchmark applications, Black-Scholes, Successive Over-Relaxation, Shallow Water, and N-body, implemented in pure Python/NumPy. We demonstrate an impressive 834-times speedup for the Black-Scholes application and an average speedup of 124 times across the four benchmarks.
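Black-Scholes, the benchmark with the largest reported speedup, is written purely as array operations. The sketch below uses plain NumPy (with `math.erf` vectorized for the normal CDF); this is the style of code the GPU back-end would turn into generated kernels, with no annotations in the source:

```python
import numpy as np
from math import erf

def cnd(x):
    """Standard normal CDF, vectorized over a NumPy array."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def bs_call(s, k, t, r, sigma):
    """Black-Scholes European call price as pure array operations."""
    d1 = (np.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * np.sqrt(t))
    d2 = d1 - sigma * np.sqrt(t)
    return s * cnd(d1) - k * np.exp(-r * t) * cnd(d2)

# At-the-money call: spot 100, strike 100, 1 year, 5% rate, 20% vol.
price = bs_call(np.array([100.0]), 100.0, 1.0, 0.05, 0.2)  # ~10.45
```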


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Battling Memory Requirements of Array Programming Through Streaming

Mads Ruben Burgdorff Kristensen; James Emil Avery; Troels Blum; Simon Andreas Frimann Lund; Brian Vinter

A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations entirely without loops, while most efficient on small input, can lead to explosions in memory use. The present paper presents a solution to this problem using array streaming, implemented in the automatically parallelizing high-performance framework Bohrium. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine capacity, since the automatic streaming eliminates the temporary memory overhead by performing calculations in per-thread registers.
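The streaming idea can be sketched by hand (Bohrium performs this automatically): a reduction such as `((a + b) ** 2).sum()`, evaluated naively, materializes a full-size temporary for `a + b`, while a blocked evaluation keeps the temporary bounded by the block size:

```python
import numpy as np

def streamed_sum_sq(a, b, block=4096):
    """Compute ((a + b) ** 2).sum() without any full-size temporary.

    The naive expression allocates an array as large as a for the
    intermediate a + b; here only `block` elements exist at a time,
    so the apparent memory requirement never materializes.
    """
    total = 0.0
    for i in range(0, a.size, block):
        s = slice(i, i + block)
        total += ((a[s] + b[s]) ** 2).sum()
    return total

a = np.ones(10_000)
b = np.full(10_000, 2.0)
assert np.isclose(streamed_sum_sq(a, b), ((a + b) ** 2).sum())
```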


International Parallel and Distributed Processing Symposium | 2012

PGAS for Distributed Numerical Python Targeting Multi-core Clusters

Mads Ruben Burgdorff Kristensen; Yili Zheng; Brian Vinter

In this paper we propose a parallel programming model that combines two well-known execution models: Single Instruction, Multiple Data (SIMD) and Single Program, Multiple Data (SPMD). The combined model supports SIMD-style data parallelism in a global address space and SPMD-style task parallelism in a local address space. One of the most important features of the combined model is that data communication is expressed by global data assignments instead of message passing. We implement this combined programming model in Python, making parallel programming with Python both highly productive and well-performing on distributed memory multi-core systems. We base the SIMD data parallelism on DistNumPy, an auto-parallelizing version of the Numerical Python (NumPy) package that allows sequential NumPy programs to run on distributed memory architectures. We implement the SPMD task parallelism as an extension to DistNumPy that enables each process to have direct access to the local part of a shared array. To harvest the multi-core benefits of modern processors we exploit multi-threading in both the SIMD and SPMD execution models. The multi-threading is completely transparent to the user -- it is implemented in the runtime with OpenMP and by using multi-threaded libraries when available. We evaluate the implementation of the combined programming model with several scientific computing benchmarks on two representative multi-core distributed memory systems -- an Intel Nehalem cluster with InfiniBand interconnects and a Cray XE6 supercomputer -- at up to 1536 cores. The benchmarking results demonstrate good, scalable performance.
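The key feature, communication as global data assignment, can be mimicked with a toy wrapper (illustrative only, not DistNumPy's real API): assigning one slice of a global array to another expresses the data movement that message passing would otherwise make explicit:

```python
import numpy as np

class GlobalArray:
    """Toy stand-in for a PGAS-style distributed array.

    Here the data is one local NumPy array; in the real model each
    slice may live on a different node, and slice assignment triggers
    whatever communication is needed behind the scenes.
    """
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def __getitem__(self, s):
        return self.data[s]

    def __setitem__(self, s, value):
        # Would become a remote transfer when src/dst are on
        # different nodes; no send/recv calls appear in user code.
        self.data[s] = value

g = GlobalArray(np.zeros(8))
h = GlobalArray(np.arange(8))
g[0:4] = h[4:8]        # communication expressed as plain assignment
# g.data[:4] is now [4. 5. 6. 7.]
```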


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Automatic mapping of array operations to specific architectures

Simon Andreas Frimann Lund; Mads Ruben Burgdorff Kristensen; Brian Vinter

Array-oriented programming has been around for about thirty years and provides a fundamental abstraction for scientific computing. However, a wealth of popular programming languages in existence fail to provide convenient high-level abstractions and to exploit parallelism; one reason is that hardware is an ever-moving target. For this purpose, we introduce CAPE, a C-targeting Array Processing Engine, which manages the concerns of optimizing and parallelizing the execution of array operations. It is intended as a backend for new and existing languages and provides a portable runtime with a C interface. The performance of the implementation is studied in relation to high-level implementations of a set of applications, kernels, and synthetic benchmarks in Python/NumPy as well as low-level implementations in C/C++. We show the performance improvement over the high-productivity environment and how close the implementation comes to handcrafted C/C++ code.


European Conference on Parallel Processing | 2014

Bypassing the Conventional Software Stack Using Adaptable Runtime Systems

Simon Andreas Frimann Lund; Mads Ruben Burgdorff Kristensen; Brian Vinter; Dimitrios Katsaros

High-level languages such as Python offer convenient language constructs and abstractions for readability and productivity. These features, together with Python's ability to serve both as a steering language and as a self-contained language for scientific computations, have made Python a viable choice for high-performance computing. However, the Python interpreter's reliance on shared objects and dynamic loading causes scalability issues that, at large scale, consume hours of wall-clock time just for loading the interpreter.

Collaboration


Dive into Mads Ruben Burgdorff Kristensen's collaborations.

Top Co-Authors

Brian Vinter (University of Copenhagen)
Troels Blum (University of Copenhagen)
Dion Häfner (University of Copenhagen)
Johannes Lund (University of Copenhagen)
Renè Jacobsen (University of Copenhagen)