Publications


Featured research published by Bo Joel Svensson.


declarative aspects and applications of multicore programming | 2012

Expressive array constructs in an embedded GPU kernel programming language

Koen Claessen; Mary Sheeran; Bo Joel Svensson

Graphics Processing Units (GPUs) are powerful computing devices that, with the advent of CUDA/OpenCL, are becoming useful for general-purpose computations. Obsidian is an embedded domain specific language that generates CUDA kernels from functional descriptions. A symbolic array construction allows us to guarantee that intermediate arrays are fused away. However, the current array construction has some drawbacks; in particular, arrays cannot be combined efficiently. We add a new type of push arrays to the existing Obsidian system in order to solve this problem. The two array types complement each other, and enable the definition of combinators that both take apart and combine arrays, and that result in efficient generated code. This extension to Obsidian is demonstrated on a sequence of sorting kernels, with good results. The case study also illustrates the use of combinators for expressing the structure of parallel algorithms. The work is preliminary, and the combinators must be generalised. However, the raw speed of the generated kernels bodes well.
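The pull/push distinction at the heart of this paper can be sketched in plain Haskell. This is a simplified pure-list model, not the actual Obsidian API (which generates CUDA); the names `Pull`, `Push`, `push`, `append`, and `freeze` are illustrative only.

```haskell
import Data.List (sortOn)

-- Pull array: a length plus an index function. Easy to index and zip,
-- but appending two pull arrays puts a conditional into every lookup.
data Pull a = Pull Int (Int -> a)

-- Push array: a length plus a loop that feeds (index, element) pairs to
-- a writer. Appending is cheap: run both loops, offsetting the second.
data Push a = Push Int ((Int -> a -> [(Int, a)]) -> [(Int, a)])

-- Every pull array is trivially a push array.
push :: Pull a -> Push a
push (Pull n f) = Push n (\w -> concatMap (\i -> w i (f i)) [0 .. n - 1])

-- Combining push arrays needs no per-element conditional.
append :: Push a -> Push a -> Push a
append (Push m p) (Push n q) =
  Push (m + n) (\w -> p w ++ q (\i a -> w (m + i) a))

-- "Write to memory": collect all writes and order them by index.
freeze :: Push a -> [a]
freeze (Push _ p) = map snd (sortOn fst (p (\i a -> [(i, a)])))
```

For example, `freeze (append (push (Pull 2 (* 10))) (push (Pull 2 id)))` evaluates to `[0,10,0,1]`: both input loops run in full, with the second one's writes offset by the first array's length.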


international conference on functional programming | 2013

Simple and compositional reification of monadic embedded languages

Josef Svenningsson; Bo Joel Svensson

When writing embedded domain specific languages in Haskell, it is often convenient to be able to make an instance of the Monad class to take advantage of the do-notation and the extensive monad libraries. Commonly it is desirable to compile such languages rather than just interpret them. This introduces the problem of monad reification, i.e. observing the structure of the monadic computation. We present a solution to the monad reification problem and illustrate it with a small robot control language. Monad reification is not new but the novelty of our approach is in its directness, simplicity and compositionality.
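The flavor of the approach can be sketched with an operational-style monad, where the bind structure is kept as data and is therefore observable. This is a simplified reconstruction in that spirit, not the paper's exact code; the robot instruction set (`Forward`, `Turn`) and the function names are hypothetical.

```haskell
{-# LANGUAGE GADTs #-}

-- A computation is either a result or an instruction followed by a
-- continuation; the structure is observable, so it can be compiled.
data Program i a where
  Return :: a -> Program i a
  Bind   :: i a -> (a -> Program i b) -> Program i b

instance Functor (Program i) where
  fmap f m = m >>= (Return . f)

instance Applicative (Program i) where
  pure      = Return
  mf <*> ma = mf >>= \f -> fmap f ma

instance Monad (Program i) where
  Return a >>= k = k a
  Bind i c >>= k = Bind i (\x -> c x >>= k)

-- A small robot control language (hypothetical instruction set).
data Robot a where
  Forward :: Int -> Robot ()
  Turn    :: Robot ()

forward :: Int -> Program Robot ()
forward n = Bind (Forward n) Return

turn :: Program Robot ()
turn = Bind Turn Return

-- Reification in action: walk the structure and emit "code".
compile :: Program Robot a -> [String]
compile (Return _)           = []
compile (Bind (Forward n) k) = ("forward " ++ show n) : compile (k ())
compile (Bind Turn k)        = "turn" : compile (k ())
```

With this encoding, `compile (forward 10 >> turn >> forward 5)` yields `["forward 10","turn","forward 5"]` even though the program was written with ordinary do-notation-compatible monadic combinators.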


functional high performance computing | 2014

Defunctionalizing push arrays

Bo Joel Svensson; Josef Svenningsson

Recent work on embedded domain specific languages (EDSLs) for high performance array programming has given rise to a number of array representations. In Feldspar and Obsidian there are two different kinds of arrays, called Pull and Push arrays. Both Pull and Push arrays are deferred; they are methods of computing arrays, rather than elements stored in memory. The reason for having multiple array types is to obtain code that performs better. Pull and Push arrays provide this by guaranteeing that operations fuse automatically. It is also the case that some operations are easily implemented and perform well on Pull arrays, while for some operations, Push arrays provide better implementations. But do we really need to have more than one array representation? In this paper we derive a new array representation from Push arrays that have all the good qualities of Pull and Push arrays combined. This new array representation is obtained via defunctionalization of a Push array API.
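The idea of defunctionalization here can be sketched as follows: the higher-order "loop" inside a push array is replaced by a first-order data type of loop shapes, interpreted by a single apply function. This is a toy pure-list model loosely following the paper's `PushT`; the constructor set and helper names are simplified assumptions, not the paper's full API.

```haskell
{-# LANGUAGE GADTs #-}
import Data.List (sortOn)

-- Defunctionalized push array: a data type of ways to produce elements.
data PushT a where
  Generate :: Int -> (Int -> a) -> PushT a
  Map      :: (b -> a) -> PushT b -> PushT a
  Append   :: PushT a -> PushT a -> PushT a

len :: PushT a -> Int
len (Generate n _) = n
len (Map _ p)      = len p
len (Append p q)   = len p + len q

-- The single interpreter that gives the constructors their meaning.
writes :: PushT a -> [(Int, a)]
writes (Generate n f) = [(i, f i) | i <- [0 .. n - 1]]
writes (Map f p)      = [(i, f a) | (i, a) <- writes p]
writes (Append p q)   = writes p ++ [(len p + i, a) | (i, a) <- writes q]

freeze :: PushT a -> [a]
freeze p = map snd (sortOn fst (writes p))

-- The payoff: unlike a function-encoded push array, PushT can also be
-- indexed directly, recovering the key pull-array capability.
index :: PushT a -> Int -> a
index (Generate _ f) i = f i
index (Map f p)      i = f (index p i)
index (Append p q)   i
  | i < len p = index p i
  | otherwise = index q (i - len p)
```

For instance, `index (Append (Generate 2 (* 2)) (Generate 2 (+ 100))) 3` is `101`, computed without materializing the array, which is exactly the kind of operation that plain push arrays cannot support cheaply.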


ACM Queue | 2014

Design exploration through code-generating DSLs

Bo Joel Svensson; Mary Sheeran; Ryan R. Newton

DSLs (domain-specific languages) make programs shorter and easier to write. They can be stand-alone (for example, LaTeX, Makefiles, and SQL) or they can be embedded in a host language. You might think that DSLs embedded in high-level languages would be abstract or mathematically oriented, far from the nitty-gritty of low-level programming. This is not the case. This article demonstrates how high-level EDSLs (embedded DSLs) really can ease low-level programming. There is no contradiction.


functional high performance computing | 2012

Parallel programming in Haskell almost for free: an embedding of Intel's Array Building Blocks

Bo Joel Svensson; Mary Sheeran

Nowadays, performance in processors is increased by adding more cores or wider vector units, or by combining accelerators like GPUs and traditional cores on a chip. Programming for these diverse architectures is a challenge. We would like to exploit all the resources at hand without putting too much burden on the programmer. Ideally, the programmer should be presented with a machine model abstracted from the specific number of cores, the SIMD width, or the presence or absence of a GPU. Intel's Array Building Blocks (ArBB) is a system that takes on these challenges. ArBB is a language for data parallel and nested data parallel programming, embedded in C++. By offering a retargetable dynamic compilation framework, it provides vectorisation and threading to programmers without the need to write highly architecture-specific code. We aim to bring the same benefits to the Haskell programmer by implementing a Haskell frontend (embedding) of the ArBB system. We call this embedding EmbArBB. We use standard Haskell embedded language procedures to provide an interface to the ArBB functionality in Haskell. EmbArBB is work in progress and does not currently support all of the ArBB functionality. Some small programming examples illustrate how the Haskell embedding is used to write programs. ArBB code is short and to the point in both C++ and Haskell. Matrix multiplication has been benchmarked in sequential C++, ArBB in C++, EmbArBB and the Repa library. The C++ and the Haskell embeddings have almost identical performance, showing that the Haskell embedding does not impose any large extra overheads. Two image processing algorithms have also been benchmarked against Repa. In these benchmarks at least, EmbArBB performance is much better than that of the Repa library, indicating that building on ArBB may be a cheap and easy approach to exploiting data parallelism in Haskell.


functional high performance computing | 2013

Counting and occurrence sort for GPUs using an embedded language

Josef Svenningsson; Bo Joel Svensson; Mary Sheeran

This paper investigates two sorting algorithms: counting sort and a variation, occurrence sort, which also removes duplicate elements, and examines their suitability for running on the GPU. The duplicate removing variation turns out to have a natural functional, data-parallel implementation which makes it particularly interesting for GPUs. The algorithms are implemented in Obsidian, a high-level domain specific language for GPU programming. Measurements show that our implementations in many cases outperform the sorting algorithm provided by the library Thrust. Furthermore, occurrence sort is another factor of two faster than ordinary counting sort. We conclude that counting sort is an important contender when considering sorting algorithms for the GPU, and that occurrence sort is highly preferable when applicable. We also show that Obsidian can produce very competitive code.
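The two algorithms can be modelled in a few lines of sequential Haskell (the paper's versions are data-parallel Obsidian kernels running on the GPU; the function names and the assumption of keys in `[0 .. m]` are ours):

```haskell
import Data.Array

-- Counting sort over keys in [0 .. m]: build a histogram, then expand
-- each key according to its count.
countingSort :: Int -> [Int] -> [Int]
countingSort m xs = concat [replicate c v | (v, c) <- assocs hist]
  where hist = accumArray (+) 0 (0, m) [(x, 1) | x <- xs]

-- Occurrence sort: record only *whether* a key occurs, so duplicates
-- vanish. The combine step is an idempotent "set a flag", which is what
-- makes this variant so naturally data parallel.
occurrenceSort :: Int -> [Int] -> [Int]
occurrenceSort m xs = [v | (v, True) <- assocs flags]
  where flags = accumArray (\_ _ -> True) False (0, m) [(x, ()) | x <- xs]
```

For example, `countingSort 9 [5,3,5,1]` gives `[1,3,5,5]` while `occurrenceSort 9 [5,3,5,1]` gives `[1,3,5]`. The flag array needs no atomic additions when parallelised, only idempotent writes, which hints at why the occurrence variant runs faster on the GPU.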


Journal of Functional Programming | 2016

A language for hierarchical data parallel design-space exploration on GPUs

Bo Joel Svensson; Ryan R. Newton; Mary Sheeran

Graphics Processing Units (GPUs) offer potential for very high performance; they are also rapidly evolving. Obsidian is an embedded language (in Haskell) for implementing high performance kernels to be run on GPUs. We would like to have our cake and eat it too; we want to raise the level of abstraction beyond CUDA code and still give the programmer control over the details relevant to kernel performance. To that end, Obsidian provides array representations that guarantee elimination of intermediate arrays while also using the type system to model the hierarchy of the GPU. Operations are compiled very differently depending on what level of the GPU they target, and as a result, the user is gently constrained to write code that matches the capabilities of the GPU. Thus, we implement not Nested Data Parallelism, but a more limited form that we call Hierarchical Data Parallelism. We walk through case studies that demonstrate how to use Obsidian for rapid design exploration or auto-tuning, resulting in performance that compares well to the hand-tuned kernels used in Accelerate and NVIDIA Thrust.
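The "hierarchy in the type system" idea can be sketched with phantom types. This is a toy model, not Obsidian's real API: the level names follow CUDA, but `Prog`, `threadOp`, `distribute`, and `launch` are hypothetical, and the program body is just pseudo-instruction text.

```haskell
-- GPU hierarchy levels, used only as phantom type indices.
data Thread
data Block
data Grid

-- A program tagged with the level it runs at; the body is simplified
-- to a list of pseudo-instructions instead of generated CUDA.
newtype Prog lvl = Prog { code :: [String] }

-- A sequential, per-thread operation.
threadOp :: String -> Prog Thread
threadOp op = Prog ["thread: " ++ op]

-- Distribute a Thread program over the threads of a Block. Because the
-- levels are types, ill-formed nestings (say, launching a Grid from
-- inside a Thread) are rejected at compile time.
distribute :: Int -> Prog Thread -> Prog Block
distribute n (Prog body) =
  Prog (("parfor tid < " ++ show n) : map ("  " ++) body)

-- Run a Block program across the blocks of the Grid.
launch :: Int -> Prog Block -> Prog Grid
launch nb (Prog body) =
  Prog (("launch " ++ show nb ++ " blocks") : map ("  " ++) body)
```

Here `launch 4 (distribute 256 (threadOp "x[i] = a[i] + b[i]"))` typechecks, while `distribute 256 (launch 4 ...)` does not: the types enforce the hierarchy, which is the limited, compilable discipline the paper calls Hierarchical Data Parallelism.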


functional high performance computing | 2015

Meta-programming and auto-tuning in the search for high performance GPU code

Michael Vollmer; Bo Joel Svensson; Eric Holk; Ryan R. Newton

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost of ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.
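The shape of such a framework can be sketched abstractly: a kernel exposes its tuning parameters as a function to a cost, and a search strategy is any function over that interface. The types and names below (`Kernel`, `Strategy`, `exhaustive`, `grid2`) are our assumptions for illustration; in the real framework the cost comes from timing generated GPU code, not a pure function.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- A tunable kernel, abstractly: tuning parameters in, measured cost out.
type Kernel p = p -> Double

-- A search strategy picks the best point of a parameter space; kernels
-- stay polymorphic over which strategy tunes them.
type Strategy p = [p] -> Kernel p -> (p, Double)

-- The simplest strategy: try everything, keep the cheapest.
exhaustive :: Strategy p
exhaustive ps cost = minimumBy (comparing snd) [(p, cost p) | p <- ps]

-- A hyper-rectangular (regular) search space: the cartesian product of
-- independent per-axis ranges, regular by construction.
grid2 :: [a] -> [b] -> [(a, b)]
grid2 xs ys = [(x, y) | x <- xs, y <- ys]
```

A random or hill-climbing strategy would have the same `Strategy` type, so swapping tuners requires no change to the kernel; for example, `exhaustive (grid2 [32,64,128] [8,16,32]) cost` tunes a two-parameter kernel over a 3x3 grid.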


functional high performance computing | 2016

Low-level functional GPU programming for parallel algorithms

Martin Dybdal; Martin Elsman; Bo Joel Svensson; Mary Sheeran

We present a Functional Compute Language (FCL) for low-level GPU programming. FCL is functional in style, which allows for easy composition of program fragments and thus easy prototyping and a high degree of code reuse. In contrast with projects such as Futhark, Accelerate, Harlan, Nessie and Delite, the intention is not to develop a language providing fully automatic optimizations, but instead to provide a platform that supports absolute control of the GPU computation and memory hierarchies. The developer is thus required to have an intimate knowledge of the target platform, as is also required when using CUDA/OpenCL directly. FCL is heavily inspired by Obsidian. However, instead of relying on a multi-staged meta-programming approach for kernel generation using Haskell as meta-language, FCL is completely self-contained, and we intend it to be suitable as an intermediate language for data-parallel languages, including data-parallel parts of high-level array languages, such as R, Matlab, and APL. We present a type system and a dynamic semantics suitable for understanding the performance characteristics of both FCL and Obsidian-style programs. Our aim is that FCL will be useful as a platform for developing new parallel algorithms, as well as a target-language for various code-generators targeting GPU hardware.


programming language design and implementation | 2017

Instruction punning: lightweight instrumentation for x86-64

Buddhika Chamith; Bo Joel Svensson; Luke Dalessandro; Ryan R. Newton

Existing techniques for injecting probes into running applications are limited; they either fail to support probing arbitrary locations, or to support scalable, rapid toggling of probes. We introduce a new technique on x86-64, called instruction punning, which allows scalable probes at any instruction. The key idea is that when we inject a jump instruction, the relative address of the jump serves simultaneously as data and as an instruction sequence. We show that this approach achieves probe invocation overheads of only a few dozen cycles, and probe activation/deactivation costs that are cheaper than a system call, even when all threads in the system are both invoking probes and toggling them.

Collaboration


Dive into Bo Joel Svensson's collaborations.

Top Co-Authors

Mary Sheeran (Chalmers University of Technology)

Josef Svenningsson (Chalmers University of Technology)

Eric Holk (Indiana University Bloomington)

Martin Dybdal (University of Copenhagen)

Martin Elsman (University of Copenhagen)

Koen Claessen (Chalmers University of Technology)