Kjell Winblad
Uppsala University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Kjell Winblad.
IEEE Transactions on Parallel and Distributed Systems | 2018
David Klaftenegger; Konstantinos F. Sagonas; Kjell Winblad
The scalability of parallel programs is often bounded by the performance of synchronization mechanisms used to protect critical sections. The performance of these mechanisms is in turn determined by their sequential execution time, efficient use of hardware, and ability to avoid waiting. In this article, we describe queue delegation (QD) locking, a family of locks that both delegate critical sections and enable detaching execution. Threads delegate work to the thread currently holding the lock and are able to detach, i.e., immediately continue their execution until they need a result from a previously delegated critical section. We show how to use queue delegation to build synchronization algorithms with lower overhead and higher throughput than existing algorithms, even when critical sections need to communicate results back immediately. Experiments when using up to 64 threads to access a shared priority queue show that QD locking provides 10 times higher throughput than Pthreads mutex locks and outperforms leading delegation algorithms. Also, when mixing parallel reads with delegated write operations, QD locking outperforms competing algorithms with an advantage ranging from 9.5 up to 207 percent increased throughput. Last but not least, continuing execution instead of waiting for the execution of critical sections leads to increased parallelism and better scalability. As we will see, queue delegation locking uses simple building blocks whose overhead is low even in uncontended use. All these make the technique useful in a wide variety of applications.
european conference on parallel processing | 2014
David Klaftenegger; Konstantinos F. Sagonas; Kjell Winblad
While standard locking libraries are common and easy to use, delegation algorithms that offload work to a single thread can achieve better performance in multithreaded applications, but are hard to use without adequate library support. This paper presents an interface for delegation locks together with libraries for C and C++ that make it easy to use queue delegation locking, a versatile high-performance delegation algorithm. We show examples of using these libraries, discuss the porting effort needed to take full advantage of delegation locking in applications designed with standard locking in mind, and the improved performance that this achieves.
languages and compilers for parallel computing | 2015
Konstantinos F. Sagonas; Kjell Winblad
We extend contention adapting trees CA trees, a family of concurrent data structures for ordered sets, to support linearizable range queries, range updates, and operations that atomically operate on multiple keys such as bulk insertions and deletions. CA trees differ from related concurrent data structures by adapting themselves according to the contention level and the access patterns to scale well in a multitude of scenarios. Variants of CA trees with different performance characteristics can be derived by changing their sequential component. We experimentally compare CA trees to state-of-the-art concurrent data structures and show that CA trees beat the best data structures we compare against with upi¾?to 57i¾?% in scenarios that contain basic set operations and range queries, and outperform them by more than 1200i¾?% in scenarios that also contain range updates.
acm symposium on parallel algorithms and architectures | 2014
David Klaftenegger; Konstantinos F. Sagonas; Kjell Winblad
The scalability of parallel programs is often bounded by the performance of synchronization mechanisms used to protect critical sections. The performance of these mechanisms is in turn determined by their ability to use modern hardware efficiently and do useful work while or instead of waiting. This brief announcement sketches the idea and implementation of queue delegation locking, a synchronization mechanism that provides high throughput by allowing threads to efficiently delegate their critical sections to the thread currently holding the lock and by allowing threads that do not need a result from their critical section to continue executing immediately after delegating their work. Experiments show that queue delegation locking outperforms leading synchronization mechanisms due to the combination of its fast operation transfer with its ability to allow threads to continue doing useful work instead of waiting. Thanks to its simple building blocks, even its uncontended overhead is low, making queue delegation locking useful in a wide variety of applications.
international symposium on parallel and distributed computing | 2015
Konstantinos F. Sagonas; Kjell Winblad
With multicourse being ubiquitous, concurrent data structures are becoming increasingly important. This paper proposes a novel approach to concurrent data structure design where the data structure collects statistics about contention and adapts dynamically according to this statistics. We use this approach to create a contention adapting binary search tree (CA tree) that can be used to implement concurrent ordered sets and maps. Our experimental evaluation shows that CA trees scale similar to recently proposed algorithms on a big multicore machine on various scenarios with a larger set size, and outperform the same data structures in more contended scenarios and in sequential performance. We also show that CA trees are well suited for optimization with hardware lock elision. In short, we propose a practically useful and easy to implement and show correct concurrent search tree that naturally adapts to the level of contention.
annual erlang workshop | 2014
Konstantinos F. Sagonas; Kjell Winblad
The Erlang Term Storage (ETS) is a key component of the runtime system and standard library of Erlang/OTP. In particular, on big multicores, the performance of many applications that use ETS as a shared key-value store heavily depends on the scalability of ETS. In this work, we investigate an alternative implementation for the ETS table type ordered_set based on a contention adapting search tree. The new implementation performs many times better than the current one in contended scenarios and scales better than the ETS table types implemented using hashing and fine-grained locking when several processor chips are used. We evaluate the new implementation with a set of experiments that show its scalability in relation to the current ETS implementation as well as its low sequential overhead.
Journal of Parallel and Distributed Computing | 2017
Konstantinos F. Sagonas; Kjell Winblad
With multicores being ubiquitous, concurrent data structures are increasingly important. This article proposes a novel approach to concurrent data structure design where the data structure dynamica ...
ACM Transactions on Programming Languages and Systems | 2017
Phil Trinder; Natalia Chechina; Nikolaos Papaspyrou; Konstantinos F. Sagonas; Simon J. Thompson; Stephen Adams; Stavros Aronis; Robert Baker; Eva Bihari; Olivier Boudeville; Francesco Cesarini; Maurizio Di Stefano; Sverker Eriksson; Viktória fördős; Amir Ghaffari; Aggelos Giantsios; Rickard Green; Csaba Hoch; David Klaftenegger; Huiqing Li; Kenneth Lundin; Kenneth MacKenzie; Katerina Roukounaki; Yiannis Tsiouris; Kjell Winblad
Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model. We systematically study the scalability limits of Erlang and then address the issues at the virtual machine, language, and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug, we have developed and made open source releases of five complementary tools, some specific to SD Erlang. Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6,144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6,144 cores).
languages and compilers for parallel computing | 2016
Konstantinos F. Sagonas; Kjell Winblad
Efficient and scalable concurrent priority queues are crucial for the performance of many multicore applications, e.g. for task scheduling and the parallelization of various algorithms. Linearizable concurrent priority queues with traditional semantics suffer from an inherent sequential bottleneck in the head of the queue. This bottleneck is the motivation for some recently proposed priority queues with more relaxed semantics. We present the contention avoiding concurrent priority queue (CA-PQ), a data structure that functions as a linearizable concurrent priority with traditional semantics under low contention, but activates contention avoiding techniques that give it more relaxed semantics when high contention is detected. CA-PQ avoids contention in the head of the queue by removing items in bulk from the global data structure, which also allows it to often serve DelMin operations without accessing memory that is modified by several threads. We show that CA-PQ scales well. Its cache friendly design achieves performance that is twice as fast compared to that of state-of-the-art concurrent priority queues on several instances of a parallel shortest path benchmark.
acm symposium on parallel algorithms and architectures | 2018
Kjell Winblad; Konstantinos F. Sagonas; Bengt Jonsson
Concurrent key-value stores with range query support are crucial for the scalability and performance of many applications. Existing lock-free data structures of this kind use a fixed synchronization granularity. Using a fixed synchronization granularity in a concurrent key-value store with range query support is problematic as the best performing synchronization granularity depends on a number of factors that are difficult to predict, such as the level of contention and the number of items that are accessed by range queries. We present the first lock-free key-value store with linearizable range query support that dynamically adapts its synchronization granularity. This data structure is called the lock-free contention adapting search tree (LFCA tree). An LFCA tree does local adaptations of its synchronization granularity based on heuristics that take contention and the performance of range queries into account. We show that the operations of LFCA trees are linearizable, that the lookup operation is wait-free, and that the remaining operations (insert, remove and range query) are lock-free. Our experimental evaluation shows that LFCA trees achieve more than twice the throughput of related lock-free data structures in many scenarios. Furthermore, LFCA trees are able to perform substantially better than data structures with a fixed synchronization granularity over a wide range of scenarios due to their ability to adapt to the scenario at hand.