Artur Podobas
Royal Institute of Technology
Publication
Featured research published by Artur Podobas.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016
Ananya Muddukrishna; Peter A. Jonsson; Artur Podobas; Mats Brorsson
Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming into per-chunk granularity. This runtime-system- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains -- the computation performed by a task or a parallel for-loop chunk instance -- and highlights problems such as low parallelism, work inflation and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems in standard OpenMP programs that are difficult to detect and characterize in fine detail using existing visualizations, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
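To make concrete where grains come from, the sketch below mixes OpenMP tasks with a chunked parallel for-loop; each task instance and each loop chunk would appear as one grain in a grain graph. The process_block and process_element functions are placeholders, and the fragment is illustrative rather than code from the paper.

    #define N 1024

    /* Placeholder workloads, only to show where grains arise. */
    static void process_block(int i)   { (void)i; /* per-task work */ }
    static void process_element(int i) { (void)i; /* per-iteration work */ }

    int main(void)
    {
        #pragma omp parallel
        #pragma omp single
        {
            /* Each task instance becomes one grain. */
            for (int i = 0; i < N; i += 64) {
                #pragma omp task firstprivate(i)
                process_block(i);
            }
            #pragma omp taskwait
        }

        /* Each 64-iteration chunk likewise becomes one grain, making
           per-chunk cost and load imbalance visible. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++)
            process_element(i);

        return 0;
    }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.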
Concurrency and Computation: Practice and Experience | 2015
Artur Podobas; Mats Brorsson; Karl-Filip Faxén
Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model supporting structured parallel programming patterns in a portable and composable manner is the task-centric programming model. In this study, we compare several popular task-centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We have analyzed their performance on the Barcelona OpenMP Tasking Suite benchmarks, both on a 48-core AMD Opteron 6172 server and a 64-core TILEPro64 embedded many-core processor. Our results show that OpenMP offers the highest flexibility for programmers, but this flexibility comes at a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient in terms of both performance and energy consumption. However, Intel's implementation of OpenMP tasks performs the best and comes closest to the specialized run-time systems.
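As a reminder of what the task-centric style looks like in OpenMP 3.0, here is a minimal recursive Fibonacci kernel in the spirit of the benchmarks used in the study; it is an illustrative sketch, not code from the benchmark suite, and it deliberately omits the granularity cutoff a tuned version would add.

    #include <stdio.h>

    /* Recursive Fibonacci with OpenMP 3.0 tasks. */
    static long fib(int n)
    {
        long a, b;
        if (n < 2)
            return n;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

    int main(void)
    {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }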
International Workshop on OpenMP | 2014
Artur Podobas; Mats Brorsson; Vladimir Vlassov
Data-driven task-parallelism is attracting growing interest and has now been added to OpenMP (4.0). This paradigm simplifies the writing of parallel applications, helps extract parallelism, and facilitates the use of distributed memory architectures. While the programming model itself is becoming mature, a problem with current run-time scheduler implementations is that they require a very large task granularity in order to scale. This limitation is at odds with the idea of task-parallel programming, where programmers should be able to concentrate on exposing parallelism with little regard to task granularity. To mitigate this limitation, we have designed and implemented TurboBŁYSK, a highly efficient run-time scheduler of tasks with explicit data-dependence annotations. We propose a novel mechanism based on pattern saving that allows the scheduler to re-use previously resolved dependency patterns, based on programmer annotations, enabling programs to use even the smallest of tasks and still scale well. We experimentally show that our techniques in TurboBŁYSK achieve nearly twice the peak performance of other run-time schedulers. Our techniques are not OpenMP-specific and can be implemented in other task-parallel frameworks.
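To illustrate the kind of annotations such a scheduler consumes, the fragment below chains three tasks with OpenMP 4.0 depend clauses; the produce, transform and consume functions are invented placeholders and the code is not taken from TurboBŁYSK.

    #include <stdio.h>

    #define N 256

    static void produce(double *x)                       { for (int i = 0; i < N; i++) x[i] = i; }
    static void transform(const double *in, double *out) { for (int i = 0; i < N; i++) out[i] = 2.0 * in[i]; }
    static void consume(const double *x)                 { printf("b[0] = %f\n", x[0]); }

    int main(void)
    {
        double a[N], b[N];

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            produce(a);

            #pragma omp task depend(in: a) depend(out: b)
            transform(a, b);

            #pragma omp task depend(in: b)
            consume(b);

            /* The runtime resolves the a -> b dependence chain at run time;
               pattern saving targets exactly this resolution overhead when
               the same chain is submitted repeatedly. */
        }
        return 0;
    }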
International Conference on Parallel Processing | 2012
Ananya Muddukrishna; Artur Podobas; Mats Brorsson; Vladimir Vlassov
Modern manycore processors feature a highly scalable and software-configurable cache hierarchy. For performance, manycore programmers will not only have to utilize the large number of cores efficiently but also understand and configure the cache hierarchy to suit the application. Relief from this manycore programming nightmare can be provided by task-based programming models, where programmers parallelize using tasks and an architecture-specific runtime system maps tasks to cores and, in addition, configures the cache hierarchy. In this paper, we focus on the cache hierarchy of the Tilera TILEPro64 processor, which features a software-configurable coherence waypoint called the home cache. We first show that scheduling tasks obliviously to the nature of home caches creates a runtime system performance bottleneck. We then demonstrate a technique in which the runtime system controls the assignment of home caches to memory blocks and schedules tasks to minimize home cache access penalties. Our technique shows a significant execution-time improvement on selected benchmarks, leading to the conclusion that by taking processor architecture features into account, task-based programming models can indeed provide continued performance and allow programmers to transition smoothly from the multicore to the manycore era.
International Conference on Parallel Processing | 2012
Artur Podobas; Mats Brorsson; Vladimir Vlassov
Computer architecture technology is moving towards more heterogeneous solutions that will contain a number of processing units with different capabilities, which may increase the performance of the system as a whole. However, with increased performance comes increased complexity; complexity that is barely handled even in today's homogeneous multiprocessing systems. The present study tries to solve a small piece of the heterogeneous puzzle: how can we exploit all system resources in a performance-effective and user-friendly way? Our proposed solution includes a run-time system capable of using a variety of different heterogeneous components while providing the user with the already familiar task-centric programming model interface. Furthermore, when dealing with non-uniform workloads, we show that traditional approaches based on centralized or work-stealing queue algorithms do not work well, and we propose a scheduling algorithm based on trend analysis to distribute work across resources in a performance-effective way.
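A rough, hypothetical sketch of the idea behind trend-based distribution follows: each resource keeps a smoothed estimate of its recent throughput and the next batch of work is split in proportion to those estimates. All names and the smoothing scheme are invented for illustration and do not reflect the paper's actual scheduler.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct { double throughput; } resource_t;     /* tasks/second, smoothed */

    static void observe(resource_t *r, double completed, double seconds)
    {
        const double alpha = 0.3;                          /* smoothing factor */
        r->throughput = alpha * (completed / seconds) + (1.0 - alpha) * r->throughput;
    }

    static void distribute(size_t ntasks, const resource_t *res, size_t nres, size_t *share)
    {
        double total = 0.0;
        for (size_t i = 0; i < nres; i++) total += res[i].throughput;
        for (size_t i = 0; i < nres; i++)
            share[i] = (size_t)(ntasks * res[i].throughput / total + 0.5);
    }

    int main(void)
    {
        resource_t res[2] = { { 1.0 }, { 1.0 } };          /* e.g. a CPU and an accelerator */
        observe(&res[0], 200, 1.0);                        /* CPU finished 200 tasks last second */
        observe(&res[1], 600, 1.0);                        /* accelerator finished 600 */

        size_t share[2];
        distribute(1000, res, 2, share);
        printf("next batch: cpu=%zu accel=%zu\n", share[0], share[1]);
        return 0;
    }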
2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs | 2014
Artur Podobas
The task-based programming paradigm offers a portable way of writing parallel applications. However, it requires tedious tuning of the application for performance. We present a novel design flow in which programmers can use application knowledge to easily generate a System-on-Chip (SoC) specialized in executing the application. Our design flow uses a compiler that automatically generates task-specific cores and packs them into a custom SoC. An SoC-specific runtime system schedules tasks on the cores to accelerate application execution. The generated SoC shows up to a 6000-fold performance improvement over the Altera NiosII/s processor and up to 7-fold over an AMD Opteron 6172 core. Our design flow helps programmers generate high-performance systems without requiring tuning or prior hardware design knowledge.
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2016
Artur Podobas; Mats Brorsson
OpenMP enables productive software development targeting shared-memory general-purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems — systems that will more than likely contain Field-Programmable Gate Arrays (FPGAs) to compensate for the lack of parallelism available in general-purpose systems. We have designed a high-level synthesis flow that automatically generates parallel hardware from unmodified OpenMP programs. The generated hardware is composed of accelerators tailored to act as hardware instances of the OpenMP task primitive. We drive decision making of complex details within the accelerators through a constraint-programming model, minimizing the expected input from the (often) hardware-oblivious software developer. We evaluate our system against two state-of-the-art architectures — the Xeon Phi and the AMD Opteron — and find our accelerators to perform on par with the two ASIC processors.
International Workshop on OpenMP | 2015
Lars Frydendal Bonnichsen; Artur Podobas
OpenMP applications with abundant parallelism are often characterized by their high performance. Unfortunately, OpenMP applications with many synchronization or serialization points perform poorly because of blocking, i.e. the threads have to wait for each other. In this paper, we present methods based on hardware transactional memory (HTM) for executing OpenMP barrier, critical, and taskwait directives without blocking. Although HTM is still relatively new in the Intel and IBM architectures, we experimentally show a 73 % performance improvement over traditional locking approaches and 23 % over other HTM approaches on critical sections. Speculation over barriers can decrease execution time by up to 41 %. We expect that future systems with HTM support and more cores will benefit even more from our approach, as they are more likely to block.
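The fragment below sketches the underlying lock-elision mechanism with Intel RTM intrinsics, assuming an x86 CPU with TSX support (compile with -mrtm); the paper applies this kind of speculation inside the OpenMP runtime for barrier, critical and taskwait, whereas this standalone example only illustrates the principle and is not the authors' implementation.

    #include <immintrin.h>   /* _xbegin, _xend, _xabort */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int fallback_lock = 0;        /* 0 = free, 1 = taken */

    static void lock_fallback(void)
    {
        int expected = 0;
        while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
            expected = 0;
    }

    static void unlock_fallback(void) { atomic_store(&fallback_lock, 0); }

    static void critical_update(long *shared_counter)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))    /* someone holds the real lock */
                _xabort(0xff);
            (*shared_counter)++;                /* speculative, conflict-checked by HTM */
            _xend();
        } else {                                /* aborted: fall back to the lock */
            lock_fallback();
            (*shared_counter)++;
            unlock_fallback();
        }
    }

    int main(void)
    {
        long counter = 0;
        critical_update(&counter);
        printf("counter = %ld\n", counter);
        return 0;
    }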
International Workshop on OpenMP | 2016
Artur Podobas; Sven Karlsson
OpenMP 4.5 introduced a task-parallel version of the classical thread-parallel for-loop construct: the taskloop construct. With this new construct, programmers can choose between the two parallel paradigms when parallelizing their for-loops. However, it is unclear where and when each approach should be used to write efficient parallel applications.
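For reference, the same loop written both ways, assuming a placeholder work() function; the schedule and grainsize values are arbitrary illustrative choices.

    #define N 100000

    static void work(int i) { (void)i; /* placeholder per-iteration work */ }

    /* Classic worksharing for-loop: iterations are divided among the threads
       of the parallel region. */
    void with_parallel_for(void)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < N; i++)
            work(i);
    }

    /* OpenMP 4.5 taskloop: the iteration space is chopped into tasks that the
       task scheduler executes. */
    void with_taskloop(void)
    {
        #pragma omp parallel
        #pragma omp single
        #pragma omp taskloop grainsize(512)
        for (int i = 0; i < N; i++)
            work(i);
    }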
3rd Workshop on Programmability Issues for Multi-Core Computers | 2010
Artur Podobas; Mats Brorsson; Karl-Filip Faxén