Magnus Broberg
Blekinge Institute of Technology
Publication
Featured research published by Magnus Broberg.
Journal of Parallel and Distributed Computing | 1989
Magnus Broberg; Lars Lundberg; Håkan Grahn
Efficient performance tuning of parallel programs is often hard. Optimization is typically done after the program is written, as a last effort to increase performance. In a sequential program, every executed code segment affects the completion time. For a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads: certain code segments of the execution may not affect the completion time of the program, and optimizing such segments will not increase performance. In this paper we present an approach to optimizing performance by finding the extended critical path of a multithreaded program. Extended critical path analysis generalizes critical path analysis in the sense that it also handles more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application, written for the Solaris operating system, for any number of processors, based on an execution on a single-processor workstation.
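A minimal sketch of the idea, with an invented trace format (the Segment records and the greedy replay below are assumptions for illustration, not the paper's tool): replay the recorded thread segments on a chosen number of processors to predict the completion time, and treat a segment as part of the extended critical path if shortening it shortens that prediction.

```python
# Hypothetical sketch of extended critical path analysis; the Segment
# format and the greedy replay are assumptions, not the paper's tool.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    thread: int
    duration: float
    waits_for: List[int] = field(default_factory=list)  # indices of segments in other threads

def completion_time(segments, processors):
    """Replay the recorded segments on a fixed number of processors.
    A segment starts when its thread's previous segment and all segments
    it waits for have finished and a processor is free."""
    order = {}                                  # thread -> its segments, in program order
    for i, s in enumerate(segments):
        order.setdefault(s.thread, []).append(i)
    pos = {t: 0 for t in order}                 # next unstarted segment per thread
    finish = {}                                 # segment index -> finish time
    running = []                                # (finish time, segment index)
    now = 0.0
    while len(finish) < len(segments):
        for t, seq in order.items():            # start everything that is ready
            if len(running) >= processors:
                break
            p = pos[t]
            if p >= len(seq):
                continue
            i = seq[p]
            prev_done = p == 0 or seq[p - 1] in finish
            deps_done = all(d in finish for d in segments[i].waits_for)
            if prev_done and deps_done:
                running.append((now + segments[i].duration, i))
                pos[t] = p + 1
        if not running:
            raise RuntimeError("inconsistent trace: nothing can run")
        running.sort()
        now, i = running.pop(0)                 # advance to the next completion
        finish[i] = now
    return now

def on_extended_critical_path(segments, processors, i, eps=0.01):
    """A segment is on the extended critical path if making it slightly
    shorter shortens the whole program; optimizing segments that are not
    on the path cannot reduce the completion time."""
    shorter = [Segment(s.thread, s.duration, list(s.waits_for)) for s in segments]
    shorter[i].duration = max(0.0, shorter[i].duration - eps)
    return completion_time(shorter, processors) < completion_time(segments, processors)

# Three segments: thread 1 must wait for thread 0's segment before its second segment.
segs = [Segment(0, 4.0), Segment(1, 1.0), Segment(1, 2.0, waits_for=[0])]
print(completion_time(segs, processors=2))        # 6.0
print(on_extended_critical_path(segs, 2, 0))      # True: segment 0 limits completion
print(on_extended_critical_path(segs, 2, 1))      # False: optimizing it gains nothing
```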
International Parallel Processing Symposium | 1999
Magnus Broberg; Lars Lundberg; Håkan Grahn
Efficient performance tuning of parallel programs is often hard. We present a performance prediction and visualization tool called VPPB. Based on a monitored uni-processor execution, VPPB shows the predicted behaviour of a multithreaded program using any number of processors, and visualizes that behaviour as a graph. The first version of VPPB was unable to handle I/O operations. This version adds, through an improved tracing technique, the ability to trace activities at the kernel level as well. Thus, VPPB can now trace various I/O activities, e.g., manipulation of OS-internal buffers, physical disk I/O, socket I/O, and RPC. VPPB allows flexible performance tuning of parallel programs developed for shared memory multiprocessors using a standardized environment: C/C++ programs that use the thread package in Solaris 2.X.
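As a rough illustration of what a kernel-level trace makes available (the (thread, kind, duration) record format is invented here, and the calculation is only a crude lower bound, not VPPB's prediction): separate each thread's CPU time from the time it blocks in I/O; no schedule on a given number of processors can finish earlier than the CPU work spread over the processors, nor earlier than the most demanding single thread.

```python
# Crude trace-based estimate; the (thread, kind, duration) record format is
# an assumption for illustration and this is not VPPB's prediction algorithm.
def crude_lower_bound(trace, processors):
    cpu, io = {}, {}
    for thread, kind, duration in trace:
        bucket = cpu if kind == "cpu" else io
        bucket[thread] = bucket.get(thread, 0.0) + duration
    total_cpu = sum(cpu.values())
    # No schedule can beat the total CPU work divided over the processors,
    # nor the most demanding single thread (its CPU time plus the I/O it
    # must wait for), assuming each thread's events are sequential.
    longest_thread = max(cpu.get(t, 0.0) + io.get(t, 0.0)
                         for t in set(cpu) | set(io))
    return max(total_cpu / processors, longest_thread)

# Thread 0 computes and then waits for a disk read; threads 1 and 2 only compute.
trace = [(0, "cpu", 3.0), (0, "io", 1.0), (1, "cpu", 2.0), (2, "cpu", 2.0)]
print(crude_lower_bound(trace, processors=2))   # 4.0
```

A bound like this ignores the ordering of events; the tool's prediction is based on the full monitored execution.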
IEEE International Conference on High Performance Computing Data and Analytics | 2003
Lars Lundberg; Magnus Broberg; Kamilla Klonowska
Most cluster systems used in high-performance computing do not allow process relocation at run-time. Finding an allocation that results in minimal completion time is NP-hard, so (non-optimal) heuristic algorithms have to be used. One major drawback with heuristics is that we do not know whether the result is close to optimal or not. Here, we present a method for finding an upper bound on the minimal completion time for a given program. The bound helps the user to determine when it is worthwhile to continue the heuristic search for better allocations. Based on parameters derived from the program, as well as parameters describing the hardware platform, the method produces the bound on the minimal completion time. A practical demonstration of the method is presented using a tool that produces the bound.
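A toy sketch of how such a bound can steer the search (the bound computation itself is the paper's contribution and is not reproduced here; upper_bound() below is a stand-in, and a program is reduced to independent workloads placed statically on k processors): as long as the best allocation found so far completes above the bound, a strictly better allocation is guaranteed to exist.

```python
# Toy sketch of using an upper bound on the minimal completion time to decide
# whether a heuristic search should continue; upper_bound() is a stand-in for
# the paper's analytically derived bound.
import random

def makespan(allocation, work, k):
    loads = [0.0] * k
    for w, p in zip(work, allocation):
        loads[p] += w
    return max(loads)

def upper_bound(work, k):
    # Stand-in bound: the makespan of any concrete schedule (here a
    # longest-processing-time-first placement) can never be below the
    # minimal completion time, so it is a valid upper bound on it.
    loads = [0.0] * k
    for w in sorted(work, reverse=True):
        loads[loads.index(min(loads))] += w
    return max(loads)

def heuristic_search(work, k, iterations=1000, seed=1):
    rng = random.Random(seed)
    bound = upper_bound(work, k)
    best = [rng.randrange(k) for _ in work]
    for _ in range(iterations):
        cand = [rng.randrange(k) for _ in work]
        if makespan(cand, work, k) < makespan(best, work, k):
            best = cand
        # Above the bound, a strictly better allocation must exist, so the
        # search goes on; at or below it, further search may not pay off.
        if makespan(best, work, k) <= bound:
            break
    return best, bound

work = [5.0, 4.0, 3.0, 3.0, 2.0]
best, bound = heuristic_search(work, k=2)
print(makespan(best, work, 2), bound)   # the search stops once the bound is met
```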
European Conference on Parallel Processing | 2001
Magnus Broberg; Lars Lundberg; Håkan Grahn
Many multiprocessor systems are based on distributed shared memory. For performance reasons, it is often important to statically bind threads to processors in order to avoid remote memory accesses. Finding a good allocation takes a long time, and it is hard to know when to stop searching for a better one. It is also sometimes impossible to run the application on the target machine. The developer therefore needs a tool that finds good allocations without access to the target multiprocessor. We present a tool that uses a greedy algorithm and produces allocations that are more than 40% faster (on average) than those produced by a bin-packing algorithm. The number of allocations to be evaluated can be reduced by 38% at the cost of a 2% performance loss. Finally, an algorithm is proposed that is promising in avoiding local maxima.
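A hypothetical sketch of a greedy placement of this flavour (predicted_cost() and the thread data below are invented; the paper's tool evaluates candidate allocations using performance predictions rather than this toy cost): threads are placed one at a time, heaviest first, each on the processor that keeps the predicted cost of the partial allocation lowest.

```python
# Hypothetical greedy placement sketch; predicted_cost() is an invented toy
# cost (load balance plus a penalty for remote accesses between threads that
# share data) standing in for real performance predictions.
def predicted_cost(allocation, load, shares, remote_penalty=0.3):
    busy = {}
    for t, p in allocation.items():
        busy[p] = busy.get(p, 0.0) + load[t]
    cost = max(busy.values())
    for a, b in shares:                      # thread pairs that share data
        if a in allocation and b in allocation and allocation[a] != allocation[b]:
            cost += remote_penalty           # crude remote-memory-access penalty
    return cost

def greedy_allocate(threads, processors, load, shares):
    allocation = {}
    # Heaviest threads first; each one goes to the processor that keeps the
    # predicted cost of the partial allocation lowest.
    for t in sorted(threads, key=lambda t: -load[t]):
        allocation[t] = min(range(processors),
                            key=lambda p: predicted_cost({**allocation, t: p},
                                                         load, shares))
    return allocation

load = {"t0": 4.0, "t1": 3.0, "t2": 3.0, "t3": 2.0}
shares = [("t1", "t2")]                       # t1 and t2 share data
print(greedy_allocate(load, 2, load, shares)) # {'t0': 0, 't1': 1, 't2': 1, 't3': 0}
```

A bin-packing style baseline instead fills processors up to a capacity limit one thread at a time, without regard to where threads that share data end up.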
IEEE International Conference on High Performance Computing Data and Analytics | 2004
Lars Lundberg; Magnus Broberg; Kamilla Klonowska
Many parallel systems used in high-performance computing do not allow process relocation at run-time. It is thus important to find a good allocation of processes to processors. As the problem of finding an allocation that results in minimal completion time is NP-hard, one has to resort to heuristic algorithms for finding good allocations. One major drawback with heuristic algorithms is that we do not know whether the result is close to optimal, or whether it is worthwhile to continue the heuristic search for better allocations. In this paper, we present a method for finding an upper bound on the minimal completion time for a given program. If the completion time using the current allocation is above this bound, we know that it is worthwhile to continue the search for better allocations. The bound, which is optimally tight given the available information, is based on parameters derived from the program as well as parameters describing the hardware platform. A practical demonstration of the method is presented using a tool that produces the bound for multithreaded C programs executing in a parallel Sun/Solaris environment.
The Computer Journal | 2004
Kamilla Klonowska; Lars Lundberg; Håkan Lennerstad; Magnus Broberg
Consider a parallel program with n processes and a synchronization granularity z. Consider also two parallel architectures: an SMP with q processors and run-time reallocation of processes to processors, and a distributed system (or cluster) with k processors and no run-time reallocation. There is an inter-processor communication delay of t time units for the system with no run-time reallocation. In this paper we define a function H(n,k,q,t,z) such that the minimum completion time for all programs with n processes and granularity z is at most H(n,k,q,t,z) times longer using the system with no reallocation and k processors than using the system with q processors and run-time reallocation. We assume optimal allocation and scheduling of processes to processors. The function H(n,k,q,t,z) is optimal in the sense that there is at least one program, with n processes and granularity z, for which the ratio is exactly H(n,k,q,t,z). We also validate our results using measurements on distributed and multiprocessor Sun/Solaris environments. The function H(n,k,q,t,z) provides important insights into the performance implications of the fundamental design decision of whether or not to allow run-time reallocation of processes. These insights can be used to make the proper cost/benefit trade-offs when designing parallel execution platforms.
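A toy experiment illustrating the ratio that H(n,k,q,t,z) bounds, under heavy simplifications (synchronization is ignored, so z and the delay t play no role, and a program is reduced to n independent processes; this shows what the ratio means, not the paper's formula for H):

```python
# Toy illustration of the completion-time ratio that H(n,k,q,t,z) bounds.
# It ignores synchronization (z) and the communication delay (t) and treats
# a program as n independent processes with fixed execution times.
from itertools import product

def smp_time(work, q):
    # Optimal completion time with run-time reallocation on q processors
    # (McNaughton's wrap-around rule for preemptive scheduling).
    return max(max(work), sum(work) / q)

def cluster_time(work, k):
    # Optimal completion time with a fixed allocation on k processors,
    # found by exhaustive search over all assignments.
    best = float("inf")
    for assign in product(range(k), repeat=len(work)):
        loads = [0.0] * k
        for w, p in zip(work, assign):
            loads[p] += w
        best = min(best, max(loads))
    return best

work = [4.0, 3.0, 3.0, 2.0]                             # four processes
print(cluster_time(work, k=2) / smp_time(work, q=2))    # 1.0 for this program
```

H gives the worst such ratio over all programs with n processes and granularity z, which is what makes it usable for the cost/benefit trade-off discussed above.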
Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing | 1998
Magnus Broberg; Lars Lundberg; Håkan Grahn
Applied Informatics | 2003
Lars Lundberg; Kamilla Klonowska; Magnus Broberg; Håkan Lennerstad
IASTED International Conference on Parallel and Distributed Computing and Systems | 2002
Magnus Broberg; Lars Lundberg; Håkan Grahn
Archive | 2002
Magnus Broberg; Lars Lundberg; Kamilla Klonowska