Håkan Grahn
Blekinge Institute of Technology
Publications
Featured research published by Håkan Grahn.
Journal of Parallel and Distributed Computing | 2001
Magnus Broberg; Lars Lundberg; Håkan Grahn
Efficient performance tuning of parallel programs is often hard. Optimization is often done after the program is written, as a last effort to increase performance. With sequential programs, each executed code segment affects the completion time. For a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads. Thus, certain code segments of the execution may not affect the completion time of the program, and optimizing such code segments will not increase performance. In this paper we present an approach to optimizing performance by finding the extended critical path of the multithreaded program. The extended critical path analysis generalizes critical path analysis in the sense that it also handles more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system, for any number of processors, based on an execution on a single-processor workstation.
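As an illustration of the underlying idea, here is a minimal sketch (not the tool described above) of the classic critical-path computation over a DAG of thread segments that the extended analysis generalizes; the extended analysis additionally models scheduling when threads outnumber processors, which this sketch omits. The Segment type, segment durations, and function names are hypothetical.

```cpp
// Sketch: classic critical-path analysis over a DAG of thread segments.
// Each segment has a duration; edges model dependencies (fork/join, etc.).
#include <cstdio>
#include <vector>
#include <queue>
#include <algorithm>

struct Segment { double duration; std::vector<int> succs; };

// Longest path through the DAG = critical path length. Segments not on
// this path can be optimized without shortening program completion time.
double criticalPath(const std::vector<Segment>& segs) {
    int n = (int)segs.size();
    std::vector<int> indeg(n, 0);
    for (const auto& s : segs)
        for (int v : s.succs) ++indeg[v];

    std::queue<int> ready;
    std::vector<double> finish(n, 0.0);
    for (int i = 0; i < n; ++i)
        if (indeg[i] == 0) { ready.push(i); finish[i] = segs[i].duration; }

    double longest = 0.0;
    while (!ready.empty()) {                       // topological sweep
        int u = ready.front(); ready.pop();
        longest = std::max(longest, finish[u]);
        for (int v : segs[u].succs) {
            finish[v] = std::max(finish[v], finish[u] + segs[v].duration);
            if (--indeg[v] == 0) ready.push(v);
        }
    }
    return longest;
}

int main() {
    // Segment 0 forks segments 1 and 2; both join into segment 3.
    std::vector<Segment> segs = {
        {1.0, {1, 2}}, {5.0, {3}}, {2.0, {3}}, {1.0, {}}
    };
    std::printf("critical path: %.1f\n", criticalPath(segs)); // 7.0
}
```

Here segment 2 lies off the critical path (0 -> 1 -> 3), so optimizing it would not shorten completion time, which is exactly the distinction the paper exploits.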
IEEE Computer Architecture Letters | 2010
Syed Muhammad Zeeshan Iqbal; Yuchen Liang; Håkan Grahn
Multicore processors are the main computing platform in laptops, desktops, and servers today, and are also making their way into the embedded systems market. Using benchmarks is a common approach to evaluate the performance of a system. However, benchmarks for embedded systems have so far been either targeted at a uni-processor environment, e.g., MiBench, or commercial, e.g., MultiBench by EEMBC. In this paper, we propose and implement an open-source benchmark suite, ParMiBench, targeted at multiprocessor-based embedded systems. ParMiBench consists of parallel implementations of seven compute-intensive algorithms from the uni-processor benchmark suite MiBench. The applications are selected from four domains: Automation and Industry Control, Network, Office, and Security.
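To make the porting effort concrete, here is a minimal, hypothetical sketch of the kind of data-parallel decomposition used when turning a uni-processor kernel into a multiprocessor one: split the input range across worker threads, then combine partial results. This is not ParMiBench code; it uses C++ std::thread for brevity, and the kernel is a stand-in.

```cpp
// Sketch: data-parallel port of a compute-intensive kernel.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-element work, folded into a partial sum.
static uint64_t work(uint32_t x) { return (uint64_t)x * x; }

uint64_t parallelKernel(const std::vector<uint32_t>& in, unsigned nthreads) {
    std::vector<uint64_t> partial(nthreads, 0);  // one slot per worker
    std::vector<std::thread> pool;
    size_t chunk = (in.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) partial[t] += work(in[i]);
        });
    }
    for (auto& th : pool) th.join();             // barrier before combining
    return std::accumulate(partial.begin(), partial.end(), (uint64_t)0);
}

int main() {
    std::vector<uint32_t> data(1 << 20, 3);
    std::printf("%llu\n", (unsigned long long)parallelKernel(data, 4));
}
```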
Future Generation Computer Systems | 1995
Håkan Grahn; Per Stenström; Michel Dubois
Invalidation-based cache coherence protocols have been extensively studied in the context of large-scale shared-memory multiprocessors. Under a relaxed memory consistency model, most of the write latency can be hidden, whereas cache misses still incur a severe performance penalty. By contrast, update-based protocols have the potential to reduce both write and read penalties under relaxed memory consistency models, because coherence misses can be completely eliminated. The purpose of this paper is to compare update- and invalidation-based protocols for their ability to reduce or hide memory access latencies and for their ease of implementation under relaxed memory consistency models. Based on a detailed simulation study, we find that write-update protocols augmented with simple competitive mechanisms (we call such protocols competitive-update protocols) can hide all the write latency and cut the read penalty by as much as 46%, at the cost of some increase in memory traffic. However, compared to write-invalidate, update-based protocols require more aggressive memory consistency models and more local buffering in the second-level cache to be effective. In addition, their increased number of global writes may cause increased synchronization overhead in applications with high contention for critical sections.
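The competitive mechanism can be pictured as a per-line counter: keep accepting remote updates only while the local processor keeps using the line, and self-invalidate after a bounded number of consecutive updates with no intervening local access. The sketch below is illustrative only; the threshold value and class layout are hypothetical, not taken from the paper.

```cpp
// Sketch: competitive-update decision logic for one cache line.
#include <cstdio>

struct CacheLine {
    static constexpr int kCompetitiveThreshold = 4; // illustrative value
    bool valid = true;
    int updatesSinceLocalUse = 0;

    void onLocalAccess() { updatesSinceLocalUse = 0; }   // line is live here
    void onRemoteUpdate() {
        if (!valid) return;
        if (++updatesSinceLocalUse >= kCompetitiveThreshold)
            valid = false;  // self-invalidate: stop receiving useless updates
    }
};

int main() {
    CacheLine line;
    for (int i = 0; i < 3; ++i) line.onRemoteUpdate();
    line.onLocalAccess();                       // local read resets the counter
    for (int i = 0; i < 4; ++i) line.onRemoteUpdate();
    std::printf("valid after updates: %d\n", (int)line.valid); // 0
}
```

Lines that the local processor keeps reading behave like write-update (no coherence misses); lines it stops touching degrade gracefully to write-invalidate behavior, bounding the wasted update traffic.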
Journal of Systems and Software | 2007
Piotr Tomaszewski; Jim Håkansson; Håkan Grahn; Lars Lundberg
Statistical fault prediction models and expert estimations are two popular methods for deciding where to focus fault detection efforts when the fault detection budget is limited. In this paper, we present a study in which we empirically compare the accuracy of fault prediction offered by statistical prediction models with the accuracy of expert estimations. The study is performed in an industrial setting. We invited eleven experts who are involved in the development of two large telecommunication systems. Our statistical prediction models are built on historical data describing one release of one of those systems. We compare the performance of these statistical fault prediction models with the performance of our experts when predicting faults in the latest releases of both systems. We show that the statistical methods clearly outperform the expert estimations. We see the ability of the statistical models to cope with large datasets as the main reason for their superiority: it makes it possible to produce reliable predictions for all components in the system, and it enables prediction at a finer-grained level, e.g., at the class level instead of the component level. We show that such a prediction is better from both the theoretical and the practical perspective.
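A toy sketch of the workflow such a model supports: fit coefficients on a past release, then rank the classes of a new release by predicted fault-proneness so inspection effort goes to the top of the list. The metric names and coefficients below are hypothetical, not the paper's fitted model.

```cpp
// Sketch: ranking classes by a linear fault-prediction model.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct ClassMetrics { std::string name; double loc, complexity, churn; };

// Hypothetical coefficients, as if fitted on a previous release.
double predictFaults(const ClassMetrics& c) {
    return 0.002 * c.loc + 0.05 * c.complexity + 0.01 * c.churn;
}

int main() {
    std::vector<ClassMetrics> classes = {
        {"Parser", 1200, 35, 80}, {"Logger", 300, 8, 5}, {"Router", 2500, 60, 200},
    };
    // Inspect the most fault-prone classes first.
    std::sort(classes.begin(), classes.end(),
              [](const ClassMetrics& a, const ClassMetrics& b) {
                  return predictFaults(a) > predictFaults(b);
              });
    for (const auto& c : classes)
        std::printf("%-8s predicted faults: %.2f\n", c.name.c_str(), predictFaults(c));
}
```

Because the model scores every class mechanically, it scales to thousands of classes, which is the "coping with large datasets" advantage the paper attributes to the statistical approach.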
International Parallel Processing Symposium | 1999
Magnus Broberg; Lars Lundberg; Håkan Grahn
Efficient performance tuning of parallel programs is often hard. We present a performance prediction and visualization tool called VPPB. Based on a monitored uni-processor execution, VPPB shows the predicted behaviour of a multithreaded program for any number of processors, and the program behaviour is visualized as a graph. The first version of VPPB was unable to handle I/O operations. The version presented here adds, through an improved tracing technique, the ability to trace activities at the kernel level as well. Thus, VPPB is now able to trace various I/O activities, e.g., manipulation of OS-internal buffers, physical disk I/O, socket I/O, and RPC. VPPB allows flexible performance tuning of parallel programs developed for shared-memory multiprocessors using a standardized environment: C/C++ programs that use the thread package in Solaris 2.X.
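The prediction step can be pictured as replaying recorded work on N simulated processors. The sketch below shows only a greedy makespan estimate for independent segments; VPPB itself replays synchronization and I/O events from the trace, which this omits, and the function and variable names are hypothetical.

```cpp
// Sketch: predict completion time on N processors from traced segments.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Greedy list scheduling: always dispatch the next recorded segment to
// the earliest-free simulated processor; the makespan is the prediction.
double predictCompletion(const std::vector<double>& segmentDurations, int nProcs) {
    std::priority_queue<double, std::vector<double>, std::greater<double>> procFree;
    for (int i = 0; i < nProcs; ++i) procFree.push(0.0);
    double makespan = 0.0;
    for (double d : segmentDurations) {
        double start = procFree.top(); procFree.pop();
        makespan = std::max(makespan, start + d);
        procFree.push(start + d);
    }
    return makespan;
}

int main() {
    std::vector<double> segs = {4, 3, 3, 2, 2, 1};  // durations from a trace
    int procCounts[] = {1, 2, 4};
    for (int p : procCounts)
        std::printf("%d proc(s): %.1f\n", p, predictCompletion(segs, p));
}
```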
Journal of Parallel and Distributed Computing | 1996
Håkan Grahn; Per Stenström
Although directory-based write-invalidate cache coherence protocols have the potential to improve the performance of large-scale multiprocessors, coherence misses limit processor utilization. Therefore, so-called competitive-update protocols (hybrid protocols that dynamically switch between write-invalidate and write-update on a per-block basis) have been considered as a means to reduce the coherence miss rate, and have been shown to be a better coherence policy for a wide range of applications. Unfortunately, such protocols may cause high traffic peaks for applications that make extensive use of migratory objects. These traffic peaks can offset the performance gain of a reduced miss rate if the network bandwidth is not sufficient. In this study we propose to extend a competitive-update protocol with a previously published adaptive mechanism that can dynamically detect migratory objects and reduce the coherence traffic they cause. Detailed architectural simulations based on five scientific and engineering applications show that this adaptive protocol outperforms a write-invalidate protocol by reducing the miss rate and the bandwidth needed by up to 71% and 26%, respectively.
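The detection idea can be sketched as a small amount of extra directory state: a block whose new writer is the same processor that just read it, and differs from the previous writer, is likely migrating between critical sections and can be handed over exclusively instead of being updated. This is a simplified, hypothetical rendering, not the exact state machine of the published mechanism.

```cpp
// Sketch: per-block migratory-sharing detection at the directory.
#include <cstdio>

struct DirEntry {
    int lastWriter = -1, lastReader = -1;
    bool migratory = false;

    void onRead(int p) { lastReader = p; }
    void onWrite(int p) {
        // Read-then-write by a new processor suggests the block migrates
        // between critical sections: hand it over exclusively instead of
        // updating the previous holder's stale copy.
        if (lastWriter != -1 && lastWriter != p && lastReader == p)
            migratory = true;
        lastWriter = p;
    }
};

int main() {
    DirEntry e;
    e.onWrite(0);               // P0 produces the block
    e.onRead(1); e.onWrite(1);  // P1 reads then writes: migratory pattern
    std::printf("migratory: %d\n", (int)e.migratory); // 1
}
```

Once a block is classified as migratory, the protocol stops sending updates for it, which removes exactly the traffic peaks the abstract describes.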
ACS/IEEE International Conference on Computer Systems and Applications | 2011
Håkan Grahn; Niklas Lavesson; Mikael Hellborg Lapajne; Daniel Slat
Machine learning algorithms are frequently applied in data mining applications. Many of the tasks in this domain concern high-dimensional data. Consequently, these tasks are often complex and computationally expensive. This paper presents a GPU-based parallel implementation of the Random Forests algorithm. In contrast to previous work, the proposed algorithm is based on the Compute Unified Device Architecture (CUDA). An experimental comparison between the CUDA-based algorithm (CudaRF) and two state-of-the-art Random Forests implementations (FastRF and LibRF) shows that CudaRF outperforms both FastRF and LibRF for the studied classification task.
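The parallelism that makes Random Forests a good GPU fit is per-tree independence: each tree is built on its own bootstrap sample and votes independently. CudaRF maps this per-tree work onto CUDA; the sketch below shows only the same decomposition on host threads, with tree induction reduced to a stub, so names and structure are hypothetical.

```cpp
// Sketch: Random Forests trained with one worker per tree.
#include <cstdio>
#include <thread>
#include <vector>

struct Tree { unsigned seed; /* fitted structure omitted in this sketch */ };

// Stub standing in for inducing one tree on a bootstrap sample.
Tree trainTree(unsigned seed) { return Tree{seed}; }

std::vector<Tree> trainForest(int nTrees) {
    std::vector<Tree> forest(nTrees);       // preallocated: no races below
    std::vector<std::thread> pool;
    for (int t = 0; t < nTrees; ++t)
        pool.emplace_back([&forest, t] { forest[t] = trainTree(1234u + t); });
    for (auto& th : pool) th.join();
    return forest;
}

int main() {
    auto forest = trainForest(8);
    std::printf("trained %zu trees in parallel\n", forest.size());
}
```

Classification parallelizes the same way: each tree casts its vote independently, and only the final majority vote needs a reduction.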
ACS/IEEE International Conference on Computer Systems and Applications | 2011
Jan Kasper Martinsen; Håkan Grahn
JavaScript has gone from being a mechanism for providing dynamic web pages to an important component of many web applications. Currently, one of the most popular types of web application is the so-called social network, e.g., Facebook, Twitter, and MySpace. However, the workload and execution behavior of JavaScript in this context have not been fully explored or understood. In this paper we present a methodology for characterizing the JavaScript execution behavior in interactive web applications using deterministic execution of use cases. We then apply this methodology to evaluate a set of social network applications and compare their behavior to a set of established JavaScript benchmarks. Our results confirm previous studies showing that the execution behavior of social networks differs from that of established benchmarks. In addition, we identify one difference not previously reported: the use of anonymous functions in web applications.
International Symposium on Computer Architecture | 1995
Håkan Grahn; Per Stenström
The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that the software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware, while directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Although the directory is managed in software, the hardware mechanisms must access the memory state in order to enable high-speed data transfers. Overall, our strategies reach between 60% and 86% of the performance of a hardware-based protocol.
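The core idea can be sketched as a split between a hardware fast path and deferred software bookkeeping: on a read miss to a block that is clean at memory, hardware replies with the data immediately and merely queues the directory update (adding the requester to the sharer set) for a software handler running off the critical path. The state layout and names below are hypothetical, not one of the paper's five strategies verbatim.

```cpp
// Sketch: hardware fast path with deferred software directory update.
#include <cstdint>
#include <cstdio>
#include <queue>

enum class BlockState : uint8_t { CleanAtMemory, DirtyRemote };

struct DirEvent { int block, requester; };

std::queue<DirEvent> pendingDirWork;  // drained later by a software handler
BlockState state[4] = {};             // all blocks start clean at memory

// "Hardware" fast path for a read miss: consult only a small state field.
bool serviceReadMiss(int block, int requester) {
    if (state[block] == BlockState::CleanAtMemory) {
        pendingDirWork.push({block, requester}); // bookkeeping deferred
        return true;   // data sent without waiting for software
    }
    return false;      // dirty at a remote cache: needs the slow path
}

// Software handler, run in the background on a compute processor.
void runDirectoryHandler() {
    while (!pendingDirWork.empty()) {
        DirEvent e = pendingDirWork.front(); pendingDirWork.pop();
        std::printf("add P%d to sharers of block %d\n", e.requester, e.block);
    }
}

int main() {
    state[0] = BlockState::DirtyRemote;          // pretend block 0 is held dirty
    serviceReadMiss(2, 1);                       // fast path, bookkeeping queued
    if (!serviceReadMiss(0, 3))
        std::printf("block 0 needs the software slow path\n");
    runDirectoryHandler();
}
```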
IEEE Computer | 1997
Per Stenström; Mats Brorsson; Fredrik Dahlgren; Håkan Grahn; Michel Dubois
Proposed hardware optimizations to CC-NUMA machines (shared-memory multiprocessors that use cache consistency protocols) can shorten the time processors lose because of cache misses and invalidations. The authors look at cost-performance trade-offs for each.