Bjoern Franke
University of Edinburgh
Publication
Featured research published by Bjoern Franke.
International Conference on Hardware/Software Codesign and System Synthesis | 2009
Daniel Powell; Bjoern Franke
Functional instruction set simulators perform instruction-accurate simulation of benchmarks at high instruction rates. Unlike their slower, but cycle-accurate counterparts, however, they are not capable of providing cycle counts due to the higher level of hardware abstraction. In this paper we present a novel approach to performance prediction based on statistical machine learning, utilizing a hybrid instruction- and cycle-accurate simulator. We introduce the concept of continuous machine learning to simulation, whereby new training data points are acquired on demand and used for on-the-fly updates of the performance model. Furthermore, we show how statistical regression can be adapted to reduce the cost of these updates during a performance-critical simulation. For a state-of-the-art simulator modeling the ARC 750D embedded processor we demonstrate that our approach is highly accurate, with an average error of less than 2.5%, while achieving a speed-up of approximately 50% over the baseline cycle-accurate simulation.
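To make the continuous-learning idea concrete, here is a minimal sketch assuming a simple linear timing model updated via incremental normal equations; the feature classes, update policy, and all numbers are illustrative and not the paper's actual model or simulator interface.

```python
# Minimal sketch of on-the-fly performance-model updates: a linear model maps
# per-region instruction-class counts to cycle counts and is refreshed whenever
# a new cycle-accurate sample is acquired on demand. Names are illustrative.
import numpy as np

class OnlineCycleModel:
    def __init__(self, num_features, ridge=1e-3):
        # Normal-equation accumulators allow cheap incremental updates.
        self.A = ridge * np.eye(num_features)
        self.b = np.zeros(num_features)
        self.w = np.zeros(num_features)

    def add_sample(self, feature_counts, observed_cycles):
        x = np.asarray(feature_counts, dtype=float)
        self.A += np.outer(x, x)          # rank-1 update of X^T X
        self.b += observed_cycles * x     # update of X^T y
        self.w = np.linalg.solve(self.A, self.b)

    def predict(self, feature_counts):
        return float(np.dot(self.w, feature_counts))

# Hypothetical usage: features = counts of [alu, load, store, branch] operations.
model = OnlineCycleModel(num_features=4)
model.add_sample([120, 30, 10, 15], observed_cycles=260)   # from cycle-accurate mode
model.add_sample([200, 80, 40, 25], observed_cycles=560)
print(model.predict([150, 50, 20, 20]))                    # fast functional-mode estimate
```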
IEEE International Symposium on Workload Characterization | 2014
Volker Seeker; Pavlos Petoumenos; Hugh Leather; Bjoern Franke
Mobile computing devices such as smartphones and tablets have become tightly integrated with many people's lives, both at work and at home. Users spend large amounts of time interacting with their mobile device and demand an excellent user experience in terms of responsiveness, whilst simultaneously expecting a long battery life between charging cycles. Frequency governors, responsible for increasing or decreasing the CPU clock frequency depending on the current workload and external events, try to balance the two contrasting goals of high performance and low energy consumption. However, despite their critical role in providing energy efficiency, it is difficult to measure the effectiveness of frequency governors in an interactive environment. In this paper we develop a novel methodology for creating repeatable, fully automated, realistic workloads that can accurately measure time lag in interactive applications resulting from non-optimally selected operating frequencies. We also introduce a new metric capturing the user experience for different Android frequency governors. We evaluate interactive workloads to demonstrate how our approach enables us to automatically record and replay sequences of user interactions for different system configurations. We demonstrate that none of the available Android frequency governors performs particularly well, leaving substantial room for improvement. We show that energy savings of up to 27% are possible, whilst delivering a user experience that is better than that provided by the standard Android frequency governor. We also show that it is possible to save 47% energy with performance that is indistinguishable from permanently running the CPU at the highest frequency.
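As a rough illustration of the kind of user-experience metric the paper argues for (not the paper's actual definition), the sketch below compares per-interaction response times under a governor against a replay at maximum frequency, counting only slowdowns beyond an assumed perceptibility threshold.

```python
# Illustrative lag metric: compare response times of the same replayed
# interactions under a frequency governor vs. at maximum CPU frequency,
# counting only the perceptible portion of any slowdown.
PERCEPTIBILITY_MS = 100  # assumed threshold; the study derives this empirically

def perceived_lag(governor_times_ms, max_freq_times_ms):
    """Return total and mean perceptible extra lag across interactions."""
    extra = [
        max(0.0, g - m - PERCEPTIBILITY_MS)
        for g, m in zip(governor_times_ms, max_freq_times_ms)
    ]
    total = sum(extra)
    return total, (total / len(extra) if extra else 0.0)

# Hypothetical replayed workload: response time of each UI interaction in ms.
ondemand = [180, 95, 310, 120]
performance = [90, 80, 140, 100]
print(perceived_lag(ondemand, performance))
```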
Languages, Compilers and Tools for Embedded Systems | 2015
Stanislav Manilov; Bjoern Franke; Anthony James Magrath; Cedric Andrieu
Short-vector SIMD and DSP instructions are popular extensions to common ISAs. These extensions deliver excellent performance and compact code for some compute-intensive applications, but they require specialised compiler support. To enable the programmer to explicitly request the use of such an instruction, many C compilers provide platform-specific intrinsic functions, whose implementation is handled specially by the compiler. The use of such intrinsics, however, inevitably results in non-portable code. In this paper we develop a novel methodology for retargeting such non-portable code, which maps intrinsics from one platform to another, taking advantage of similar intrinsics on the target platform. We employ a description language to specify the signature and semantics of intrinsics and perform graph-based pattern matching and high-level code transformations to derive optimised implementations exploiting the target's intrinsics, wherever possible. We demonstrate the effectiveness of our new methodology, implemented in the Free Rider tool, by automatically retargeting benchmarks derived from OpenCV samples and a complex embedded application optimised to run on an Arm Cortex-M4 to an Intel Edison module with SSE4.2 instructions. We achieve a speedup of up to 3.73× over a plain C baseline, and on average 96.0% of the speedup of manually ported and optimised versions of the benchmarks.
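The retargeting approach can be pictured, in greatly simplified form, as matching each source intrinsic against an abstract semantic description and emitting a target intrinsic with matching semantics, falling back to plain C otherwise. The descriptions and mappings below are illustrative only; lane widths and the graph-based matching of whole expressions are glossed over.

```python
# Toy sketch of intrinsic retargeting: source intrinsics are described by an
# abstract operation, and each call is rewritten to a target intrinsic with
# the same semantics or lowered to a scalar fallback. Names are illustrative.
SEMANTICS = {
    # source intrinsic -> abstract operation
    "__qadd16": "vec_add_sat_i16",
    "vadd_s16": "vec_add_i16",
}
TARGET = {
    # abstract operation -> target intrinsic
    "vec_add_i16":     "_mm_add_epi16",
    "vec_add_sat_i16": "_mm_adds_epi16",
}

def retarget_call(name, args):
    """Map one intrinsic call to the target platform, else fall back to scalar C."""
    op = SEMANTICS.get(name)
    if op and op in TARGET:
        return f"{TARGET[op]}({', '.join(args)})"
    return f"/* no match: lowered */ scalar_{name}({', '.join(args)})"

print(retarget_call("vadd_s16", ["a", "b"]))   # -> _mm_add_epi16(a, b)
print(retarget_call("__ssat", ["x", "8"]))     # -> scalar fallback path
```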
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2015
Luna Backes; Alejandro Rico; Bjoern Franke
Computer vision (CV) is widely expected to be the next big thing in mobile computing. The availability of a camera and a large number of sensors in mobile devices will enable CV applications that understand the environment and enhance people's lives through augmented reality. One of the problems yet to solve is how to transfer demanding state-of-the-art CV algorithms (designed to run on powerful desktop computers with several GPUs) onto energy-efficient, but slow, processors and GPUs found in mobile devices. To compensate for this lack of performance, current CV applications for mobile devices are simpler versions of more complex algorithms, which generally run slowly and unreliably and provide a poor user experience. In this paper, we investigate ways to speed up demanding CV applications to run faster on mobile devices. We selected KinectFusion (KF) as a representative CV application. The KF application constructs a 3D model from the images captured by a Kinect. After porting it to an ARM platform, we applied several optimisation and parallelisation techniques using OpenCL to exploit all the available computing resources. We evaluated the impact on performance and power and demonstrate a 4× speedup with just a 1.38× power increase. We also evaluated the performance portability of our optimisations by running on a different platform, and observed similar improvements despite the different multi-core configuration and memory system. By measuring processor temperature, we found overheating to be the main limiting factor for running such high-performance codes on a mobile device not designed for full continuous utilisation.
Springer US | 2010
Nigel P. Topham; Bjoern Franke; Daniel Jones; Daniel Powell
Instruction set simulators are essential tools in all forms of microprocessor design; simulators play a key role in activities ranging from ASIP design-space exploration to hardware–software co-verification and software development. Simulation speed is the primary concern when functional simulators are used as CPU emulators for software development. Conversely, the ability to measure performance is of critical importance during the exploratory phases of co-design, whereas the ability to use a simulator as a golden reference model is important for hardware–software co-verification. A key challenge is to provide the highest level of performance, for the different observability and performance measuring demands of each use-case. In this chapter, we describe an adaptive simulator designed to meet these diverse requirements. Adaptation takes two forms: first, the simulator has a high-speed JIT compilation capability allowing it to be extended dynamically according to simulated program behavior; and second, it is able to learn how to model the timing behavior of the target processor and thereby deliver approximate performance figures with very low overhead. The simulator maintains a precise model of the architectural state of the processor it simulates, enabling it to be used also as a back-end target for a debugger, to assist in software development, as well as providing a Golden Reference Model to a co-simulation environment. Through the use of these performance-enhancing dynamic adaptations, the simulator is capable of simulating an embedded system at speeds approaching, or even exceeding, real time.
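A heavily simplified sketch of the two adaptations described in this chapter: hot basic blocks are promoted to a "compiled" fast path once an execution-count threshold is reached, and cycle estimates are accumulated from a learned per-block timing table. Block names, thresholds and cycle counts are invented for the example and do not reflect the simulator's actual design.

```python
# Sketch of an adaptive simulator loop: hotness-driven JIT promotion plus a
# learned per-block timing model for low-overhead performance estimates.
JIT_THRESHOLD = 50                                 # assumed hotness threshold
LEARNED_CYCLES = {"B0": 12, "B1": 7, "B2": 20}     # per-block cycle estimates

class AdaptiveSimulator:
    def __init__(self):
        self.exec_count = {}
        self.compiled = set()
        self.estimated_cycles = 0

    def run_block(self, block_id):
        n = self.exec_count.get(block_id, 0) + 1
        self.exec_count[block_id] = n
        if block_id not in self.compiled and n >= JIT_THRESHOLD:
            self.compiled.add(block_id)            # stand-in for invoking the JIT
        self.estimated_cycles += LEARNED_CYCLES.get(block_id, 10)

sim = AdaptiveSimulator()
for _ in range(100):
    for b in ("B0", "B1", "B0", "B2"):
        sim.run_block(b)
print(sim.compiled, sim.estimated_cycles)
```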
International Symposium on Performance Analysis of Systems and Software | 2017
Harry Wagstaff; Bruno Bodin; Tom Spink; Bjoern Franke
Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on ‘real-world’ workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks.
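SimBench itself runs bare-metal on the simulated target, but the targeted micro-benchmark idea can be illustrated with a small host-side harness: each test stresses one mechanism in isolation and reports its own rate, so a slow category points directly at the corresponding simulator subsystem. The categories and iteration counts below are illustrative.

```python
# Illustration of targeted micro-benchmarking: time each mechanism separately
# so per-category rates, rather than one aggregate score, expose bottlenecks.
import time

def bench(label, fn, iters=200_000):
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    print(f"{label:20s} {iters / elapsed:12.0f} ops/s")

data = list(range(4096))

bench("arithmetic", lambda: sum((1, 2, 3)))     # stand-in for CPU-bound work
bench("memory access", lambda: data[::64])      # strided reads

def raise_and_catch():
    try:
        raise ValueError
    except ValueError:
        pass

bench("exception handling", raise_and_catch)    # control-flow diversion
```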
ACM Transactions on Architecture and Code Optimization | 2016
Tom Spink; Harry Wagstaff; Bjoern Franke
Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC, and MIPS, these extensions are targeted at virtualization of the same architecture, for example, an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, for example, an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory, and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article, we present a new hardware-accelerated hypervisor called Captive, employing a range of novel techniques that exploit existing hardware virtualization extensions for improving the performance of full-system cross-platform virtualization. We illustrate how (1) guest memory management unit (MMU) events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based dynamic binary translation engine inside the virtual machine can improve CPU virtualization performance, (3) memory-mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support, we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88× over state-of-the-art QEMU and 2.5× on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS (million instructions per second) and 915.52 MIPS, on average.
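Point (3) above, translating memory-mapped guest I/O into fast device-specific calls, can be sketched as an address-range dispatch: MMIO accesses are routed directly to an emulated device, while ordinary RAM accesses take a fast path. The bus layout, device, and register offsets are invented for the example and do not reflect Captive's implementation.

```python
# Sketch of MMIO dispatch: guest stores into a device's address range become
# direct calls into the emulated device; everything else hits guest RAM.
RAM_SIZE = 0x1000
ram = bytearray(RAM_SIZE)

class UartDevice:                        # hypothetical emulated device
    BASE, SIZE = 0x9000_0000, 0x100
    def write(self, offset, value):
        if offset == 0x0:                # assumed "transmit" register
            print(chr(value), end="")

class Bus:
    def __init__(self, devices):
        self.devices = devices
    def store8(self, addr, value):
        for dev in self.devices:         # MMIO path: I/O-specific device call
            if dev.BASE <= addr < dev.BASE + dev.SIZE:
                dev.write(addr - dev.BASE, value)
                return
        ram[addr % RAM_SIZE] = value     # fast path for guest RAM

bus = Bus([UartDevice()])
for ch in "ok\n":
    bus.store8(0x9000_0000, ord(ch))     # guest MMIO write -> device callback
bus.store8(0x10, 0x42)                   # guest RAM write -> memory fast path
```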
ACM Transactions on Embedded Computing Systems | 2017
Stanislav Manilov; Bjoern Franke; Anthony James Magrath; Cedric Andrieu
Short-vector SIMD and DSP instructions are popular extensions to common ISAs. These extensions deliver excellent performance and compact code for some compute-intensive applications, but they require specialized compiler support. To enable the programmer to explicitly request the use of such an instruction, many C compilers provide platform-specific intrinsic functions, whose implementation is handled specially by the compiler. The use of such intrinsics, however, inevitably results in nonportable code. In this article, we develop a novel methodology for retargeting such nonportable code, which maps intrinsics from one platform to another, taking advantage of similar intrinsics on the target platform. We employ a description language to specify the signature and semantics of intrinsics and perform graph-based pattern matching and high-level code transformations to derive optimized implementations exploiting the target's intrinsics, wherever possible. We demonstrate the effectiveness of our new methodology, implemented in the Free Rider tool, by automatically retargeting benchmarks derived from OpenCV samples and a complex embedded application optimized to run on an Arm Cortex-M4 to an Intel Edison module with SSE4.2 instructions (and vice versa). We achieve a speedup of up to 3.73× over a plain C baseline, and on average 96.0% of the speedup of manually ported and optimized versions of the benchmarks.
ACM | 2016
Tom Spink; Harry Wagstaff; Bjoern Franke
Instruction set simulators (ISS) have many uses in embedded software and hardware development and are typically based on dynamic binary translation (DBT), where frequently executed regions of guest instructions are compiled into host instructions using a just-in-time (JIT) compiler. Full-system simulation, which necessitates handling of asynchronous interrupts from, e.g., timers and I/O devices, complicates matters as control flow is interrupted unpredictably and diverted from the current region of code. In this paper we present a novel scheme for handling asynchronous interrupts, which integrates seamlessly into a region-based dynamic binary translator. We first show that our scheme is correct, i.e. interrupt handling is not deferred indefinitely, even in the presence of code regions comprising control flow loops. We then demonstrate that our new interrupt handling scheme is efficient, as we minimise the number of inserted checks. Interrupt handlers are also presented to the JIT compiler and compiled to native code, further enhancing the performance of our system. We have evaluated our scheme in an ARM simulator using a region-based JIT compilation strategy. We demonstrate that our solution reduces the number of dynamic interrupt checks by 73%, reduces interrupt service latency by 26% and improves throughput of an I/O-bound workload by 7%, over traditional per-block schemes.
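The check-minimisation argument can be illustrated as follows: within a translated region, interrupt checks are only required where control flow could loop without leaving the region, i.e. on back-edges; straight-line edges need none because the region is exited after a bounded number of blocks. A minimal sketch with a made-up region CFG, not the paper's exact placement algorithm:

```python
# Find back-edges of a region's control-flow graph via DFS; these are the
# only edges that receive an asynchronous-interrupt check in this sketch.
def back_edges(cfg, entry):
    """Return edges (u, v) where v is an ancestor of u on the DFS stack."""
    edges, state = [], {}            # state: 1 = on stack, 2 = done
    def dfs(u):
        state[u] = 1
        for v in cfg.get(u, []):
            if state.get(v) == 1:
                edges.append((u, v))
            elif v not in state:
                dfs(v)
        state[u] = 2
    dfs(entry)
    return edges

region_cfg = {"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}
print("interrupt checks on:", back_edges(region_cfg, "A"))   # [('C', 'B')]
```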
IEEE | 2014
Harry Wagstaff; Tom Spink; Bjoern Franke
Processor design tools integrate into their workflows generators for instruction set simulators (ISS) from architecture descriptions. However, it is difficult to validate the correctness of these simulators. ISA coverage analysis is insufficient to isolate modelling faults, which might only be exposed in corner cases. We present a novel ISA branch coverage analysis, which considers every possible execution path within an instruction and, on demand, generates new test cases to cover the missing paths. We have applied this analysis to the industry-standard EEMBC and SPEC CPU2006 benchmarks and show that, for an ARMv5T model, neither of these benchmark suites provides sufficient ISA coverage to exercise every path through each instruction of the whole instruction set.
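To illustrate what ISA branch coverage measures, in a much-simplified form where an instruction's internal branches are modelled as independent predicates, the sketch below enumerates all predicate-outcome combinations for a hypothetical ADD instruction and reports the paths not exercised by a set of test encodings; these would seed on-demand test generation. The predicates and encodings are invented for the example.

```python
# Sketch of ISA branch coverage: every combination of internal predicate
# outcomes is a path; uncovered paths are reported as targets for new tests.
from itertools import product

# Hypothetical predicates inside an ADD instruction's semantics.
PREDICATES = {
    "sets_flags":  lambda ins: ins["S"] == 1,
    "dest_is_pc":  lambda ins: ins["rd"] == 15,
    "imm_operand": lambda ins: ins["imm"],
}

def path_of(ins):
    return tuple(p(ins) for p in PREDICATES.values())

all_paths = set(product([False, True], repeat=len(PREDICATES)))

executed = [
    {"S": 0, "rd": 3, "imm": False},
    {"S": 1, "rd": 3, "imm": True},
]
covered = {path_of(i) for i in executed}

for missing in sorted(all_paths - covered):
    wanted = dict(zip(PREDICATES, missing))
    print("need a test where", wanted)   # seed for on-demand test generation
```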