Harry Wagstaff | Researchain

Publication

Featured research published by Harry Wagstaff.

Design Automation Conference | 2013

Early partial evaluation in a JIT-compiled, retargetable instruction set simulator generated from a high-level architecture description

Harry Wagstaff; Miles Gould; Björn Franke; Nigel P. Topham

Modern processor design tools integrate in their workflows generators for instruction set simulators (ISS) from architecture descriptions. Whilst these generated simulators are useful for design evaluation and software development, they suffer from poor performance. We present an ultra-fast JIT-compiled ISS generated from an ArchC description. We also introduce a novel partial evaluation optimisation, which further improves JIT compilation time and code quality. This results in a simulation rate of 510 MIPS for an ARM target across 45 EEMBC and SPEC benchmarks. On average, our ISS is 1.7 times faster than SimIt-ARM, one of the fastest ISS generated from an architecture description.

International Conference on Parallel Architectures and Compilation Techniques | 2016

Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding

Bruno Bodin; Luigi Nardi; M. Zeeshan Zia; Harry Wagstaff; Govind Sreekar Shenoy; Murali Emani; John Mawer; Christos Kotselidis; Andy Nisbet; Mikel Luján; Björn Franke; Paul H. J. Kelly; Michael F. P. O'Boyle

System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real-time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality of result objective. As the design space is too large to exhaustively evaluate, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8× execution time improvement and a 2.8× power reduction compared to the state-of-the-art.

Languages, Compilers, and Tools for Embedded Systems | 2014

Efficient code generation in a region-based dynamic binary translator

Tom Spink; Harry Wagstaff; Björn Franke; Nigel P. Topham

Region-based JIT compilation operates on translation units comprising multiple basic blocks and, possibly cyclic or conditional, control flow between these. It promises to reconcile aggressive code optimisation and low compilation latency in performance-critical dynamic binary translators. Whilst various region selection schemes and isolated code optimisation techniques have been investigated, it remains unclear how to best exploit such regions for efficient code generation. Complex interactions with indirect branch tables and translation caches can have adverse effects on performance if not considered carefully. In this paper we present a complete code generation strategy for a region-based dynamic binary translator, which exploits branch type and control flow profiling information to improve code quality for the common case. We demonstrate that using our code generation strategy a competitive region-based dynamic compiler can be built on top of the LLVM JIT compilation framework. For the ARM-V5T target ISA and SPEC CPU 2006 benchmarks we achieve execution rates of, on average, 867 MIPS and up to 1323 MIPS on a standard x86 host machine, outperforming state-of-the-art QEMU-ARM by delivering a speedup of 264%.

International Symposium on Performance Analysis of Systems and Software | 2017

SimBench: A portable benchmarking methodology for full-system simulators

Harry Wagstaff; Bruno Bodin; Tom Spink; Bjoern Franke

Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on ‘real-world’ workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks.

ACM Transactions on Architecture and Code Optimization | 2016

Hardware-Accelerated Cross-Architecture Full-System Virtualization

Tom Spink; Harry Wagstaff; Bjoern Franke

Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC, and MIPS, these extensions are targeted at virtualization of the same architecture, for example, an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, for example, an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory, and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article, we present a new hardware-accelerated hypervisor called Captive, employing a range of novel techniques that exploit existing hardware virtualization extensions for improving the performance of full-system cross-platform virtualization. We illustrate how (1) guest memory management unit (MMU) events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based dynamic binary translation engine inside the virtual machine can improve CPU virtualization performance, (3) memory-mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support, we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88× over state-of-the-art QEMU and 2.5× on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS (million instructions per second) and 915.52 MIPS, on average.

ACM | 2016

LCTES 2016 Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems

Tom Spink; Harry Wagstaff; Bjoern Franke

Instruction set simulators (ISS) have many uses in embedded software and hardware development and are typically based on dynamic binary translation (DBT), where frequently executed regions of guest instructions are compiled into host instructions using a just-in-time (JIT) compiler. Full-system simulation, which necessitates handling of asynchronous interrupts from e.g. timers and I/O devices, complicates matters as control flow is interrupted unpredictably and diverted from the current region of code. In this paper we present a novel scheme for handling of asynchronous interrupts, which integrates seamlessly into a region-based dynamic binary translator. We first show that our scheme is correct, i.e. interrupt handling is not deferred indefinitely, even in the presence of code regions comprising control flow loops. We demonstrate that our new interrupt handling scheme is efficient as we minimise the number of inserted checks. Interrupt handlers are also presented to the JIT compiler and compiled to native code, further enhancing the performance of our system. We have evaluated our scheme in an ARM simulator using a region-based JIT compilation strategy. We demonstrate that our solution reduces the number of dynamic interrupt checks by 73%, reduces interrupt service latency by 26% and improves throughput of an I/O bound workload by 7%, over traditional per-block schemes.

International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2015

Efficient dual-ISA support in a retargetable, asynchronous Dynamic Binary Translator

Tom Spink; Harry Wagstaff; Björn Franke; Nigel P. Topham

Dynamic Binary Translation (DBT) allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Some modern DBT systems decouple their main execution loop from the built-in Just-In-Time (JIT) compiler, i.e. the JIT compiler can operate asynchronously in a different thread without blocking program execution. However, this creates a problem for target architectures with dual-ISA support such as ARM/THUMB, where the ISA of the currently executed instruction stream may be different to the one processed by the JIT compiler due to their decoupled operation and dynamic mode changes. In this paper we present a new approach for dual-ISA support in such an asynchronous DBT system, which integrates ISA mode tracking and hot-swapping of software instruction decoders. We demonstrate how this can be achieved in a retargetable DBT system, where the target ISA is not hard-coded, but a processor-specific module is generated from a high-level architecture description. We have implemented ARM V5T support in our DBT and demonstrate execution rates of up to 1148 MIPS for the SPEC CPU 2006 benchmarks compiled for ARM/THUMB, achieving on average 192%, and up to 323%, of the speed of QEMU, which has been subject to intensive manual performance tuning and requires significant low-level effort for retargeting.

IEEE | 2014

Automated ISA branch coverage analysis and test case generation for retargetable instruction set simulators

Harry Wagstaff; Tom Spink; Bjoern Franke

Processor design tools integrate in their workflows generators for instruction set simulators (ISS) from architecture descriptions. However, it is difficult to validate the correctness of these simulators. ISA coverage analysis is insufficient to isolate modelling faults, which might only be exposed in corner cases. We present a novel ISA branch coverage analysis, which considers every possible execution path within an instruction and, on demand, generates new test cases to cover the missing paths. We have applied this analysis to industry standard EEMBC and SPEC CPU2006 benchmarks and show that for an ARM V5T model neither of these benchmark suites provides sufficient ISA coverage to exercise every path through each instruction of the whole instruction set.

IEEE | 2014

Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2014 International Conference on

Harry Wagstaff; Tom Spink; Bjoern Franke

International Conference on Robotics and Automation | 2018

SLAMBench2: Multi-Objective Head-to-Head Benchmarking for Visual SLAM

Bruno Bodin; Harry Wagstaff; Sajad Saeedi; Luigi Nardi; Emanuele Vespa; John Mawer; Andy Nisbet; Mikel Luján; Steve B. Furber; Andrew J. Davison; Paul H. J. Kelly; Michael O’Boyle
