John Wickerson | Researchain

Publication

Featured research published by John Wickerson.

architectural support for programming languages and operating systems | 2015

GPU Concurrency: Weak Behaviours and Programming Assumptions

Jade Alglave; Mark Batty; Alastair F. Donaldson; Ganesh Gopalakrishnan; Jeroen Ketema; Daniel Poetzl; Tyler Sorensen; John Wickerson

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming (often supported by official tutorials) as false. As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
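To make the notion of a litmus test concrete, here is a minimal sketch of the classic message-passing test. It is not the tool described in the abstract: it only enumerates every interleaving of two tiny threads against a single shared memory (that is, under sequential consistency) and collects the possible outcomes. The encoding of operations as tuples and the names `T0`, `T1`, `interleavings` and `run` are assumptions made for this illustration.

```python
from itertools import combinations

# Message-passing (MP) litmus test, written as two tiny threads.
# Thread 0: x = 1; flag = 1        Thread 1: r1 = flag; r2 = x
T0 = [("store", "x", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r1"), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory (i.e. under SC)."""
    mem = {"x": 0, "flag": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


sc_outcomes = {run(s) for s in interleavings(T0, T1)}
weak_outcome = (1, 0)  # flag seen as set, but the data write not yet visible
print(sorted(sc_outcomes))
print("weak outcome allowed under SC:", weak_outcome in sc_outcomes)
```

Because no interleaving yields `r1 == 1` and `r2 == 0`, observing that outcome on real hardware is evidence of a weak (non-sequentially-consistent) behaviour, which is the kind of observation the empirical study is built around.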

symposium on principles of programming languages | 2016

Overhauling SC atomics in C11 and OpenCL

Mark Batty; Alastair F. Donaldson; John Wickerson

Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is that the SC operations in an execution no longer need to be totally ordered. This relaxation enables, for the first time, efficient and exhaustive simulation of litmus tests that use SC atomics. We extend our improved C11 model to obtain the first rigorous memory model formalisation for OpenCL (which extends C11 with support for heterogeneous many-core programming). In the OpenCL setting, we refine the SC axioms still further to give a sensible semantics to SC operations that employ a ‘memory scope’ to restrict their visibility to specific threads. Our overhaul requires slight strengthenings of both the C11 and the OpenCL memory models, causing some behaviours to become disallowed. We argue that these strengthenings are natural, and that all of the formalised C11 and OpenCL compilation schemes of which we are aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL) remain valid in our revised models. Using the HERD memory model simulator, we show that our overhaul leads to an exponential improvement in simulation time for C11 litmus tests compared with the original model, making exhaustive simulation competitive, time-wise, with the non-exhaustive CDSChecker tool.

We study how the C11 memory model can be simplified and how it can be extended. Our first contribution is to propose a mild strengthening of the model that enables the rules pertaining to sequentially-consistent (SC) operations to be significantly simplified. We eliminate one of the total orders that candidate executions must range over, leading to a model that is significantly faster to simulate. Our endeavours to simplify the C11 memory model are particularly timely, now that it provides a foundation for memory models of more exotic languages – such as OpenCL 2.0, an extension of C for programming heterogeneous systems composed of CPUs, GPUs and other devices. Our second contribution is the first mechanised formalisation of the OpenCL 2.0 memory model, extending our simplified C11 model. Our C11 and OpenCL memory model formalisations are expressed in the .cat language of Alglave et al., the native input format of the herd memory model simulator. Originally designed for the efficient simulation of hardware memory models, we have extended herd to support language-level memory models.

ACM Transactions on Programming Languages and Systems | 2015

The Design and Implementation of a Verification Technique for GPU Kernels

Adam Betts; Nathan Chong; Alastair F. Donaldson; Jeroen Ketema; Shaz Qadeer; Paul Thomson; John Wickerson

We present a technique for the formal verification of GPU kernels, addressing two classes of correctness properties: data races and barrier divergence. Our approach is founded on a novel formal operational semantics for GPU kernels termed synchronous, delayed visibility (SDV) semantics, which captures the execution of a GPU kernel by multiple groups of threads. The SDV semantics provides operational definitions for barrier divergence and for both inter- and intra-group data races. We build on the semantics to develop a method for reducing the task of verifying a massively parallel GPU kernel to that of verifying a sequential program. This completely avoids the need to reason about thread interleavings, and allows existing techniques for sequential program verification to be leveraged. We describe an efficient encoding of data race detection and propose a method for automatically inferring the loop invariants that are required for verification. We have implemented these techniques as a practical verification tool, GPUVerify, that can be applied directly to OpenCL and CUDA source code. We evaluate GPUVerify with respect to a set of 162 kernels drawn from public and commercial sources. Our evaluation demonstrates that GPUVerify is capable of efficient, automatic verification of a large number of real-world kernels.
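The data-race property itself can be illustrated with a small brute-force check. This sketch is not GPUVerify's method, which works statically on kernel source; it simply assumes that each thread's accesses between two barriers are given as index lists and tests every pair of distinct threads for a conflict. The function `find_race` and the two toy kernel bodies are hypothetical.

```python
from itertools import combinations


def find_race(num_threads, reads, writes):
    """Return (t1, t2, index) for the first conflicting pair, or None.

    reads(tid) and writes(tid) give the array indices a thread touches
    between two barriers. Two distinct threads race if they touch the same
    index and at least one of the two accesses is a write.
    """
    for t1, t2 in combinations(range(num_threads), 2):
        w1, w2 = set(writes(t1)), set(writes(t2))
        r1, r2 = set(reads(t1)), set(reads(t2))
        conflicts = (w1 & w2) | (w1 & r2) | (r1 & w2)
        if conflicts:
            return t1, t2, min(conflicts)
    return None


N = 8

# Racy kernel body: a[tid] = a[tid + 1]  (the neighbour may already be overwritten)
racy = find_race(N - 1, reads=lambda t: [t + 1], writes=lambda t: [t])

# Race-free kernel body: b[tid] = a[tid] + a[tid + 1]  (a is only read)
# Arrays a and b are modelled as disjoint index ranges of one flat memory.
safe = find_race(N - 1,
                 reads=lambda t: [t, t + 1],
                 writes=lambda t: [N + t])
print(racy, safe)
```

The first kernel is flagged because thread 0 reads the element that thread 1 writes; the second passes because writes go to a separate array and each thread writes a distinct element.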

conference on object oriented programming systems languages and applications | 2015

Remote-scope promotion: clarified, rectified, and verified

John Wickerson; Mark Batty; Bradford M. Beckmann; Alastair F. Donaldson

Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue, that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.

Software and Systems Safety - Specification and Verification | 2011

Unifying Models of Data Flow

Tony Hoare; John Wickerson

We propose a model of computation, based on data flow, that unifies several disparate programming phenomena, including local and shared variables, synchronised and buffered communication, reliable and unreliable channels, dynamic and static allocation, explicit and garbage-collected disposal, fine-grained and coarse-grained concurrency, and weakly and strongly consistent memory.

european symposium on programming | 2013

Ribbon proofs for separation logic

John Wickerson; Mike Dodds; Matthew J. Parkinson

We present ribbon proofs, a diagrammatic system for proving program correctness based on separation logic. Ribbon proofs emphasise the structure of a proof, so are intelligible and pedagogical. Because they contain less redundancy than proof outlines, and allow each proof step to be checked locally, they may be more scalable. Where proof outlines are cumbersome to modify, ribbon proofs can be visually manoeuvred to yield proofs of variant programs. This paper introduces the ribbon proof system, proves its soundness and completeness, and outlines a prototype tool for validating the diagrams in Isabelle.

field programmable gate arrays | 2016

Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis

Xitong Gao; John Wickerson; George A. Constantinides

Loops are pervasive in numerical programs, so high-level synthesis (HLS) tools use state-of-the-art scheduling techniques to pipeline them efficiently. Still, the run time performance of the resultant FPGA implementation is limited by data dependences between loop iterations. Some of these dependence constraints can be alleviated by rewriting the program according to arithmetic identities (e.g. associativity and distributivity), memory access reductions, and control flow optimisations (e.g. partial loop unrolling). HLS tools cannot safely enable such rewrites by default because they may impact the accuracy of floating-point computations and increase area usage. In this paper, we introduce the first open-source program optimizer for automatically rewriting a given program to optimize latency while controlling for accuracy and area. Our tool, SOAP3, reports a multi-dimensional Pareto frontier that the programmer can use to resolve the trade-off according to their needs. When applied to a suite of PolyBench and Livermore Loops benchmarks, our tool has generated programs that enjoy up to a 12x speedup, with a simultaneous 7x increase in accuracy, at a cost of up to 4x more LUTs.
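The tension between latency and accuracy when rewriting by associativity can be seen in a few lines. This sketch is not SOAP3; it only compares a left-to-right sum with a pairwise (reassociated) sum, counting the length of the longest chain of dependent additions as a rough stand-in for latency. The helper names and the input data are chosen for this illustration.

```python
import math


def chain_sum(xs):
    """Left-to-right sum: each addition waits for the previous one."""
    total, depth = xs[0], 0
    for x in xs[1:]:
        total, depth = total + x, depth + 1
    return total, depth


def tree_sum(xs):
    """Pairwise (reassociated) sum: additions on the same level are independent."""
    xs, depth = list(xs), 0
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
        depth += 1
    return xs[0], depth


data = [1e16, 1.0, -1e16, 1.0]
chain_val, chain_depth = chain_sum(data)
tree_val, tree_depth = tree_sum(data)
exact = math.fsum(data)  # correctly rounded reference result

print("chain:", chain_val, "depth", chain_depth)
print("tree: ", tree_val, "depth", tree_depth)
print("exact:", exact)
```

The pairwise form has a shorter dependence chain, yet on this input it returns a different floating-point result from the left-to-right form, and neither matches the correctly rounded sum. This is why such rewrites cannot be applied blindly and why reporting a trade-off frontier is useful.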

field-programmable technology | 2015

Custom-sized caches in application-specific memory hierarchies

Felix Winterstein; Kermin Fleming; Hsin-Jung Yang; John Wickerson; George A. Constantinides

Developing FPGA implementations with an input specification in a high-level programming language such as C/C++ or OpenCL allows for a substantially shortened design cycle compared to a design entry at register transfer level. This work targets high-level synthesis (HLS) implementations that process large amounts of data and therefore require access to an off-chip memory. We leverage the customizability of the FPGA on-chip memory to automatically construct a multi-cache architecture in order to enhance the performance of the interface between parallel functional units of the HLS core and an external memory. Our focus is on automatic cache sizing. Firstly, our technique identifies left-over block RAM resources and uses them for the construction of on-chip caches. Secondly, we devise a high-level cache performance estimation based on the memory access trace of the program. We use this memory trace to find a heterogeneous configuration of cache sizes, tailored to the application's memory access characteristics, that maximizes the performance of the multi-cache system subject to an on-chip memory resource constraint. We evaluate our technique with three benchmark implementations on an FPGA board and obtain a reduction in execution latency of up to 2× (1.5× on average) when compared to a one-size-fits-all cache sizing. We also quantify the impact of our automatically generated cache system on the overall energy consumption of the implementation.
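Trace-driven cache sizing under a resource budget can be sketched as follows. The abstract does not specify the estimator, so this example makes loud assumptions: each cache is modelled as fully associative with LRU replacement, performance is measured as total hits, and the search is exhaustive over a small list of candidate sizes. The functions `lru_hits` and `best_sizes` and both traces are hypothetical.

```python
from collections import OrderedDict
from itertools import product


def lru_hits(trace, size):
    """Count hits for a fully associative LRU cache holding `size` lines."""
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            if len(cache) >= size:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return hits


def best_sizes(traces, candidates, budget):
    """Exhaustively pick one size per cache to maximise total hits within budget."""
    best = None
    for sizes in product(candidates, repeat=len(traces)):
        if sum(sizes) > budget:
            continue
        hits = sum(lru_hits(t, s) for t, s in zip(traces, sizes))
        if best is None or hits > best[0]:
            best = (hits, sizes)
    return best


streaming = list(range(64))   # no reuse: extra capacity is wasted here
looping = list(range(8)) * 8  # working set of 8 lines, traversed 8 times
hits, sizes = best_sizes([streaming, looping],
                         candidates=[1, 2, 4, 8, 16], budget=12)
uniform_hits = lru_hits(streaming, 4) + lru_hits(looping, 4)
print("chosen sizes:", sizes, "hits:", hits, "uniform 4+4 hits:", uniform_hits)
```

With the same budget, giving almost all capacity to the cache that serves the looping access pattern captures its reuse, whereas an even split captures none of it, which is the intuition behind a heterogeneous configuration.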

international workshop on opencl | 2014

KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters

Ethel Bardsley; Alastair F. Donaldson; John Wickerson

GPUVerify is a static analysis tool for verifying that GPU kernels are free from data races and barrier divergence. It is intended as an automatic tool, but its usability is impaired by the fact that the user must explicitly supply the kernel source code, the number of work items and work groups, and preconditions on key kernel arguments. Extracting this information from non-trivial OpenCL applications is laborious and error-prone. We describe an extension to GPUVerify, called KernelInterceptor, that automates the extraction of this information from a given OpenCL application. After recompiling the application having included an additional header file, and linking with an additional library, KernelInterceptor is able to detect each dynamic kernel launch and record the values of the various parameters in a series of log files. GPUVerify can then be invoked to examine these log files and verify each kernel instance. We explain how the interception mechanism works, and comment on the extent to which it improves the usability of GPUVerify.

field programmable custom computing machines | 2016

Loop Splitting for Efficient Pipelining in High-Level Synthesis

Junyi Liu; John Wickerson; George A. Constantinides

Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or non-uniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows generated loops to run with a minimal interval, automatically inserting statically-determined parametric pipeline breaks at those iterations violating dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4.3× in terms of clock cycles per iteration. The optimized pipelines consume 2.0× as many LUTs, 1.8× as many registers, and 1.1× as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2.
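A much simplified picture of loop splitting, assuming a single uniform dependence whose distance `m` is only known at run time: iteration `i` reads the value written by iteration `i - m`. The sketch below is not the paper's polyhedral transformation. It models fully overlapped iterations by issuing all reads of a chunk before any of its writes, and shows that this is safe only when the loop is cut into chunks of `m` iterations. All function names are hypothetical.

```python
def sequential(a, b, m):
    """Reference loop: iteration i reads the value written m iterations earlier."""
    a = list(a)
    for i in range(m, len(a)):
        a[i] = a[i - m] + b[i]
    return a


def overlapped_unsplit(a, b, m):
    """Overlap every iteration with no break: reads may see stale values."""
    a = list(a)
    loaded = [a[i - m] for i in range(m, len(a))]
    for i, v in zip(range(m, len(a)), loaded):
        a[i] = v + b[i]
    return a


def split(a, b, m):
    """Same loop, cut into chunks of m iterations with a break between chunks."""
    a = list(a)
    for start in range(m, len(a), m):
        chunk = range(start, min(start + m, len(a)))
        # Within a chunk no iteration reads another's result, so all reads can
        # be issued before any write (modelling fully overlapped iterations).
        loaded = [a[i - m] for i in chunk]
        for i, v in zip(chunk, loaded):
            a[i] = v + b[i]
        # Pipeline break: the next chunk starts only after these writes land.
    return a


a0, b0 = [1] * 8, [1] * 8
print(sequential(a0, b0, 2))
print(split(a0, b0, 2))
print(overlapped_unsplit(a0, b0, 2))
```

The split version agrees with the sequential loop for any `m >= 1`, while the unsplit overlapped version does not once the loop is longer than `2 * m` iterations.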
