
Publication


Featured research published by Hugh Leather.


local computer networks | 2007

Emergency Evacuation using Wireless Sensor Networks

Matthew Barnes; Hugh Leather; D. K. Arvind

This paper presents a distributed algorithm to direct evacuees to exits through arbitrarily complex building layouts in emergency situations. The algorithm finds the safest paths for evacuees, taking into account predictions of the relative movements of hazards, such as fires, and of the evacuees themselves. The algorithm is demonstrated on a 64-node wireless sensor network test platform and in simulation, and simulation results illustrate the navigation paths it finds.
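
A minimal, centralised sketch of the core idea, hazard-aware path search over a building graph, follows; it is not the paper's distributed sensor-network protocol, and the example graph, risk values, and hazard_penalty weighting are invented for illustration.

    import heapq

    def safest_path(graph, hazard_risk, start, exits, hazard_penalty=10.0):
        """graph: {node: {neighbour: distance}}; hazard_risk: {node: risk in [0, 1]}."""
        dist = {start: 0.0}
        prev = {}
        queue = [(0.0, start)]
        while queue:
            cost, node = heapq.heappop(queue)
            if node in exits:                       # cheapest-to-reach exit is the safest
                path = [node]
                while node in prev:
                    node = prev[node]
                    path.append(node)
                return list(reversed(path)), cost
            if cost > dist.get(node, float("inf")):
                continue
            for nxt, d in graph[node].items():
                # Combine physical distance with predicted hazard exposure at the next node.
                step = d + hazard_penalty * hazard_risk.get(nxt, 0.0)
                if cost + step < dist.get(nxt, float("inf")):
                    dist[nxt] = cost + step
                    prev[nxt] = node
                    heapq.heappush(queue, (cost + step, nxt))
        return None, float("inf")

    corridor = {"A": {"B": 1}, "B": {"A": 1, "C": 1, "D": 1}, "C": {"B": 1}, "D": {"B": 1}}
    risk = {"C": 0.9}                               # fire predicted to spread towards exit C
    print(safest_path(corridor, risk, start="A", exits={"C", "D"}))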


languages, compilers, and tools for embedded systems | 2009

Raced profiles: efficient selection of competing compiler optimizations

Hugh Leather; Michael F. P. O'Boyle; Bruce Worton

Many problems in embedded compilation require one set of optimizations to be selected over another based on run-time performance. Self-tuned libraries, iterative compilation, and machine learning techniques all compare multiple compiled program versions. In each, program versions are timed to determine which has the best performance. Because most performance measurements are noisy, each version must be run multiple times: enough runs to distinguish the versions despite the noise, but no more, since extra runs waste time and energy. The compiler writer must either risk taking too few runs, potentially getting incorrect results, or take too many, lengthening their experiments or reducing the number of program versions evaluated. Prior work chooses constant-size sampling plans in which each compiled version is executed a fixed number of times regardless of the level of noise. In this paper we develop a sequential sampling plan which adapts automatically to the experiment, so that the compiler writer can have confidence in the results while being sure that no more runs were taken than were needed. We show that our system correctly determines the best optimization settings with between 76% and 87% fewer runs than a brute-force, constant-sample-size approach. We also compare our approach to JavaSTATS [10]; we needed 77% to 89% fewer runs than it needed.
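
As a rough illustration of the racing idea, and not the authors' exact statistical machinery, the sketch below keeps sampling two program versions until a Welch's t-test separates them or a budget runs out; run_version(), the thresholds, and the noise model are hypothetical stand-ins.

    import random
    from scipy.stats import ttest_ind

    def run_version(which):
        # Stand-in for actually timing a compiled program version.
        base = 1.00 if which == "A" else 1.03
        return random.gauss(base, 0.02)

    def race(min_runs=5, max_runs=200, alpha=0.05):
        a, b = [], []
        for _ in range(min_runs):
            a.append(run_version("A")); b.append(run_version("B"))
        while len(a) < max_runs:
            _, p = ttest_ind(a, b, equal_var=False)    # Welch's t-test on the samples so far
            if p < alpha:                              # difference is statistically significant
                winner = "A" if sum(a) / len(a) < sum(b) / len(b) else "B"
                return winner, len(a) + len(b)
            a.append(run_version("A")); b.append(run_version("B"))
        return "inconclusive", len(a) + len(b)

    print(race())                                      # e.g. ('A', 24): far fewer runs than a fixed plan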


languages and compilers for parallel computing | 2014

Fast automatic heuristic construction using active learning

William F. Ogilvie; Pavlos Petoumenos; Zheng Wang; Hugh Leather

Building effective optimization heuristics is a challenging task which often takes developers several months, if not years, to complete. Predictive modelling has recently emerged as a promising solution, automatically constructing heuristics from training data; however, obtaining this data can take months per platform. This is becoming an ever more critical problem: if no solution is found, we shall be left with out-of-date heuristics which cannot extract the best performance from modern machines.
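
A minimal sketch of the active-learning loop this line of work relies on, assuming simple uncertainty sampling with a random forest (scikit-learn); the synthetic oracle, the candidate pool, and the query budget below are illustrative only.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    pool = rng.uniform(size=(500, 4))              # unlabeled candidate programs/settings
    oracle = pool[:, 0] + pool[:, 1] > 1.0         # hidden "which optimization wins" answer

    labeled = [int(np.argmax(oracle)), int(np.argmin(oracle))]   # seed with one example of each class
    labeled += [int(i) for i in rng.choice(len(pool), 8, replace=False)]
    for _ in range(30):                            # 30 costly profiling runs instead of 500
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(pool[labeled], oracle[labeled])
        uncertainty = np.abs(model.predict_proba(pool)[:, 1] - 0.5)
        uncertainty[labeled] = np.inf              # never re-query an already-profiled point
        labeled.append(int(np.argmin(uncertainty)))  # profile the candidate the model is least sure about

    print("heuristic accuracy:", model.score(pool, oracle))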


symposium on code generation and optimization | 2017

Minimizing the cost of iterative compilation with active learning

William F. Ogilvie; Pavlos Petoumenos; Zheng Wang; Hugh Leather

Since performance is not portable between platforms, engineers must fine-tune heuristics for each processor in turn. This is such a laborious task that high-profile compilers, supporting many architectures, cannot keep up with hardware innovation and are actually out-of-date. Iterative compilation driven by machine learning has been shown to be efficient at generating portable optimization models automatically. However, good quality models require costly, repetitive, and extensive training which greatly hinders the wide adoption of this powerful technique. In this work, we show that much of this cost is spent collecting training data, runtime measurements for different optimization decisions, which contribute little to the final heuristic. Current implementations evaluate randomly chosen, often redundant, training examples a pre-configured, almost always excessive, number of times — a large source of wasted effort. Our approach optimizes not only the selection of training examples but also the number of samples per example, independently. To evaluate, we construct 11 high-quality models which use a combination of optimization settings to predict the runtime of benchmarks from the SPAPT suite. Our novel, broadly applicable, methodology is able to reduce the training overhead by up to 26x compared to an approach with a fixed number of sample runs, transforming what is potentially months of work into days.
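
One piece of the cost saving, taking only as many timing samples per configuration as the noise actually requires, can be sketched as below; measure(), the tolerance, and the noise levels are placeholders rather than the paper's method.

    import random, statistics

    def measure(noise):
        return random.gauss(10.0, noise)              # fake runtime measurement of one configuration

    def adaptive_mean(noise, rel_tol=0.01, min_runs=3, max_runs=100):
        samples = [measure(noise) for _ in range(min_runs)]
        while len(samples) < max_runs:
            mean = statistics.mean(samples)
            sem = statistics.stdev(samples) / len(samples) ** 0.5
            if sem / mean < rel_tol:                  # mean runtime is known precisely enough, stop
                break
            samples.append(measure(noise))
        return statistics.mean(samples), len(samples)

    print("quiet machine:", adaptive_mean(noise=0.05))   # needs only a handful of runs
    print("noisy machine:", adaptive_mean(noise=1.00))   # automatically takes more runs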


international conference on parallel architectures and compilation techniques | 2017

End-to-End Deep Learning of Optimization Heuristics

Christopher Cummins; Pavlos Petoumenos; Zheng Wang; Hugh Leather

Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect. Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts. We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.
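
A minimal PyTorch sketch of the "heuristics over raw code" idea: a network that reads token ids directly and emits an optimization decision, with no hand-crafted features. The vocabulary size, layer sizes, and two-way decision are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class CodeHeuristic(nn.Module):
        def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=64, num_choices=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.classify = nn.Linear(hidden_dim, num_choices)   # e.g. CPU vs GPU mapping

        def forward(self, token_ids):                 # token_ids: (batch, seq_len) of code tokens
            embedded = self.embed(token_ids)
            _, (hidden, _) = self.lstm(embedded)      # final hidden state summarises the program
            return self.classify(hidden[-1])          # logits over optimization decisions

    # One (untrained) prediction for a toy "program" of 20 token ids.
    model = CodeHeuristic()
    tokens = torch.randint(0, 128, (1, 20))
    print(model(tokens).softmax(dim=-1))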


symposium on code generation and optimization | 2017

Synthesizing benchmarks for predictive modeling

Christopher Cummins; Pavlos Petoumenos; Zheng Wang; Hugh Leather

Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time, the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state-of-the-art predictive model by 1.27x. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30x.
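
A heavily simplified stand-in for the synthesis idea: learn a token-level model of how code in a corpus looks, then sample new programs from it. CLgen itself trains a deep language model over large OpenCL corpora, so the toy bigram model and two-program "corpus" below are illustrative only.

    import random
    from collections import defaultdict

    corpus = [
        "kernel void add ( global float * a , global float * b ) { a [ get_global_id ( 0 ) ] += b [ get_global_id ( 0 ) ] ; }",
        "kernel void scale ( global float * a , float k ) { a [ get_global_id ( 0 ) ] *= k ; }",
    ]

    model = defaultdict(list)                          # token -> tokens observed to follow it
    for program in corpus:
        tokens = program.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            model[cur].append(nxt)

    def sample(max_tokens=40):
        token, out = "kernel", ["kernel"]
        while token in model and len(out) < max_tokens:
            token = random.choice(model[token])        # walk the learned model to emit a new program
            out.append(token)
        return " ".join(out)

    print(sample())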


international conference on parallel and distributed systems | 2015

Power Capping: What Works, What Does Not

Pavlos Petoumenos; Lev Mukhanov; Zheng Wang; Hugh Leather; Dimitrios S. Nikolopoulos

Peak power consumption is the first-order design constraint of data centers. Though peak power consumption is rarely, if ever, observed, the entire data center facility must prepare for it, leading to inefficient usage of its resources. The most prominent way of addressing this issue is to limit the power consumption of the data center IT facility far below its theoretical peak value. Many approaches have been proposed to achieve that, based on the same small set of enforcement mechanisms, but there has been no corresponding work on systematically examining the advantages and disadvantages of each such mechanism. In the absence of such a study, it is unclear which mechanism is optimal for a given computing environment, and an inappropriate choice can lead to unnecessarily poor performance. This paper fills this gap by comparing, for the first time, five widely used power capping mechanisms under the same hardware/software setting. We also explore possible alternative power capping mechanisms beyond what has been previously proposed and evaluate them under the same setup. We systematically analyze the strengths and weaknesses of each mechanism in terms of energy efficiency, overhead, and predictable behavior. We show how these mechanisms can be combined to implement an optimal power capping mechanism which reduces the slowdown, compared to the most widely used mechanism, by up to 88%. Our results provide interesting insights into the different trade-offs of power capping techniques, which will be useful for designing and implementing highly efficient power capping in the future.
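
Two commonly used Linux power-capping knobs, an RAPL package power limit (powercap sysfs) and a CPU frequency ceiling (cpufreq), are representative of the kinds of enforcement mechanisms such a comparison covers; a sketch of driving them follows. The exact sysfs paths differ between machines, the writes need root, and the values are arbitrary examples.

    RAPL_LIMIT = "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw"
    CPUFREQ_MAX = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_max_freq"

    def set_rapl_power_limit(watts):
        with open(RAPL_LIMIT, "w") as f:              # limit enforced by hardware/firmware
            f.write(str(int(watts * 1_000_000)))      # this interface takes microwatts

    def cap_cpu_frequency(khz, num_cpus):
        for cpu in range(num_cpus):                   # limit enforced by the DVFS governor
            with open(CPUFREQ_MAX.format(cpu=cpu), "w") as f:
                f.write(str(khz))                     # cpufreq values are in kHz

    if __name__ == "__main__":
        set_rapl_power_limit(45)                      # cap the package at 45 W
        cap_cpu_frequency(2_000_000, num_cpus=4)      # cap 4 cores at 2.0 GHz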


languages, compilers, and tools for embedded systems | 2012

Efficiently parallelizing instruction set simulation of embedded multi-core processors using region-based just-in-time dynamic binary translation

Stephen C. Kyle; Igor Böhm; Björn Franke; Hugh Leather; Nigel P. Topham

Embedded systems, as typified by modern mobile phones, are already seeing a drive toward using multi-core processors. The number of cores will likely increase rapidly in the future. Engineers and researchers need to be able to simulate systems as they are expected to be in a few generations' time, running simulations of many-core devices on today's multi-core machines. These requirements place heavy demands on the scalability of simulation engines, the fastest of which have typically evolved from just-in-time (JIT) dynamic binary translators (DBT). Existing work aimed at parallelizing DBT simulators has focused exclusively on trace-based DBT, wherein linear execution traces, or perhaps trees thereof, are the units of translation. Region-based DBT simulators have not received the same attention and require different techniques than their trace-based cousins. In this paper we develop an innovative approach to scaling multi-core, embedded simulation through region-based DBT. We initially modify the JIT code generator of such a simulator to emit code that does not depend on a particular thread with its thread-specific context and is, therefore, thread-agnostic. We then demonstrate that this thread-agnostic code generation is comparable to thread-specific code with respect to performance, but also enables the sharing of JIT-compiled regions between different threads. This sharing optimisation, in turn, leads to significant performance improvements for multi-threaded applications. In fact, our results confirm that an average of 76% of all JIT-compiled regions can be shared between 128 threads in representative, parallel workloads. We demonstrate that this translates into an overall performance improvement of 1.44x on average, and up to 2.40x, across 12 multi-threaded benchmarks taken from the SPLASH-2 benchmark suite, targeting our high-performance multi-core DBT simulator for embedded ARC processors running on a 4-core Intel host machine.
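
The sharing idea can be caricatured in a few lines of Python (the real system emits native code from a JIT): a translated region takes the per-thread CPU state as an explicit argument instead of capturing it, so one entry in the translation cache serves every thread. All names and the fake region body are illustrative.

    translation_cache = {}                             # region start PC -> callable

    def translate_region(pc):
        # Stand-in for the JIT: produce a thread-agnostic function for this region.
        def region(cpu_state):                         # per-thread state passed in, never baked in
            cpu_state["r0"] += 1
            cpu_state["pc"] = pc + 4
        return region

    def execute(cpu_state):
        pc = cpu_state["pc"]
        if pc not in translation_cache:                # translate a region once...
            translation_cache[pc] = translate_region(pc)
        translation_cache[pc](cpu_state)               # ...then share it across every simulated thread

    thread_a = {"pc": 0x1000, "r0": 0}
    thread_b = {"pc": 0x1000, "r0": 7}
    execute(thread_a); execute(thread_b)               # both threads reuse the same translation
    print(thread_a, thread_b, len(translation_cache))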


ieee international symposium on workload characterization | 2014

Measuring QoE of interactive workloads and characterising frequency governors on mobile devices

Volker Seeker; Pavlos Petoumenos; Hugh Leather; Bjoern Franke

Mobile computing devices such as smartphones and tablets have become tightly integrated into many people's lives, both at work and at home. Users spend large amounts of time interacting with their mobile device and demand an excellent user experience in terms of responsiveness, whilst simultaneously expecting a long battery life between charging cycles. Frequency governors, responsible for increasing or decreasing the CPU clock frequency depending on the current workload and external events, try to balance the two contrasting goals of high performance and low energy consumption. However, despite their critical role in providing energy efficiency, it is difficult to measure the effectiveness of frequency governors in an interactive environment. In this paper we develop a novel methodology for creating repeatable, fully automated, realistic workloads that can accurately measure the time lag in interactive applications resulting from non-optimally selected operating frequencies. We also introduce a new metric capturing the user experience for different Android frequency governors. We evaluate interactive workloads to demonstrate how our approach enables us to automatically record and replay sequences of user interactions for different system configurations. We demonstrate that none of the available Android frequency governors performs particularly well; all leave substantial room for improvement. We show that energy savings of up to 27% are possible whilst delivering a user experience that is better than that provided by the standard Android frequency governor. We also show that it is possible to save 47% energy with performance that is indistinguishable from permanently running the CPU at the highest frequency.
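
The record-and-replay part can be sketched with nothing more than adb's input injection: replay a timestamped list of touch events so every governor configuration sees identical input. The coordinates and timings below are invented, and a real harness would replay raw getevent traces rather than high-level taps.

    import subprocess, time

    recorded_taps = [(0.0, 540, 960), (1.2, 540, 1400), (2.5, 300, 200)]   # (seconds, x, y)

    def replay(taps):
        start = time.monotonic()
        for t, x, y in taps:
            delay = t - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)                      # preserve the original pacing of the user
            subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

    if __name__ == "__main__":
        replay(recorded_taps)                          # run once per governor under test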


international conference on parallel architectures and compilation techniques | 2012

MaSiF: machine learning guided auto-tuning of parallel skeletons

Alexander Collins; Christian Fensch; Hugh Leather

Parallel skeletons provide a predefined set of parallel templates that can be combined, nested and parameterized with sequential code to produce complex parallel programs. The implementation of each skeleton includes parameters that have a significant effect on performance, so carefully tuning them is vital. The optimization space formed by these parameters is complex, non-linear, exhibits multiple local optima and is program dependent. This makes manual tuning impractical. Effective automatic tuning is therefore essential for the performance of parallel skeleton programs. In this paper we present MaSiF, a novel tool to auto-tune the parallelization parameters of skeleton parallel programs. It reduces the size of the parameter space using a combination of machine learning, via nearest neighbor classification, and linear dimensionality reduction using Principal Components Analysis. To auto-tune a new program, a set of program features is determined statically and used to compute its k nearest neighbors from a set of training programs. Previously collected performance data for the nearest neighbors is used to reduce the size of the search space using Principal Components Analysis. Good parallelization parameters are found quickly by searching this smaller search space. We evaluate MaSiF for two existing parallel frameworks: Threading Building Blocks and FastFlow. MaSiF achieves 89% of the performance of the oracle on average. This exploration requires just 45 parameter values on average, which is ~0.05% of the optimization space. In contrast, a state-of-the-art machine learning approach achieves 51%. MaSiF achieves an average speedup of 1.32× over parallelization parameters chosen by human experts.
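
A compact sketch of the two-stage space reduction described above, k nearest neighbors over static program features followed by PCA over the neighbors' good configurations, using scikit-learn; all data here is synthetic and the dimensions are arbitrary.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    train_features = rng.uniform(size=(40, 6))         # static features of the training programs
    train_best_configs = rng.uniform(size=(40, 8))     # their previously collected good tuning parameters

    def reduced_candidates(new_features, k=5, n_components=2, n_samples=20):
        knn = NearestNeighbors(n_neighbors=k).fit(train_features)
        _, idx = knn.kneighbors(new_features.reshape(1, -1))
        pca = PCA(n_components=n_components).fit(train_best_configs[idx[0]])
        # Sample points in the low-dimensional space and map them back to full configurations.
        low_dim = rng.normal(size=(n_samples, n_components))
        return pca.inverse_transform(low_dim)

    candidates = reduced_candidates(rng.uniform(size=6))
    print(candidates.shape)                            # only these candidates get benchmarked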
