Christian Fensch
University of Edinburgh
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Christian Fensch.
international conference on parallel architectures and compilation techniques | 2010
Nick Barrow-Williams; Christian Fensch; Simon W. Moore
Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, providing new design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, have been developed in the context of multi-node systems. As such, they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. Such an optimization is made possible by the comparable cost of local cache accesses with the use of on-chip network resources. Coherency is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32 core system. Our extension lowers the latency of L1 cache load misses by up to 32% while reducing the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence provides execution time improvements of up to 13%, reduces cache hierarchy energy consumption by up to 30% and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.
high-performance computer architecture | 2008
Christian Fensch; Marcelo Cintra
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence. The proposed mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. We evaluate the proposed tiled CMP architecture on the Splash-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed directory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0%, and 16% on average, compared to the cache coherent architecture across all benchmarks for 16 and 32 processors.
high performance embedded architectures and compilers | 2013
Thibaut Lutz; Christian Fensch; Murray Cole
GPGPUs are a powerful and energy-efficient solution for many problems. For higher performance or larger problems, it is necessary to distribute the problem across multiple GPUs, increasing the already high programming complexity. In this article, we focus on abstracting the complexity of multi-GPU programming for stencil computation. We show that the best strategy depends not only on the stencil operator, problem size, and GPU, but also on the PCI express layout. This adds nonuniform characteristics to a seemingly homogeneous setup, causing up to 23% performance loss. We address this issue with an autotuner that optimizes the distribution across multiple GPUs.
international conference on functional programming | 2015
Michel Steuwer; Christian Fensch; Sam Lindley; Christophe Dubach
Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance or is written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach aiming to combine high-level programming, code portability, and high-performance. Starting from a high-level functional expression we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed lambda-calculus along with a denotational semantics which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler which generates high performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple functional high-level algorithmic expressions offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
international workshop on runtime and operating systems for supercomputers | 2015
Alexander Collins; Tim Harris; Murray Cole; Christian Fensch
Running multiple parallel programs on multi-socket multi-core machines using commodity hardware is increasingly common for data analytics and cluster workloads. These workloads exhibit bursty behavior and are rarely tuned to specific hardware. This leads to poor performance due to suboptimal decisions, such as poor choices for which programs run on the same socket. Consequently, there is a renewed importance for schedulers to consider the structure of the machine alongside the dynamic behavior of workloads. This paper introduces LIRA, a spatial-scheduling heuristic for selecting which parallel applications should run on the same socket in a multi-socket machine. We devise two flavors of scheduler using this heuristic: (i) LIRA-static which collects performance data in an offline profiling step to decide the schedule when a program starts, and (ii) LIRA-adaptive which operates dynamically based on hardware performance counters available on off-the-shelf hardware. LIRA-adaptive does not require separate, offline workload characterization runs, and it accommodates a dynamically changing mix of applications, including those with phase changes. We evaluate LIRA-static and LIRA-adaptive using programs from SPEC OMP and two graph analytics projects. We compare our approaches to the best possible performance obtained across all static mappings of 4 programs to 2 sockets, the libgomp OpenMP runtime that comes with GCC and Callisto, a state-of-the-art scheduler. LIRA-static improves system throughput by 10% compared to libgomp, and LIRA-adaptive improves system throughput by 13%. Compared to Callisto, LIRA-adaptive improves performance in 30 of the 32 combinations tested, with an improvement in system throughput of up to 7%, and 3% on average over 32 combinations.
IEEE Transactions on Computers | 2013
Christian Fensch; Nick Barrow-Williams; Robert D. Mullins; Simon W. Moore
Many-core architectures provide an efficient way of harnessing the growing numbers of transistors available. However, energy and latency costs of communication increasingly limit the parallel programs running on these platforms. Existing designs provide a functional communication layer, but not necessarily the most efficient solution. Due to power limitations, efficiency is now a primary concern that motivates us to look again at cache coherence. First, we analyze the communication behavior of parallel applications. The observed sharing patterns reveal considerable locality of shared data accesses between threads with consecutive IDs. This pattern corresponds to strong physical locality between adjacent cores in a chip-multiprocessor (CMP). This paper explores the design of Proximity Coherence: a novel scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links. We exploit these patterns and improve the efficiency of communication. The results show that careful analysis leads to the design of a more efficient coherence protocol. The protocol reduces the latency of load misses by up to 33 percent (17 percent, on average), improving overall execution time by up to 13 percent. Furthermore, it also reduces network-on-chip traffic by 19 percent and energy consumption by up to 30 percent.
acm sigplan symposium on principles and practice of parallel programming | 2015
Thibaut Lutz; Christian Fensch; Murray Cole
State of the art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad-hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach weak or even unviable for applications involving more than a small number of kernel invocations or a highly dynamic control flow, leaving substantial potential opportunities unexplored. It also leads to an over complex, hard to maintain code base. We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter-and intra-kernel optimization. Helium is implemented as preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations, on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains at least the same, and frequently even better performance, than carefully handtuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernel cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand optimized code.
international conference on parallel architectures and compilation techniques | 2012
Alexander Collins; Christian Fensch; Hugh Leather
Parallel skeletons provide a predefined set of parallel templates that can be combined, nested and parameterized with sequential code to produce complex parallel programs. The implementation of each skeleton includes parameters that have a significant effect on performance; so carefully tuning them is vital. The optimization space formed by these parameters is complex, non-linear, exhibits multiple local optima and is program dependent. This makes manual tuning impractical. Effective automatic tuning is therefore essential for the performance of parallel skeleton programs. In this paper we present MaSiF, a novel tool to auto-tune the parallelization parameters of skeleton parallel programs. It reduces the size of the parameter space using a combination of machine learning, via nearest neighbor classification, and linear dimensionality reduction using Principal Components Analysis. To auto-tune a new program, a set of program features is determined statically and used to compute k nearest neighbors from a set of training programs. Previously collected performance data for the nearest neighbors is used to reduce the size of the search space using Principal Components Analysis. Good parallelization parameters are found quickly by searching this smaller search space. We evaluate MaSiF for two existing parallel frameworks: Threading Building Blocks and FastFlow. MaSiF achieves 89% of the performance of the oracle on average. This exploration requires just 45 parameters values on average, which is ~0.05% of the optimization space. In contrast, a state-of-the-art machine learning approach achieves 51%. MaSiF achieves an average speedup of 1.32× over parallelization parameters chosen by human experts.
Parallel Processing Letters | 2012
Alexander Collins; Christian Fensch; Hugh Leather
Parallel skeletons are a structured parallel programming abstraction that provide programmers with a predefined set of algorithmic templates that can be combined, nested and parameterized with sequential code to produce complex programs. The implementation of these skeletons is currently a manual process, requiring human expertise to choose suitable implementation parameters that provide good performance. This paper presents an empirical exploration of the optimization space of the FastFlow parallel skeleton framework. We performed this using a Monte Carlo search of a random subset of the space, for a representative set of platforms and programs. The results show that the space is program and platform dependent, non-linear, and that automatic search achieves a significant average speedup in program execution time of 1.6× over a human expert. An exploratory data analysis of the results shows a linear dependence between two of the parameters, and that another two parameters have little effect on performance. These properties are then used to reduce the size of the space by a factor of 6, reducing the cost of the search. This provides a starting point for automatically optimizing parallel skeleton programs without the need for human expertise, and with a large improvement in execution time compared to that achievable using human expert tuning.
ieee international conference on high performance computing, data, and analytics | 2013
Alexander Collins; Christian Fensch; Hugh Leather; Murray Cole
Parallel skeletons provide a predefined set of parallel templates that can be combined, nested and parameterized with sequential code to produce complex parallel programs. The implementation of each skeleton includes parameters that have a significant effect on performance; so carefully tuning them is vital. The optimization space formed by these parameters is complex, non-linear, exhibits multiple local optima and is program dependent. This makes manual tuning impractical. Effective automatic tuning is therefore essential for the performance of parallel skeleton programs. In this paper we present MaSiF, a novel tool to auto-tune the parallelization parameters of skeleton parallel programs. It reduces the size of the parameter space using a combination of machine learning, via nearest neighbor classification, and linear dimensionality reduction using Principal Components Analysis. To auto-tune a new program, a set of program features is determined statically and used to compute k nearest neighbors from a set of training programs. Previously collected performance data for the nearest neighbors is used to reduce the size of the search space using Principal Components Analysis. Good parallelization parameters are found quickly by searching this smaller search space. We evaluate MaSiF for two existing parallel frameworks: Threading Building Blocks and FastFlow. MaSiF achieves 89% of the performance of the oracle on average. This exploration requires just 45 parameters values on average, which is ~0.05% of the optimization space. In contrast, a state-of-the-art machine learning approach achieves 51%. MaSiF achieves an average speedup of 1.32× over parallelization parameters chosen by human experts.