Nigel P. Topham | Researchain

Publication

Featured research published by Nigel P. Topham.

international symposium on computer architecture | 2000

Multiple-banked register file architectures

José-Lorenzo Cruz; Antonio González; Mateo Valero; Nigel P. Topham

The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked register file, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the register file access time is factored in.

ieee international conference on high performance computing data and analytics | 1995

Performance of the decoupled ACRI-1 architecture: the perfect club

Nigel P. Topham; Kenneth McDougall

This paper examines the performance potential of decoupled computer architectures on real-world codes, and includes the first performance bounds calculations to be published for the highly-decoupled ACRI-1 computer architecture. It also constitutes the first published work to report on the effectiveness of a decoupling Fortran90 compiler. Decoupling is an architectural optimisation which offers very high sustained performance through large-scale latency hiding. This paper investigates the applicability of access and control decoupling to real-world codes. We illustrate this with compiler-generated decoupling optimisations for the Perfect Club benchmark suite on the Advanced Computer Research Institute's ACRI-1 system, utilising the frequency of loss of decoupling (LOD) events as a measure of the effectiveness of decoupling for each code. We derive bounds for the performance of these codes and show that, whilst some exhibit performance roughly equivalent to that on vector computers, others exhibit considerably higher performance potential in a decoupled system.

international conference on supercomputing | 1997

Eliminating cache conflict misses through XOR-based placement functions

Antonio González; Mateo Valero; Nigel P. Topham; Joan-Manuel Parcerisa

This paper makes the case for the use of XOR-based placement functions for cache memories. It shows that these XOR-mapping schemes can eliminate many conflict misses for direct-mapped and victim caches and practically all of them for (pseudo) two-way associative organizations. The paper evaluates the performance of XOR-mapping schemes for a number of different cache organizations: direct-mapped, set-associative, victim, hash-rehash, column-associative and skewed-associative. It also proposes novel replacement policies for some of these cache organizations. In particular, it presents a low-cost implementation of a pure LRU replacement policy which demonstrates a significant improvement over the pseudo-LRU replacement previously proposed. The paper shows that for an 8 Kbyte data cache, XOR-mapping schemes approximately halve the miss ratio for two-way associative and column-associative organizations. Skewed-associative caches, which already make use of XOR-mapping functions, can benefit from the LRU replacement and also from the use of more sophisticated mapping functions. For two-way associative, column-associative and two-way skewed-associative organizations, XOR-mapping schemes achieve a miss ratio that is not higher than 1.10 times that of a fully-associative cache. XOR-mapping schemes also provide a very significant reduction in the miss ratio for the other cache organizations, including the direct-mapped cache. Ultimately, the conclusion of this study is that XOR-based placement functions unequivocally provide highly significant performance benefits to most cache organizations.
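As a rough illustration of the idea behind XOR-based placement, the sketch below compares a conventional bit-selection index with a simple XOR fold for an 8 Kbyte direct-mapped cache. The 32-byte line size, the function names and the particular fold (index bits XORed with the next-higher group of address bits) are assumptions made for this example; the abstract does not specify the exact functions the paper evaluates.

```python
# Sketch only: the line size and the XOR fold are illustrative assumptions,
# not the placement functions evaluated in the paper.
LINE_BYTES = 32                              # assumed line size
CACHE_BYTES = 8 * 1024                       # 8 Kbyte cache, as in the abstract
NUM_SETS = CACHE_BYTES // LINE_BYTES         # 256 sets when direct-mapped
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1       # 8 bits of set index
INDEX_MASK = NUM_SETS - 1


def conventional_index(addr: int) -> int:
    """Bit selection: use the low-order block-address bits as the set index."""
    return (addr >> OFFSET_BITS) & INDEX_MASK


def xor_index(addr: int) -> int:
    """XOR the usual index bits with the next INDEX_BITS bits of the address."""
    block = addr >> OFFSET_BITS
    low = block & INDEX_MASK
    high = (block >> INDEX_BITS) & INDEX_MASK
    return low ^ high


# Eight addresses spaced exactly one cache size apart: a classic conflict pattern.
addresses = [i * CACHE_BYTES for i in range(8)]
conventional_sets = {conventional_index(a) for a in addresses}
xor_sets = {xor_index(a) for a in addresses}

print("conventional sets used:", len(conventional_sets))
print("XOR-mapped sets used:  ", len(xor_sets))
```

With bit selection, all eight addresses fall into the same set and evict one another in a direct-mapped cache; with the XOR fold they land in eight different sets.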

european conference on parallel processing | 1997

A Limitation Study into Access Decoupling

G. P. Jones; Nigel P. Topham

This paper presents a study into the theoretical limits of a latency hiding technique called access decoupling. Access decoupling is effective at hiding memory latency for low ILP and conservative dependency analysis [9,12,13]. We assess whether this result still applies for maximum ILP and perfect dependency analysis.

IEEE Transactions on Computers | 1999

Randomized cache placement for eliminating conflicts

Nigel P. Topham; Antonio González

Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels to illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.

international conference on supercomputing | 1993

The effectiveness of decoupling

Peter L. Bird; Alasdair Rawsthorne; Nigel P. Topham

This paper examines the effectiveness of decoupling as an optimization technique for high-performance computer architectures. Decoupled access/execute architectures are described, and the concept of control decoupling is introduced and justified. A description of a highly-decoupled architecture is given, and a metric for the effectiveness of decoupling on particular programs, the Loss of Decoupling frequency, is introduced. Finally, a number of real benchmark programs are examined and the applicability of decoupling to them is analyzed.

high-performance computer architecture | 1999

Distributed modulo scheduling

Marcio Merino Fernandes; Josep Llosa; Nigel P. Topham

Wide-issue ILP machines can be built using the VLIW approach as many of the hardware complexities found in superscalar processors can be transferred to the compiler. However, the scalability of VLIW architectures is still constrained by the size and number of ports of the register file required by a large number of functional units. Organizations composed of clusters of a few functional units and small private register files have been proposed to deal with this problem; an approach highly dependent on scheduling and partitioning strategies. The paper presents DMS, an algorithm that integrates modulo scheduling and code partitioning in a single procedure. Experimental results have shown that the algorithm is effective for configurations up to 8 clusters, or even more when targeting vectorizable loops.

international conference on robotics and automation | 2015

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Luigi Nardi; Bruno Bodin; M. Zeeshan Zia; John Mawer; Andy Nisbet; Paul H. J. Kelly; Andrew J. Davison; Mikel Luján; Michael F. P. O'Boyle; Graham D. Riley; Nigel P. Topham; Stephen B. Furber

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.

high performance embedded architectures and compilers | 2008

High Speed CPU Simulation Using LTU Dynamic Binary Translation

Daniel Jones; Nigel P. Topham

In order to increase the speed of simulators based on dynamic binary translation, we consider the translation of large translation units consisting of multiple blocks. In contrast to other simulators, which translate hot blocks or pages, the techniques presented in this paper profile the target program's execution path at runtime. The identification of hot paths ensures that only executed code is translated whilst at the same time offering greater scope for optimization. Mean performance figures for the functional simulation of EEMBC benchmarks show the new simulation techniques to be at least 63% faster than basic-block-based dynamic binary translation.

international symposium on microarchitecture | 1997

The design and performance of a conflict-avoiding cache

Nigel P. Topham; Antonio González; José González

High performance architectures depend heavily on efficient multi-level memory hierarchies to minimize the cost of accessing data. This dependence will increase with the expected increases in relative distance to main memory. There have been a number of published proposals for cache conflict-avoidance schemes. We investigate the design and performance of conflict-avoiding cache architectures based on polynomial modulus functions, which earlier research has shown to be highly effective at reducing conflict miss ratios. We examine a number of practical implementation issues and present experimental evidence to support the claim that pseudo-randomly indexed caches are both effective in performance terms and practical from an implementation viewpoint.
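To make the notion of a polynomial modulus index more concrete, here is a minimal sketch that treats the block address as a polynomial over GF(2) and reduces it modulo a fixed irreducible polynomial to obtain an 8-bit set index. The choice of polynomial (x^8 + x^4 + x^3 + x + 1), the index width and the helper names are assumptions for illustration only; the abstract does not state which polynomials or cache sizes the paper uses.

```python
# Sketch only: polynomial, index width and helper names are assumed,
# not taken from the paper.
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def poly_mod(value: int, poly: int = POLY) -> int:
    """Remainder of value(x) divided by poly(x), with coefficients in GF(2)."""
    degree = poly.bit_length() - 1
    while value.bit_length() > degree:
        # Cancel the current leading term by XORing in a shifted copy of poly.
        value ^= poly << (value.bit_length() - 1 - degree)
    return value


def poly_index(block_address: int) -> int:
    """Set index in the range 0..255 for a cache with 256 sets."""
    return poly_mod(block_address)


def distinct_sets(stride_log2: int, count: int = 256) -> int:
    """Number of different sets touched by `count` blocks at a power-of-two stride."""
    return len({poly_index(i << stride_log2) for i in range(count)})


for k in (0, 4, 8, 11):
    print(f"stride 2^{k} blocks -> {distinct_sets(k)} distinct sets")
```

Because the modulus is irreducible, multiplying by a power of x permutes the 256 residues, so 256 consecutive accesses at any power-of-two stride map to 256 different sets in this sketch. A conventional bit-selection index would map a stride of 256 blocks to a single set.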
