Publication


Featured research published by Christopher J. Hughes.


international symposium on computer architecture | 2001

Speculative precomputation: long-range prefetching of delinquent loads

Jamison D. Collins; Hong Wang; Dean M. Tullsen; Christopher J. Hughes; Yong-Fong Lee; Daniel M. Lavery; John Paul Shen

This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
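
The paper's mechanism lives in spare hardware thread contexts of an SMT Itanium research processor, but the core idea can be illustrated in plain software terms: a helper thread runs a stripped-down slice of the main loop and prefetches the delinquent load's targets ahead of the main computation. The sketch below is only such an analogue, assuming a pointer-chasing workload and GCC/Clang's __builtin_prefetch; it is not the paper's thread-spawning or chaining mechanism.

```cpp
// Software analogue of speculative precomputation: a helper thread runs a
// stripped-down "p-slice" of the main loop and prefetches the delinquent
// load's targets ahead of the main thread. The list workload and names are
// illustrative; the paper uses hardware SMT thread contexts, not std::thread.
#include <atomic>
#include <thread>

struct Node {
    Node* next;
    int   payload[15];   // pad so each node occupies its own cache line(s)
};

long sum_list(Node* head) {
    std::atomic<bool> done{false};

    // Helper ("speculative") thread: only chases pointers and issues prefetches.
    std::thread helper([&] {
        for (Node* p = head; p && !done.load(std::memory_order_relaxed); p = p->next)
            __builtin_prefetch(p->next);   // long-range prefetch of the delinquent load
    });

    // Main ("non-speculative") thread: the full computation, now mostly hitting in cache.
    long sum = 0;
    for (Node* p = head; p; p = p->next)
        sum += p->payload[0];

    done.store(true, std::memory_order_relaxed);
    helper.join();
    return sum;
}
```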


acm sigplan symposium on principles and practice of parallel programming | 2006

Hybrid transactional memory

Sanjeev Kumar; Michael Chu; Christopher J. Hughes; Partha Kundu; Anthony D. Nguyen

High performance parallel programs are currently difficult to write and debug. One major source of difficulty is protecting concurrent accesses to shared data with an appropriate synchronization mechanism. Locks are the most common mechanism but they have a number of disadvantages, including possibly unnecessary serialization, and possible deadlock. Transactional memory is an alternative mechanism that makes parallel programming easier. With transactional memory, a transaction provides atomic and serializable operations on an arbitrary set of memory locations. When a transaction commits, all operations within the transaction become visible to other threads. When it aborts, all operations in the transaction are rolled back. Transactional memory can be implemented in either hardware or software. A straightforward hardware approach can have high performance, but imposes strict limits on the amount of data updated in each transaction. A software approach removes these limits, but incurs high overhead. We propose a novel hybrid hardware-software transactional memory scheme that approaches the performance of a hardware scheme when resources are not exhausted and gracefully falls back to a software scheme otherwise.
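
As a rough illustration of the hybrid control flow (not the paper's actual protocol), the sketch below attempts a critical section as a hardware transaction a few times and then falls back to a software path, here reduced to a single global lock standing in for an STM. The hw_txn_begin/hw_txn_commit hooks are hypothetical stubs; on Intel TSX hardware they would map onto the RTM intrinsics shown with the TSX paper further down.

```cpp
// Sketch of hybrid transactional memory control flow: try the hardware path a
// bounded number of times, then degrade gracefully to a software fallback.
#include <mutex>

std::mutex fallback_lock;                      // software fallback (STM stand-in)

// Hypothetical HTM hooks: the stubs always report "no HTM available", so the
// sketch simply takes the software path when run as-is.
inline bool hw_txn_begin()  { return false; }
inline void hw_txn_commit() {}

template <class TxnBody>
void hybrid_transaction(TxnBody&& body) {
    constexpr int kMaxHwAttempts = 3;          // illustrative retry policy
    for (int attempt = 0; attempt < kMaxHwAttempts; ++attempt) {
        if (hw_txn_begin()) {                  // inside a hardware transaction
            body();                            // reads/writes tracked by hardware
            hw_txn_commit();                   // publish all updates atomically
            return;
        }
        // Hardware attempt failed (capacity overflow, conflict, ...): retry or fall back.
    }
    std::lock_guard<std::mutex> guard(fallback_lock);   // graceful software fallback
    body();
}
```

A caller would wrap its shared-data updates in a lambda, e.g. hybrid_transaction([&]{ account_a -= 10; account_b += 10; });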


international symposium on microarchitecture | 2001

Saving energy with architectural and frequency adaptations for multimedia applications

Christopher J. Hughes; Jayanth Srinivasan; Sarita V. Adve

General-purpose processors are expected to be increasingly employed for multimedia workloads on systems where reducing energy consumption is an important goal. Researchers have proposed the use of two forms of hardware adaptation - architectural adaptation and dynamic voltage (and frequency) scaling or DVS - to reduce energy. This paper develops and evaluates an integrated algorithm to control both architectural adaptation and DVS targeted to multimedia applications. It also examines the interaction between the two forms of adaptation, identifying when each will perform better in isolation and when the addition of architectural adaptation will benefit DVS. Our adaptation control algorithm is effective in saving energy and exploits most of the available potential. For the applications and systems studied, DVS is consistently better than architectural adaptation in isolation. The addition of architectural adaptation to DVS benefits some applications, but not all. Finally, in a seemingly counter-intuitive result, we find that while less aggressive architectures reduce energy for fixed frequency hardware, with DVS, more aggressive architectures are often more energy efficient.
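
A hedged sketch of the frame-level DVS half of such a controller: predict the next frame's execution time at the highest frequency, then pick the lowest operating point that still meets the frame deadline, eliminating idle slack. The frequency table, the CPU-bound scaling assumption, and the function names are illustrative; the paper's integrated algorithm additionally adapts architectural structures.

```cpp
// Illustrative frame-level DVS control: choose the slowest frequency whose
// scaled execution time still fits within the frame deadline.
#include <array>

constexpr std::array<double, 4> kFreqGHz = {0.6, 1.0, 1.4, 1.8};  // assumed DVS operating points

double choose_frequency(double predicted_ms_at_max, double frame_deadline_ms) {
    const double f_max = kFreqGHz.back();
    for (double f : kFreqGHz) {
        double scaled_ms = predicted_ms_at_max * (f_max / f);   // assume CPU-bound scaling
        if (scaled_ms <= frame_deadline_ms)
            return f;             // slowest setting that still meets the deadline
    }
    return f_max;                 // no slack: run at full speed
}
```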


ieee international conference on high performance computing data and analytics | 2013

Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Richard M. Yoo; Christopher J. Hughes; Konrad K. Lai; Ravi Rajwar

Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.
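
The usual way Intel TSX is applied to existing lock-based code is lock elision: the critical section is first attempted as a hardware (RTM) transaction, and only on abort is the lock actually acquired. Below is a minimal sketch, assuming an RTM-capable CPU and compilation with -mrtm; the retry policy and toy spinlock are simplified for illustration.

```cpp
// Minimal RTM lock-elision sketch (requires an RTM-capable CPU and -mrtm).
#include <immintrin.h>
#include <atomic>

std::atomic<int> lock_word{0};                 // 0 = free, 1 = held (toy spinlock)

template <class CriticalSection>
void elided_critical_section(CriticalSection&& cs) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (lock_word.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);                     // lock actually held: abort and take slow path
        cs();                                  // runs transactionally; conflicts cause an abort
        _xend();
        return;
    }
    // Fallback path: acquire the lock non-transactionally.
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
    cs();
    lock_word.store(0, std::memory_order_release);
}
```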


international parallel and distributed processing symposium | 2013

Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors

Simon J. Pennycook; Christopher J. Hughes; Mikhail Smelyanskiy; Stephen A. Jarvis

We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.
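
The gather side of the bottleneck can be seen in a single AVX2 intrinsic: eight neighbor indices are loaded from the neighbor list and the corresponding x coordinates fetched in one gather. This is an illustrative fragment assuming AVX2 (-mavx2), not the paper's optimised miniMD kernels.

```cpp
// Illustrative AVX2 gather for a neighbor-list force kernel: load 8 neighbour
// indices for atom i, then gather x[idx] for each lane in one instruction.
#include <immintrin.h>

__m256 gather_neighbor_x(const float* x, const int* neighbor_list,
                         int first_neighbor /* offset of atom i's neighbours */) {
    __m256i idx = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(neighbor_list + first_neighbor));
    return _mm256_i32gather_ps(x, idx, 4);   // scale = 4 bytes per float element
}
```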


international symposium on computer architecture | 2001

Variability in the execution of multimedia applications and implications for architecture

Christopher J. Hughes; Praful Kaul; Sarita V. Adve; Rohit Jain; Chanik Park; Jayanth Srinivasan

Multimedia applications are an increasingly important workload for general-purpose processors. This paper analyzes frame-level execution time variability for several multimedia applications on general-purpose architectures. There are two reasons for such an analysis. First, it has been conjectured that complex features of such architectures (e.g., out-of-order issue) result in unpredictable execution times, making them unsuitable for meeting real-time requirements of multimedia applications. Our analysis tests this conjecture. Second, such an analysis can be used to effectively employ recently proposed adaptive architectures. We find that while execution time varies from frame to frame for many multimedia applications, the variability is mostly caused by the application algorithm and the media input. Aggressive architectural features induce little additional variability (and unpredictability) in execution time, in contrast to conventional wisdom. The presence of frame-level execution time variability motivates frame-level architectural adaptation (e.g., to save energy). Additionally, our results show that execution time generally varies slowly, implying it is possible to dynamically predict the behavior of future frames on a variety of hardware configurations for effective adaptation.
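
The observation that execution time changes slowly from frame to frame is what makes simple history-based prediction workable for adaptation. A minimal sketch of such a predictor follows; the averaging window and interface are assumptions, not the paper's predictor.

```cpp
// History-based frame-time predictor: predict the next frame's execution time
// as the mean of the last few frames. Window size and API are illustrative.
#include <cstddef>
#include <deque>
#include <numeric>

class FrameTimePredictor {
public:
    explicit FrameTimePredictor(std::size_t window = 4) : window_(window) {}

    void record(double frame_ms) {
        history_.push_back(frame_ms);
        if (history_.size() > window_) history_.pop_front();
    }

    // Prediction for the next frame; uses a caller-supplied default until any
    // history has been recorded.
    double predict(double default_ms) const {
        if (history_.empty()) return default_ms;
        double sum = std::accumulate(history_.begin(), history_.end(), 0.0);
        return sum / static_cast<double>(history_.size());
    }

private:
    std::size_t window_;
    std::deque<double> history_;
};
```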


real time systems symposium | 2002

Soft real-time scheduling on simultaneous multithreaded processors

Rohit Jain; Christopher J. Hughes; Sarita V. Adve

Simultaneous multithreading (SMT) improves processor throughput by processing instructions from multiple threads each cycle. This is the first work to explore soft real-time scheduling on an SMT processor. Scheduling with SMT requires two decisions: (1) which threads to run simultaneously (the co-schedule), and (2) how to share processor resources among co-scheduled threads. We explore algorithms for both decisions for soft real-time multimedia applications, focusing more on co-schedule selection. We examine previous multiprocessor co-scheduling algorithms, including partitioning and global scheduling. We propose new variations that consider resource sharing and try to utilize SMT more effectively by exploiting application symbiosis. We find (using simulation) that the best algorithm uses global scheduling, exploits symbiosis, prioritizes high utilization tasks, and uses dynamic resource sharing. This algorithm, however, imposes significant profiling overhead and does not provide admission control. We propose alternatives to overcome these limitations, but at the cost of schedulability.
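
One ingredient of the best-performing algorithm, symbiosis-aware co-schedule selection, can be sketched as a greedy pairing over profiled pair scores. The score matrix and the greedy policy below are illustrative; the paper's algorithms also weigh utilization, deadlines, and resource sharing.

```cpp
// Greedy symbiosis-aware pairing: repeatedly co-schedule the pair of unpaired
// tasks with the best profiled combined-throughput ("symbiosis") score.
#include <utility>
#include <vector>

// symbiosis[i][j]: benefit of running tasks i and j together (higher is better).
std::vector<std::pair<int, int>>
greedy_coschedule(const std::vector<std::vector<double>>& symbiosis) {
    const int n = static_cast<int>(symbiosis.size());
    std::vector<bool> used(n, false);
    std::vector<std::pair<int, int>> pairs;

    while (true) {
        int best_i = -1, best_j = -1;
        double best = -1.0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (!used[i] && !used[j] && symbiosis[i][j] > best) {
                    best = symbiosis[i][j]; best_i = i; best_j = j;
                }
        if (best_i < 0) break;                 // fewer than two tasks remain
        used[best_i] = used[best_j] = true;
        pairs.emplace_back(best_i, best_j);
    }
    return pairs;
}
```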


architectural support for programming languages and operating systems | 2002

Joint local and global hardware adaptations for energy

Christopher J. Hughes; Sarita V. Adve

This work concerns algorithms to control energy-driven architecture adaptations for multimedia applications, without and with dynamic voltage scaling (DVS). We identify a broad design space for adaptation control algorithms based on two attributes: (1) when to adapt or temporal granularity and (2) what structures to adapt or spatial granularity. For each attribute, adaptation may be global or local. Our previous work developed a temporally and spatially global algorithm. It invokes adaptation at the granularity of a full frame of a multimedia application (temporally global) and considers the entire hardware configuration at a time (spatially global). It exploits inter-frame execution time variability, slowing computation just enough to eliminate idle time before the real-time deadline. This paper explores temporally and spatially local algorithms and their integration with the previous global algorithm. The local algorithms invoke architectural adaptation within an application frame to exploit intra-frame execution variability, and attempt to save energy without affecting execution time. We consider local algorithms previously studied for non-real-time applications as well as propose new algorithms. We find that, for systems without and with DVS, the local algorithms are effective in saving energy for multimedia applications, but the new integrated global and local algorithm is best for the systems and applications studied.
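
A structural sketch of how the two granularities compose: a global controller fixes a per-frame frequency and architecture target, and a local rule narrows the configuration within low-ILP intervals without moving the frame deadline. The threshold, the IPC signal, and the types below are illustrative assumptions, not the paper's algorithms.

```cpp
// Per-frame (global) decision plus per-interval (local) refinement.
struct GlobalDecision { double freq_ghz; int issue_width; };   // set once per frame

// Local rule: narrow the issue width when the current interval is not using
// it, otherwise keep the global choice for the rest of the frame.
int local_issue_width(const GlobalDecision& g, double interval_ipc) {
    const double kLowIlpThreshold = 1.0;   // illustrative
    return (interval_ipc < kLowIlpThreshold && g.issue_width > 2) ? 2 : g.issue_width;
}
```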


international symposium on computer architecture | 2007

Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors

Christopher J. Hughes; Radek Grzeszczuk; Eftychios Sifakis; Daehyun Kim; Sanjeev Kumar; Andrew Selle; Jatin Chhugani; Matthew J. Holliman; Yen-Kuang Chen

We explore the emerging application area of physics-based simulation for computer animation and visual special effects. In particular, we examine its parallelization potential and characterize its behavior on a chip multiprocessor (CMP). Applications in this domain model and simulate natural phenomena, and often direct visual components of motion pictures. We study a set of three workloads that exemplify the span and complexity of physical simulation applications used in a production environment: fluid dynamics, facial animation, and cloth simulation. They are computationally demanding, requiring from a few seconds to several minutes to simulate a single frame; therefore, they can benefit greatly from the acceleration possible with large scale CMPs. Starting with serial versions of these applications, we parallelize code accounting for at least 96% of the serial execution time, targeting a large number of threads. We then study the most expensive modules using a simulated 64-core CMP. For the code representing key modules, we achieve parallel scaling of 45x, 50x, and 30x for fluid, face, and cloth simulations, respectively. The modules have a spectrum of parallel task granularity and locking behavior, and all but one are dominated by loop-level parallelism. Many modules operate on streams of data. In some cases, modules iterate over their data, leading to significant temporal locality. This streaming behavior leads to very high on-die and main memory bandwidth requirements. Finally, most modules have little inter-thread communication since they are data-parallel, but a few require heavy communication between data-parallel operations.
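
The dominant pattern the paper reports, loop-level data parallelism over streams of simulation elements, looks roughly like the OpenMP kernel below (explicit integration of independent cloth particles). The integrator and data layout are illustrative and not taken from the studied production codes.

```cpp
// Loop-level, data-parallel update over a stream of particles (build with OpenMP).
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

void integrate_step(std::vector<Particle>& particles,
                    const std::vector<float>& fx,
                    const std::vector<float>& fy,
                    const std::vector<float>& fz,
                    float inv_mass, float dt) {
    // Each particle is independent: streaming reads/writes, no locking needed.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(particles.size()); ++i) {
        Particle& p = particles[i];
        p.vx += fx[i] * inv_mass * dt;
        p.vy += fy[i] * inv_mass * dt;
        p.vz += fz[i] * inv_mass * dt;
        p.x  += p.vx * dt;
        p.y  += p.vy * dt;
        p.z  += p.vz * dt;
    }
}
```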


IEEE Signal Processing Magazine | 2009

Parallel scalability in speech recognition

Kisun You; Jike Chong; Youngmin Yi; Ekaterina Gonina; Christopher J. Hughes; Yen-Kuang Chen; Wonyong Sung; Kurt Keutzer

We propose four application-level implementation alternatives for speech recognition inference, called algorithm styles, and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and a NVIDIA GTX280 manycore processor. The highest performing algorithm style varies with the implementation platform. On a 44-min speech data set, we demonstrate substantial speedups of 3.4x on Core i7 and 10.5x on GTX280 compared to a highly optimized sequential implementation on Core i7 without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
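
Below is a sketch of the kind of data-parallel kernel that dominates each frame of recognition, scoring acoustic observation probabilities for the currently active states, parallelised here with OpenMP. The single-Gaussian state model and data layout are simplifications; the paper's algorithm styles differ mainly in how the subsequent graph-traversal phase is organised.

```cpp
// Per-frame observation-probability scoring over the active states (build with OpenMP).
#include <vector>

// One diagonal-covariance Gaussian per state, for brevity.
struct GaussianState { std::vector<float> mean, inv_var; float log_const; };

void score_active_states(const std::vector<float>& feature,            // one frame of features
                         const std::vector<GaussianState>& states,
                         const std::vector<int>& active,               // active state ids
                         std::vector<float>& log_prob)                 // output, same size as active
{
    #pragma omp parallel for
    for (long k = 0; k < static_cast<long>(active.size()); ++k) {
        const GaussianState& s = states[active[k]];
        float acc = s.log_const;
        for (std::size_t d = 0; d < feature.size(); ++d) {
            float diff = feature[d] - s.mean[d];
            acc -= 0.5f * diff * diff * s.inv_var[d];
        }
        log_prob[k] = acc;
    }
}
```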
