
Publication


Featured research published by Chris J. Newburn.


symposium on code generation and optimization | 2011

Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language

Chris J. Newburn; Byoungro So; Zhenying Liu; Michael McCool; Anwar M. Ghuloum; Stefanus Jakobus Du Toit; Zhigang Wang; Zhaohui Du; Yongjian Chen; Gansha Wu; Peng Guo; Zhanglin Liu; Dan Zhang

Our ability to create systems with large amounts of hardware parallelism is exceeding the average software developer's ability to effectively program them. This is a problem that plagues our industry. Since the vast majority of the world's software developers are not parallel programming experts, making it easy to write, port, and debug applications with sufficient core and vector parallelism is essential to enabling the use of multi- and many-core processor architectures. However, hardware architectures and vector ISAs are also shifting and diversifying quickly, making it difficult for a single binary to run well on all possible targets. Because of this, retargetability and dynamic compilation are of growing relevance. This paper introduces Intel® Array Building Blocks (ArBB), which is a retargetable dynamic compilation framework. This system focuses on making it easier to write and port programs so that they can harvest data and thread parallelism on both multi-core and heterogeneous many-core architectures, while staying within standard C++. ArBB interoperates with other programming models to help meet the demands we hear from customers for a solution with both greater programmer productivity and good performance. This work makes contributions in language features, compiler architecture, code transformations and optimizations. It presents performance data from the current beta release of ArBB and quantitatively shows the impact of some key analyses, enabling transformations and optimizations for a variety of benchmarks that are of interest to our customers.


international symposium on microarchitecture | 2003

Using interaction costs for microarchitectural bottleneck analysis

Brian A. Fields; Rastislav Bodik; Mark D. Hill; Chris J. Newburn

Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other. This parallelism makes it difficult to both: (a) assign a cost to an event (e.g., to one of two overlapping cache misses); and (b) assign blame for each cycle (e.g., for a cycle where many overlapping resources are active). This paper introduces a new model for understanding event costs to facilitate processor design and optimization. First, we observe that everything in a machine (instructions, hardware structures, events) can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization. Finally, we propose performance-monitoring hardware for measuring interaction costs that is suitable for modern processors.


Archive | 2014

Programming Abstractions for Data Locality

Adrian Tate; Amir Kamil; Anshu Dubey; Armin Größlinger; Brad Chamberlain; Brice Goglin; Harold C. Edwards; Chris J. Newburn; David Padua; Didem Unat; Emmanuel Jeannot; Frank Hannig; Tobias Gysi; Hatem Ltaief; James C. Sexton; Jesús Labarta; John Shalf; Karl Fuerlinger; Kathryn O'Brien; Leonidas Linardakis; Maciej Besta; Marie-Christine Sawley; Mark James Abraham; Mauro Bianco; Miquel Pericàs; Naoya Maruyama; Paul H. J. Kelly; Peter Messmer; Robert B. Ross; Romain Cledat

The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems.
These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal.


international supercomputing conference | 2013

Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor

Chris J. Newburn; Rajiv Deodhar; Serguei Dmitriev; Ravi Murty; Ravi Narayanaswamy; John A. Wiegert; Francisco Chinchilla; Russell McGuire

The Intel® Xeon Phi™ coprocessor platform enables offload of computation from a host processor to a coprocessor that is a fully-functional Intel® Architecture CPU. This paper presents the C/C++ and Fortran compiler offload runtime for that coprocessor. The paper addresses why offload to a coprocessor is useful, how it is specified, and what the conditions for the profitability of offload are. It also serves as a guide to potential third-party developers of offload runtimes, such as a gcc-based offload compiler, ports of existing commercial offloading compilers to the Intel® Xeon Phi™ coprocessor such as CAPS®, and third-party offload library vendors that Intel is working with, such as NAG® and MAGMA®. It describes the software architecture and design of the offload compiler runtime. It enumerates the key performance features for this heterogeneous computing stack, related to initialization, data movement and invocation. Finally, it evaluates the performance impact of those features for a set of directed micro-benchmarks and larger workloads.


IEEE Transactions on Parallel and Distributed Systems | 2017

Trends in Data Locality Abstractions for HPC Systems

Didem Unat; Anshu Dubey; Torsten Hoefler; John Shalf; Mark James Abraham; Mauro Bianco; Bradford L. Chamberlain; Romain Cledat; H. Carter Edwards; Hal Finkel; Karl Fuerlinger; Frank Hannig; Emmanuel Jeannot; Amir Kamil; Jeff Keasler; Paul H. J. Kelly; Vitus J. Leung; Hatem Ltaief; Naoya Maruyama; Chris J. Newburn; Miquel Pericàs

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.


high-performance computer architecture | 2016

LASER: Light, Accurate Sharing dEtection and Repair

Liang Luo; Akshitha Sriraman; Brooke Fugate; Shiliang Hu; Gilles Pokam; Chris J. Newburn; Joseph Devietti

Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment. This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.


ieee international conference on high performance computing data and analytics | 2015

Heterogeneous Architecture Library

P. Souza; Chris J. Newburn; L. Borges

Today’s platforms are becoming increasingly heterogeneous. A given platform may have many different computing elements in it: CPUs, coprocessors and GPUs of various kinds. And over time, the platforms on which seismic codes run may change, such that even if a given platform doesn’t have so much variety, the same code base needs to be portable across a wide variety of targets. How can seismic applications support this kind of portability? One answer is the heterogeneous architecture library that Petrobras has created. This library has been in production use since early 2010 by Petrobras for reverse time migration (RTM). It has three back ends: CUDA, OpenCL and regular CPUs. A new back end is in deployment to support the Intel® hStreams library, which provides a streaming abstraction for heterogeneous platforms. The main RTM application and the code that manages the various kinds of devices are all portable.


languages and compilers for parallel computing | 2014

Unification of Static and Dynamic Analyses to Enable Vectorization

Ashay Rane; Rakesh Krishnaiyer; Chris J. Newburn; James C. Browne; Leonardo Fialho; Zakhar Matveev

Modern compilers execute sophisticated static analyses to enable optimization across a wide spectrum of code patterns. However, there are many cases where even the most sophisticated static analysis is insufficient or where the computation complexity makes complete static analysis impractical. It is often possible in these cases to discover further opportunities for optimization from dynamic profiling and provide this information to the compiler, either by adding directives or pragmas to the source, or by modifying the source algorithm or implementation. For current and emerging generations of chips, vectorization is one of the most important of these optimizations. This paper defines, implements, and applies a systematic process for combining the information acquired by static analysis by modern compilers with information acquired by a targeted, high-resolution, low-overhead dynamic profiling tool to enable additional and more effective vectorization. Opportunities for more effective vectorization are frequent and the performance gains obtained are substantial: we show a geometric mean across several benchmarks of over 1.5x in speedup on the Intel Xeon Phi coprocessor.


international symposium on microarchitecture | 2004

Interaction cost: for when event counts just don't add up

Brian A. Fields; Rastislav Bodik; Mark D. Hill; Chris J. Newburn

Most performance analysis tasks boil down to finding bottlenecks. In the context of this article, a bottleneck is any event (for example, branch mispredict, window stall, or arithmetic-logic unit (ALU) operation) that limits performance. Bottleneck analysis is critical to an architect's work, whether the goal is tuning processors for energy efficiency, improving the effectiveness of optimizations, or designing a more balanced processor. Interaction cost helps to improve processor performance and decrease power consumption by identifying when designers can choose among a set of optimizations and when it's necessary to perform them all.


High Performance Parallelism Pearls, Volume 2: Multicore and Many-core Programming Approaches | 2016

Fast Matrix Computations on Heterogeneous Streams

Gaurav Bansal; Chris J. Newburn; Paul Besl

This chapter examines the hStreams library which supports programming of a heterogeneous system by abstracting programming to be akin to feeding a system with streams of actions (computations, data transfers, and synchronizations). Use of the library is illustrated with some problems from the field of linear algebra, endeavoring to show the speed and flexibility of hStreams on both processors and coprocessors. Performance results highlight four key benefits of using hStreams: (1) concurrency of computes across nodes and within a node, (2) pipelined concurrency among data transfers, and between data transfers and computes, (3) ease of use in exploring the optimization space, and (4) separation of concerns between scientists and tuners, between those who may wish to be minimalists, and those who wish to have visibility and exert full control over performance.

