Christopher Barton | Researchain

Publication

Featured research published by Christopher Barton.

international conference on parallel architectures and compilation techniques | 2012

Evaluation of Blue Gene/Q hardware support for transactional memories

Amy Wang; Matthew Gaudet; Peng Wu; José Nelson Amaral; Martin Ohmacht; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.

programming language design and implementation | 2006

Shared memory programming for large scale machines

Christopher Barton; Călin Caşcaval; George S. Almasi; Yili Zheng; Montse Farreras; Siddhartha Chatterjee; José Nelson Amaral

This paper describes the design and implementation of a scalable runtime system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, eliminates several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM, and NAS CG) to obtain scaling and absolute performance numbers on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
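The elimination of affinity tests mentioned in this abstract can be illustrated with a small sketch. It assumes the common case of a UPC `upc_forall` loop whose affinity expression is the integer loop index, so that a thread executes iteration `i` only when `i % THREADS == MYTHREAD`. The function names are hypothetical, and Python lists stand in for the iterations each thread would run; this is not the paper's actual code generation.

```python
def forall_with_affinity_test(n, threads, mythread):
    """Naive lowering: each thread scans all n iterations and tests affinity."""
    executed = []
    for i in range(n):
        if i % threads == mythread:  # per-iteration affinity test
            executed.append(i)
    return executed


def forall_strided(n, threads, mythread):
    """Optimized lowering: the test is folded into the loop start and stride."""
    return list(range(mythread, n, threads))
```

Both versions visit the same iterations on every thread, but the strided form performs no per-iteration branch and does `n / THREADS` loop steps instead of `n`.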

languages and compilers for parallel computing | 2007

Multidimensional Blocking in UPC

Christopher Barton; Călin Caşcaval; George S. Almasi; Rahul Garg; José Nelson Amaral; Montse Farreras

Partitioned Global Address Space (PGAS) languages offer an attractive, high-productivity programming model for programming large-scale parallel machines. PGAS languages, such as Unified Parallel C (UPC), combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm by allowing users control over the data layout. PGAS languages distinguish between private, shared-local, and shared-remote memory, with shared-remote accesses typically much more expensive than shared-local and private accesses, especially on distributed memory machines where shared-remote access implies communication over a network. In this paper we present a simple extension to the UPC language that allows the programmer to block shared arrays in multiple dimensions. We claim that this extension allows for better control of locality, and therefore performance, in the language. We describe an analysis that allows the compiler to distinguish between local shared array accesses and remote shared array accesses. Local shared array accesses are then transformed into direct memory accesses by the compiler, saving the overhead of a locality check at runtime. We present results to show that locality analysis is able to significantly reduce the number of shared accesses.
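To make the locality question concrete, the sketch below computes which thread owns a shared array element. The one-dimensional rule, `(i // block) % THREADS`, is the standard UPC block-cyclic layout. The two-dimensional rule is an assumption made purely for illustration (tiles numbered in row-major order and dealt round-robin to threads); the abstract does not state the distribution the extension actually uses, and all function names are hypothetical.

```python
def owner_1d(i, block, threads):
    """Standard UPC block-cyclic layout: blocks are dealt round-robin to threads."""
    return (i // block) % threads


def owner_2d(i, j, b0, b1, ncols, threads):
    """ASSUMED 2-D layout: b0 x b1 tiles, numbered row-major, dealt round-robin."""
    tiles_per_row = -(-ncols // b1)  # ceiling division
    tile_id = (i // b0) * tiles_per_row + (j // b1)
    return tile_id % threads


def is_local(i, j, b0, b1, ncols, threads, mythread):
    """True if element (i, j) has affinity to mythread under the assumed layout."""
    return owner_2d(i, j, b0, b1, ncols, threads) == mythread
```

When a compiler can evaluate a predicate like `is_local` at compile time for a whole loop range, it can replace the shared access with a direct memory access and skip the runtime locality check.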

compiler construction | 2005

Generalized index-set splitting

Christopher Barton; Arie Tal; Bob Blainey; José Nelson Amaral

This paper introduces Index-Set Splitting (ISS), a technique that splits a loop containing several conditional statements into several loops with less complex control flow. Contrary to the classic loop unswitching technique, ISS splits loops when the conditional is loop variant. ISS uses an Index Sub-range Tree (IST) to identify the structure of the conditionals in the loop and to select which conditionals should be eliminated. This decision is based on an estimation of the code growth for each splitting: a greedy algorithm spends a pre-determined code growth budget. ISTs separate the decision about which splits to perform from the actual code generation for the split loops. The use of ISS to improve a loop fusion framework is then discussed. ISS opportunity identification in the SPEC2000 benchmark suite and three other suites demonstrates that ISS is a general technique that may benefit other compilers.
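A minimal sketch of the basic idea follows, assuming a single loop-variant condition of the form `i < k`. The function names and the loop body are hypothetical, and the sketch covers neither the Index Sub-range Tree nor the code-growth budget described in the abstract.

```python
def original_loop(a, k):
    """One loop whose branch depends on the induction variable i."""
    out = []
    for i in range(len(a)):
        if i < k:  # loop-variant condition
            out.append(a[i] * 2)
        else:
            out.append(a[i] + 1)
    return out


def split_loops(a, k):
    """Index set [0, n) split at k into two branch-free loops."""
    n = len(a)
    split = min(max(k, 0), n)  # clamp the split point into [0, n]
    out = []
    for i in range(0, split):  # sub-range where i < k always holds
        out.append(a[i] * 2)
    for i in range(split, n):  # sub-range where i >= k always holds
        out.append(a[i] + 1)
    return out
```

Loop unswitching could not hoist this condition because its value changes with `i`; splitting the index range removes the branch from both loop bodies at the cost of duplicating the loop.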

IEEE Transactions on Computers | 2015

Software Support and Evaluation of Hardware Transactional Memory on Blue Gene/Q

Amy Wang; Matthew Gaudet; Peng Wu; Martin Ohmacht; José Nelson Amaral; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of a transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q machine. The TM programming model supports most C/C++ programming constructs using a best-effort HTM and the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP and the RMS-TM benchmark suites on BG/Q is the first of its kind in understanding characteristics of running TM workloads on real hardware TM. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.

languages and compilers for parallel computing | 2006

A characterization of shared data access patterns in UPC programs

Christopher Barton; Călin Caşcaval; José Nelson Amaral

The main attraction of Partitioned Global Address Space (PGAS) languages to programmers is the ability to distribute the data to exploit the affinity of threads within shared-memory domains. Thus, PGAS languages, such as Unified Parallel C (UPC), are a promising programming paradigm for emerging parallel machines that employ hierarchical data- and task-parallelism. For example, large systems are built as distributed-shared memory architectures, where multicore nodes access a local, coherent address space and many such nodes are interconnected in a non-coherent address space to form a high-performance system. This paper studies the access patterns of shared data in UPC programs. By analyzing the access patterns of shared data in UPC we are able to make three major observations about the characteristics of programs written in a PGAS programming model: (i) there is strong evidence to support the development of automatic identification and automatic privatization of local shared data accesses; (ii) the ability for the programmer to specify how shared data is distributed among the executing threads can result in significant performance improvements; (iii) running UPC programs on a hybrid architecture will significantly increase the opportunities for automatic privatization of local shared data accesses.

conference of the centre for advanced studies on collaborative research | 2009

OpenMP tasking analysis for programmers

Xavier Teruel; Christopher Barton; Alejandro Duran; Xavier Martorell; Eduard Ayguadé; Priya Unnikrishnan; Guansong Zhang; Raul Esteban Silvera

As of 2008, the OpenMP 3.0 standard includes task support, allowing programmers to exploit irregular parallelism. Although several compilers provide support for this new feature, there has not been extensive investigation into the real possibilities of this extension. Several papers have discussed the programming model itself, while other papers have discussed design and implementation on different platforms. There are also papers demonstrating performance results using well-known kernel applications. This paper presents an analysis of the OpenMP tasking model possibilities, using the IBM XL compiler implementation. Using different parameters such as the number of tasks, task granularity, and parallelism pattern, this paper explores how such parameters can affect the average performance and identifies the limits of the OpenMP tasking model.

conference of the centre for advanced studies on collaborative research | 2010

Reducing data access latency in SDSM systems using runtime optimizations

Javier Bueno; Xavier Martorell; Juan José Costa; Toni Cortes; Eduard Ayguadé; Guansong Zhang; Christopher Barton; Raul Esteban Silvera

Software Distributed Shared Memory (SDSM) systems offer a convenient way to run applications developed for shared memory systems on distributed systems without modification. However, since SDSM systems add an extra layer of abstraction to the memory hierarchy, applications may suffer performance problems when running on top of them. Our main research interest is to develop a set of compiler and runtime system techniques that widen the range of applications that can efficiently run on SDSM systems. Currently we are targeting OpenMP applications due to the ease of use this programming model provides. In this paper we show the performance of a set of regular applications that perform well on our SDSM system. They were adapted from OpenCL codes provided by ATI and rewritten in OpenMP. When running more complex applications with different data access patterns, we encounter more difficulties on a DSM system. As an example, we show the performance evaluation of the NAS MG benchmark and two techniques we have developed to improve its data locality. Our SDSM infrastructure is composed of NanosDSM, an everything-shared SDSM developed at the Technical University of Catalonia (UPC) and the Barcelona Supercomputing Center (BSC), and the IBM XL SMP Runtime to allow the execution of the OpenMP applications.

languages and compilers for parallel computing | 2002

Removing impediments to loop fusion through code transformations

Bob Blainey; Christopher Barton; José Nelson Amaral

Loop fusion is a common optimization technique that takes several loops and combines them into a single large loop. Most of the existing work on loop fusion concentrates on the heuristics required to optimize an objective function, such as data reuse or creation of instruction-level parallelism opportunities. Often, however, the code provided to a compiler has only small sets of loops that are control flow equivalent, normalized, have the same iteration count, are adjacent, and have no fusion-preventing dependences. This paper focuses on code transformations that create more opportunities for loop fusion in the IBM® XL compiler suite that generates code for the IBM family of PowerPC® processors. In this compiler an objective function is used at the loop distributor to decide which portions of a loop should remain in the same loop nest and which portions should be redistributed. Our algorithm focuses on eliminating conditions that prevent loop fusion. By generating maximal fusion our algorithm increases the scope of later transformations. We tested our improved code generator on an IBM pSeries 690 machine equipped with a POWER4 processor using the SPEC CPU2000 benchmark suite. Our improvements to loop fusion resulted in three times as many loops fused in a subset of CFP2000 benchmarks, and four times as many for a subset of CINT2000 benchmarks.
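One of the impediments listed in this abstract, differing iteration counts, can be illustrated with a small sketch. It assumes two adjacent loops with no fusion-preventing dependences and removes the mismatch by fusing the common index range and peeling the remainder. This is one plausible transformation chosen for illustration, with hypothetical function names; the abstract does not specify which transformations the compiler applies.

```python
def unfused(a, n, m):
    """Two adjacent loops with different iteration counts (needs len(a) >= max(n, m))."""
    b = [0] * n
    c = [0] * m
    for i in range(n):
        b[i] = a[i] + 1
    for i in range(m):
        c[i] = a[i] * 2
    return b, c


def fused(a, n, m):
    """Fuse over the common range, then run the peeled remainder of the longer loop."""
    b = [0] * n
    c = [0] * m
    common = min(n, m)
    for i in range(common):  # fused body: a[i] is read once per iteration
        b[i] = a[i] + 1
        c[i] = a[i] * 2
    for i in range(common, n):  # remainder of the first loop (empty if n <= m)
        b[i] = a[i] + 1
    for i in range(common, m):  # remainder of the second loop (empty if m <= n)
        c[i] = a[i] * 2
    return b, c
```

The fused body touches `a[i]` for both computations in the same iteration, which is the kind of data reuse that fusion heuristics try to maximize.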

Archive | 2012