
Publication


Featured research published by Kathryn M. O'Brien.


International Conference on Parallel Architectures and Compilation Techniques | 2005

Optimizing Compiler for the CELL Processor

Alexandre E. Eichenberger; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; J.C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; Peng Zhao; Michael Karl Gschwind

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 byte-wide values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code for the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar code on SIMD units, automatic generation of SIMD code, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
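As an illustration of the automatic SIMD code generation the abstract mentions, the following C sketch shows the transformation conceptually: the loop body is widened to four 32-bit lanes (one 16-byte SPE register holds 4 floats), with a scalar epilogue for the remainder. This is an expository sketch only; a real SIMDizing compiler emits vector instructions rather than unrolled scalar code.

```c
/* Scalar loop: one float multiply-add per iteration. */
void saxpy_scalar(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* The same loop restructured the way an auto-SIMDizing compiler
 * would: the body is widened to 4 lanes, with a scalar epilogue
 * handling the iterations left over when n is not a multiple of 4. */
void saxpy_simdized(int n, float a, const float *x, float *y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {       /* vectorized main loop */
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)                 /* scalar epilogue */
        y[i] = a * x[i] + y[i];
}
```

Both versions compute the same result; the widened form simply exposes the four independent lanes that map onto one SIMD instruction.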


International Journal of Parallel Programming | 2008

Supporting OpenMP on cell

Kevin O'Brien; Kathryn M. O'Brien; Zehra Sura; Tong Chen; Tao Zhang

Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industry standard for writing parallel programs on shared-memory architectures in C, C++, and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions, and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts away some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some language extensions to OpenMP are proposed to express transactions. These extensions are implemented in our source-to-source OpenMP Mercurium compiler and in Nebelung, our Software Transactional Memory (STM) runtime system that supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or hardware-assisted STM (HaSTM) are seen as possible paths to make the TM-OpenMP tandem more scalable. In the evaluation section we present preliminary results. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM.
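The synchronization constructs the paper aims to simplify appear in even the smallest OpenMP programs. The C sketch below shows a parallel sum whose shared accumulator is guarded by an atomic region; under the paper's proposal, such an update could be marked as a transaction instead (the proposed directive syntax is not reproduced here). The code is valid standard OpenMP and degrades gracefully to a correct serial loop if the pragmas are ignored.

```c
/* Parallel sum with a shared accumulator.  The atomic region is the
 * kind of synchronization point that Transactional Memory is proposed
 * to replace: the programmer would declare the update a transaction
 * and let the TM runtime manage the concurrent access. */
long sum_atomic(const int *a, int n) {
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        total += a[i];
    }
    return total;
}
```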


Languages and Compilers for Parallel Computing | 2006

Optimizing the use of static buffers for DMA on a CELL chip

Tong Chen; Zehra Sura; Kathryn M. O'Brien; John Kevin Patrick O'Brien

The CELL architecture has one Power Processor Element (PPE) core, and eight Synergistic Processor Element (SPE) cores that have a distinct instruction set architecture of their own. The PPE core accesses memory via a traditional caching mechanism, but each SPE core can only access memory via a small 256K software-controlled local store. The PPE cache and SPE local stores are connected to each other and main memory via a high bandwidth bus. Software is responsible for all data transfers to and from the SPE local stores. To hide the high latency of DMA transfers, data may be prefetched into SPE local stores using loop blocking transformations and static buffers. We find that the performance of an application can vary depending on the size of the buffers used, and whether a single-, double-, or triple-buffer scheme is used. Constrained by the limited space available for data buffers in the SPE local store, we want to choose the optimal buffering scheme for a given space budget. Also, we want to be able to determine the optimal buffer size for a given scheme, such that using a larger buffer size results in negligible performance improvement. We develop a model to automatically infer these parameters for static buffering, taking into account the DMA latency and transfer rates, and the amount of computation in the application loop being targeted. We test the accuracy of our prediction model using a research prototype compiler developed on top of the IBM XL compiler infrastructure.
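The double-buffering scheme the abstract analyzes can be sketched in plain C. In the sketch below, `memcpy` stands in for an asynchronous DMA get, so the communication/computation overlap is structural only; the function name, `TILE` size, and the assumption that `n` is a multiple of `TILE` are all illustrative choices, not details from the paper.

```c
#include <string.h>

#define TILE 4  /* elements per buffer; a real SPE buffer is sized to fit the local store budget */

/* Double-buffered traversal: while the tile in bufs[cur] is being
 * computed on, the next tile is (conceptually) being filled into
 * bufs[1-cur] by an asynchronous DMA.  Assumes n is a multiple of TILE. */
long sum_double_buffered(const int *global_mem, int n) {
    int bufs[2][TILE];
    long total = 0;
    int cur = 0;
    memcpy(bufs[cur], global_mem, TILE * sizeof(int));   /* prefetch first tile */
    for (int t = 0; t < n / TILE; t++) {
        int next = 1 - cur;
        if ((t + 1) * TILE < n)   /* issue the "DMA get" for the next tile */
            memcpy(bufs[next], global_mem + (t + 1) * TILE, TILE * sizeof(int));
        for (int i = 0; i < TILE; i++)   /* compute on the current tile */
            total += bufs[cur][i];
        cur = next;   /* swap buffers; a real SPE would wait on DMA completion here */
    }
    return total;
}
```

The paper's model chooses between this two-buffer structure, a single buffer, or three buffers, based on DMA latency, transfer rate, and the amount of computation per tile.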


International Conference on Parallel Architectures and Compilation Techniques | 2008

Hybrid access-specific software cache techniques for the Cell BE architecture

Marc Gonzàlez; Nikola Vujic; Xavier Martorell; Eduard Ayguadé; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien

Ease of programming is one of the main impediments to the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access patterns. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
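To make the "software cache" idea concrete, here is a minimal direct-mapped read cache over a "global" array. All structure names, sizes, and the miss-fill via `memcpy` are invented for exposition; a real Cell implementation would DMA the line from system memory into the SPE local store on a miss, and the paper's hybrid design would route high-locality and irregular accesses to different structures.

```c
#include <string.h>

#define LINES 4        /* cache lines (illustrative) */
#define LINE_WORDS 8   /* ints per line (illustrative) */

/* Minimal direct-mapped software cache: a tag array plus line storage,
 * all managed entirely in software, as on an SPE local store. */
typedef struct {
    long tag[LINES];                 /* which global line each slot holds (-1 = empty) */
    int  data[LINES][LINE_WORDS];
    const int *global_mem;
    int misses;
} SwCache;

void cache_init(SwCache *c, const int *mem) {
    for (int i = 0; i < LINES; i++) c->tag[i] = -1;
    c->global_mem = mem;
    c->misses = 0;
}

int cache_read(SwCache *c, long addr) {
    long line = addr / LINE_WORDS;
    int slot = (int)(line % LINES);
    if (c->tag[slot] != line) {      /* miss: fetch the whole line ("DMA get") */
        memcpy(c->data[slot], c->global_mem + line * LINE_WORDS,
               LINE_WORDS * sizeof(int));
        c->tag[slot] = line;
        c->misses++;
    }
    return c->data[slot][addr % LINE_WORDS];
}
```

Every access pays the tag check in software, which is exactly the per-reference overhead the paper's compiler optimizations try to hoist out of inner loops.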


Languages and Compilers for Parallel Computing | 2007

A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor

Jairo Balart; Marc Gonzàlez; Xavier Martorell; Eduard Ayguadé; Zehra Sura; Tong Chen; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien

This paper describes the implementation of a runtime library for asynchronous communication on the Cell BE processor. The runtime library provides several services that allow the compiler to generate code that maximizes the chances for overlapping communication and computation. The library is organized as a software cache, and its main services correspond to mechanisms for data lookup, data placement and replacement, data write-back, memory synchronization, and address translation. The implementation guarantees that all of these services can be fully decoupled when dealing with memory references, which gives the compiler opportunities to organize the generated code so as to overlap computation with communication as much as possible. The paper also describes the mechanisms necessary to overlap the communication related to write-back operations with actual computation, and it includes a description of the compiler's basic algorithms and optimizations for code generation. The system is evaluated by measuring bandwidth and global update ratios with two benchmarks from the HPCC benchmark suite: Stream and Random Access.
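The decoupled write-back service can be sketched as a deferred-put queue: modified data is queued rather than copied back immediately, so a real implementation could overlap the DMA put with ongoing computation and only wait at a synchronization point. The structure below is an invented illustration (with `memcpy` standing in for the DMA engine), not the paper's actual library interface.

```c
#include <string.h>

#define QCAP 8  /* queue capacity (illustrative) */

/* Deferred write-back queue: enqueue records a pending "DMA put";
 * flush performs all pending puts at a synchronization point. */
typedef struct {
    int *dst[QCAP];
    const int *src[QCAP];
    int words[QCAP];
    int pending;
} WbQueue;

/* Queue a modified local buffer for write-back to global memory. */
void wb_enqueue(WbQueue *q, int *dst, const int *src, int words) {
    q->dst[q->pending] = dst;
    q->src[q->pending] = src;
    q->words[q->pending] = words;
    q->pending++;
}

/* Synchronization point: complete (here: perform) all queued puts. */
void wb_flush(WbQueue *q) {
    for (int i = 0; i < q->pending; i++)
        memcpy(q->dst[i], q->src[i], q->words[i] * sizeof(int));
    q->pending = 0;
}
```

Because enqueue and flush are separate calls, the compiler is free to place computation between them, which is the overlap the paper targets.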


International Conference on Parallel Architectures and Compilation Techniques | 2009

Exploiting Parallelism with Dependence-Aware Scheduling

Xiaotong Zhuang; Alexandre E. Eichenberger; Yangchun Luo; Kevin O'Brien; Kathryn M. O'Brien

It is well known that a large fraction of applications cannot be parallelized at compile time due to unpredictable data dependences such as indirect memory accesses and/or memory accesses guarded by data-dependent conditional statements. A significant body of prior work attempts to parallelize such applications using runtime data-dependence analysis and scheduling. Performance is highly dependent on the ratio of the dependence-analysis overhead to the actual amount of parallelism available in the code. We have found that the overheads are often high and the available parallelism is often low when evaluating applications on a modern multicore processor. We propose a novel software-based approach called dependence-aware scheduling to parallelize loops with unknown data dependences. Unlike prior work, our main goal is to reduce the negative impact of dependence computation, such that when there is no opportunity for speedup, the code can still run without much slowdown. When there is an opportunity, dependence-aware scheduling is able to yield very impressive speedup. Our results indicate that dependence-aware scheduling can greatly improve performance, with up to 4x speedups, for a number of computation-intensive applications. Furthermore, the results show negligible slowdowns in a stress test where parallelism is continuously detected but not exploited.
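The core runtime question in such schemes can be shown with a toy dependence test: before running two chunks of a loop with indirect writes `a[idx[i]]` in parallel, check that their write sets are disjoint. This quadratic check is a deliberately simplified illustration; the paper's mechanism is more sophisticated and focuses precisely on keeping this kind of overhead from erasing the parallel gains.

```c
#include <stdbool.h>

/* Runtime dependence test (simplified illustration): two chunks of a
 * loop that writes through an indirection array, a[idx[i]], can run in
 * parallel only if no element index appears in both chunks' ranges. */
bool chunks_independent(const int *idx, int lo1, int hi1, int lo2, int hi2) {
    for (int i = lo1; i < hi1; i++)
        for (int j = lo2; j < hi2; j++)
            if (idx[i] == idx[j])
                return false;   /* both chunks write idx[i]: dependence */
    return true;
}
```

When the test fails, the scheduler falls back to serial execution; when it succeeds, the chunks are dispatched concurrently.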


International Workshop on OpenMP | 2007

Supporting OpenMP on Cell

Kevin O'Brien; Kathryn M. O'Brien; Zehra Sura; Tong Chen; Tao Zhang

The Cell processor is a heterogeneous multi-core processor with one Power Processing Engine (PPE) core and eight Synergistic Processing Engine (SPE) cores. Each SPE has a small (256K) directly accessible local memory, and it can access system memory through DMA operations. Cell programming is complicated both by the need to explicitly manage DMA data transfers for SPE computation and by the multiple layers of parallelism provided in the architecture, including heterogeneous cores, multiple SPE cores, multithreading, SIMD units, and multiple instruction issue. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computational power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. OpenMP is a widely used API for parallel programming. It is attractive to support because programmers can continue using their familiar programming model and existing code can be reused. We base our work on IBM's XL compiler, which already has OpenMP support for AIX multiprocessor systems built with Power processors. We developed new components in the XL compiler and a new runtime library for Cell OpenMP that utilizes the Cell SDK libraries to target specific features of the new hardware platform. To describe the design of our Cell OpenMP implementation, we focus on three major issues in our system: 1) how to use the heterogeneous cores and synchronization support in the Cell to optimize OpenMP threads; 2) how to generate thread code targeting the different instruction sets of the PPE and SPE from within a compiler that takes single-source input; and 3) how to implement the OpenMP memory model on the Cell memory system. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. We also observe detailed runtime event sequences using the visualization tool Paraver, and we use the insight into actual thread and synchronization behaviors to direct further optimizations.
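The appeal of this approach is that the input stays a portable OpenMP loop. A sketch of the kind of single-source code involved (the function is invented for illustration): per the abstract, the compiler outlines the loop body, generates code for both the PPE and SPE instruction sets, and the runtime distributes iterations across SPE threads.

```c
/* Ordinary OpenMP worksharing loop: nothing here is Cell-specific.
 * On Cell, the single-source compilation path described in the paper
 * compiles the outlined body for both PPE and SPE and manages the
 * SPE-side DMA transfers behind the scenes. */
void scale(float *v, int n, float factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= factor;
}
```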


SIGPLAN Notices | 1995

XIL and YIL: the intermediate languages of TOBEY

Kevin O'Brien; Kathryn M. O'Brien; Martin Edward Hopkins; Arvin Shepherd; Ronald C. Unrau

Typically, the choice of intermediate representation by a particular compiler implementation seeks to address a specific goal. The intermediate language of the TOBEY compilers, XIL, was initially chosen to facilitate the production of highly optimized scalar code, yet it was easily extended to a higher-level form, YIL, in order to support a new suite of optimizations which, in most existing compilers, are done at the level of source-to-source translation. In this paper we discuss those design features of XIL that were important factors in the production of optimal scalar code. In addition, we demonstrate how the strength of the YIL abstraction lay in its ability to access the underlying low-level representation.
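The two-level IR idea can be pictured with a pair of C structures. These are invented for exposition, not the actual XIL/YIL definitions: the low level is a list of machine-like operations, while the high-level loop node keeps a pointer into that list, which is the property the abstract highlights, as loop-level optimizations retain access to the underlying low-level code.

```c
/* Low-level IR: a linked list of machine-like operations (XIL-like). */
typedef struct LowOp {
    const char *opcode;       /* e.g. "load", "fma", "store" */
    struct LowOp *next;
} LowOp;

/* High-level IR: a normalized loop abstraction (YIL-like) that still
 * points at the low-level operations forming its body. */
typedef struct {
    int lower, upper, step;   /* normalized loop bounds */
    LowOp *body;              /* direct access to the low-level form */
} HighLoop;

/* A high-level pass can walk the underlying low-level code directly. */
int count_ops(const HighLoop *l) {
    int n = 0;
    for (const LowOp *op = l->body; op; op = op->next) n++;
    return n;
}
```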


Languages, Compilers, and Tools for Embedded Systems | 2007

Issues and challenges in compiling for the CBEA

Kathryn M. O'Brien

It has been demonstrated that expert programmers can develop and hand-tune applications to exploit the full performance potential of the CBE architecture. We believe that sophisticated compiler optimization technology can bridge the gap between usability and performance in this arena. To this end we have developed a research prototype compiler targeting the Cell processor. In this talk we describe a variety of compiler techniques to exploit the Cell processor. These techniques are aimed at automatically generating high-quality code for the heterogeneous parallelism available on the Cell processor. In particular, we focus the discussion on managing the small local memories of the SPEs and discuss our approach to presenting the user with a single shared-memory image through our compiler-controlled software cache. We also report and discuss the results we have achieved to date, which indicate that significant speedup can be achieved on this processor with a high level of support from the compiler.


IBM Systems Journal | 2006

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Alexandre E. Eichenberger; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; Janice C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; P. Zhao; Michael Karl Gschwind; Roch Georges Archambault; Y. Gao; R. Koo
