Publication


Featured research published by Kevin O'Brien.


Programming Language Design and Implementation | 2004

Vectorization for SIMD architectures with alignment constraints

Alexandre E. Eichenberger; Peng Wu; Kevin O'Brien

When vectorizing for SIMD architectures that are commonly employed by today's multimedia extensions, one of the new challenges that arises is the handling of memory alignment. Prior research has focused primarily on vectorizing loops where all memory references are properly aligned. An important aspect of this problem, namely how to vectorize misaligned memory references, still remains unaddressed. This paper presents a compilation scheme that systematically vectorizes loops in the presence of misaligned memory references. The core of our technique is to automatically reorganize data in registers to satisfy the alignment requirement imposed by the hardware. To reduce the data reorganization overhead, we propose several techniques that minimize the number of data reorganization operations generated. During code generation, our algorithm also exploits temporal reuse when aligning references that access contiguous memory across loop iterations. Our code generation scheme guarantees that the data associated with a single static access is never loaded twice. Experimental results indicate near-peak speedup factors, e.g., 3.71 with 4 data elements per vector and 6.06 with 8 data elements per vector, for a set of loops in which 75% or more of the static memory references are misaligned.
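
A minimal sketch of the register-level realignment idea, assuming 4-wide vectors modeled as aligned 4-float chunks; vec_shift is a hypothetical stand-in for the shift/permute operations the hardware would perform, and a production scheme would keep the previous chunk in a register to exploit the temporal reuse the abstract describes.

    #include <stdio.h>

    #define VLEN 4

    /* Merge two aligned chunks so the result starts at 'offset'. */
    static void vec_shift(const float *lo, const float *hi, int offset, float *out) {
        for (int i = 0; i < VLEN; i++)
            out[i] = (i + offset < VLEN) ? lo[i + offset] : hi[i + offset - VLEN];
    }

    int main(void) {
        float a[16];
        for (int i = 0; i < 16; i++) a[i] = (float)i;
        int misalign = 1;   /* the stream starts at a[1], off a 4-float boundary */
        float v[VLEN];
        for (int base = 0; base + 2 * VLEN <= 16; base += VLEN) {
            /* Two aligned loads feed one realigned vector; a real scheme
               would hold the previous chunk in a register to avoid
               reloading any aligned chunk twice. */
            vec_shift(&a[base], &a[base + VLEN], misalign, v);
            for (int i = 0; i < VLEN; i++) printf("%g ", v[i]);
            putchar('\n');
        }
        return 0;
    }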


International Journal of Parallel Programming | 2008

Supporting OpenMP on Cell

Kevin O'Brien; Kathryn M. O'Brien; Zehra Sura; Tong Chen; Tao Zhang

Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industry standard for writing parallel programs in C, C++, and Fortran on shared-memory architectures. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions, and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts away some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM: we propose language extensions to OpenMP to express transactions, and implement them in our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) runtime system, Nebelung, which supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or hardware-assisted STM (HaSTM) are seen as possible paths to making the TM-OpenMP tandem more scalable. In the evaluation section we show preliminary results, and the paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM.
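
A hedged illustration of what such extensions aim to replace: the runnable code below uses standard OpenMP critical for a shared histogram update, and the comment shows a hypothetical transaction directive (the paper's actual syntax may differ) that would let an STM runtime detect conflicts optimistically instead of serializing.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double hist[8] = {0};
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            int bin = (i * 7) % 8;
            /* With a transactional extension this might read:
               #pragma omp transaction
               { hist[bin] += 1.0; }
               letting the STM runtime retry on conflict rather than
               serializing every update as 'critical' does. */
            #pragma omp critical
            hist[bin] += 1.0;
        }
        for (int b = 0; b < 8; b++) printf("bin %d: %g\n", b, hist[b]);
        return 0;
    }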


International Conference on Parallel Architectures and Compilation Techniques | 2008

Hybrid access-specific software cache techniques for the Cell BE architecture

Marc Gonzàlez; Nikola Vujic; Xavier Martorell; Eduard Ayguadé; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien

Ease of programming is one of the main impediments to the broad acceptance of multi-core systems that lack hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but it can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular, and then steers each memory reference toward one of two cache structures optimized for its access pattern. The specific cache structures are designed to let high-level compiler optimizations aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 over a traditional software-cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
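
A minimal sketch of the baseline structure such a software cache specializes, assuming a direct-mapped design with 128-byte lines; global_mem, dma_get, and cache_lookup are illustrative names, and the paper's hybrid scheme would route high-locality and irregular accesses to differently optimized variants of this lookup.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define LINES      64
    #define LINE_BYTES 128

    static char     global_mem[1 << 20];      /* stand-in for system memory */
    static uint32_t tags[LINES];
    static char     data[LINES][LINE_BYTES];  /* stand-in for local store   */
    static int      valid[LINES];

    /* Stand-in for a DMA transfer from global memory into local store. */
    static void dma_get(void *dst, uint32_t global_addr, int n) {
        memcpy(dst, global_mem + global_addr, n);
    }

    /* Return a local-store pointer for a global address, filling on a miss. */
    static char *cache_lookup(uint32_t addr) {
        uint32_t line = addr / LINE_BYTES;
        int slot = line % LINES;                    /* direct-mapped slot */
        if (!valid[slot] || tags[slot] != line) {   /* miss: refill slot  */
            dma_get(data[slot], line * LINE_BYTES, LINE_BYTES);
            tags[slot] = line;
            valid[slot] = 1;
        }
        return &data[slot][addr % LINE_BYTES];
    }

    int main(void) {
        global_mem[1000] = 42;
        printf("%d\n", *cache_lookup(1000));   /* miss, then fill */
        printf("%d\n", *cache_lookup(1001));   /* hit on the same line */
        return 0;
    }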


Languages and Compilers for Parallel Computing | 2007

A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor

Jairo Balart; Marc Gonzàlez; Xavier Martorell; Eduard Ayguadé; Zehra Sura; Tong Chen; Tao Zhang; Kevin O'Brien; Kathryn M. O'Brien

This paper describes the implementation of a runtime library for asynchronous communication on the Cell BE processor. The library provides several services that allow the compiler to generate code that maximizes the opportunity to overlap communication with computation. It is organized as a software cache, and its main services are mechanisms for data lookup, data placement and replacement, data write-back, memory synchronization, and address translation. The implementation guarantees that all of these services can be fully decoupled when handling memory references, giving the compiler the freedom to organize the generated code so that computation overlaps communication as much as possible. The paper also describes the mechanism needed to overlap the communication associated with write-back operations with actual computation, and it presents the compiler's basic algorithms and optimizations for code generation. The system is evaluated by measuring bandwidth and global-update rates with two benchmarks from the HPCC benchmark suite: Stream and RandomAccess.
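
A sketch of the decoupling the library enables, under the assumption that miss handling can be split into a hypothetical cache_touch (start the transfer) and cache_wait (block on it); here the fill is a synchronous memcpy stand-in, whereas the real library would issue an asynchronous DMA in cache_touch so the next chunk's transfer overlaps the current chunk's computation.

    #include <stdio.h>
    #include <string.h>

    #define CHUNK   32
    #define NCHUNKS 8

    static float global_mem[NCHUNKS * CHUNK];  /* stand-in for system memory */
    static float local_buf[NCHUNKS][CHUNK];    /* stand-in for local store   */
    static int   filled[NCHUNKS];

    /* Would start an asynchronous DMA; modeled here as a synchronous copy. */
    static void cache_touch(int c) {
        if (!filled[c]) {
            memcpy(local_buf[c], &global_mem[c * CHUNK], sizeof local_buf[c]);
            filled[c] = 1;
        }
    }

    /* Would block on the DMA tag for chunk c; a no-op in this model. */
    static void cache_wait(int c) { (void)c; }

    int main(void) {
        for (int i = 0; i < NCHUNKS * CHUNK; i++) global_mem[i] = (float)i;
        cache_touch(0);
        for (int c = 0; c < NCHUNKS; c++) {
            if (c + 1 < NCHUNKS) cache_touch(c + 1);  /* prefetch next chunk */
            cache_wait(c);                            /* data is now local   */
            for (int i = 0; i < CHUNK; i++) local_buf[c][i] *= 2.0f;
        }
        printf("%g\n", local_buf[NCHUNKS - 1][CHUNK - 1]);  /* 2 * 255 = 510 */
        return 0;
    }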


Symposium on Code Generation and Optimization | 2010

Automatic creation of tile size selection models

Tomofumi Yuki; Lakshminarayanan Renganarayanan; Sanjay V. Rajopadhye; Charles W. Anderson; Alexandre E. Eichenberger; Kevin O'Brien

Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. Effective use of tiling requires selecting and tuning the tile sizes. This is usually achieved by hand-crafting tile size selection (TSS) models that characterize the performance of the tiled program as a function of tile sizes. The best tile sizes are selected either directly from the TSS model or by using the TSS model together with an empirical search. Hand-crafting accurate TSS models is hard, and adapting them to a different architecture or compiler, or even keeping them up to date as a single compiler evolves, is often just as hard. Instead of hand-crafting TSS models, can we learn or create them automatically? In this paper, we show that for a specific class of programs, fairly accurate TSS models can be created automatically using a combination of simple program features, synthetic kernels, and standard machine learning techniques. The automatic TSS model generation scheme can also be used directly to adapt a model or keep it up to date. We evaluate our scheme on six architecture-compiler combinations (drawn from three architectures and four compilers). The models learned by our method consistently show near-optimal performance (within 5% of optimal on average) across all architecture-compiler combinations.
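
A small sketch of the empirical-search side of tile size selection that a learned TSS model would prune or replace; the kernel, candidate sizes, and timing method are arbitrary choices for illustration.

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N];

    /* Time one run of a tiled transpose-copy kernel with tile size T. */
    static double run_tiled(int T) {
        clock_t t0 = clock();
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T && i < N; i++)
                    for (int j = jj; j < jj + T && j < N; j++)
                        A[i][j] = 2.0 * B[j][i];  /* transposed access: tiling helps */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        int candidates[] = {8, 16, 32, 64, 128}, best = 0;
        double best_t = 1e9;
        for (int k = 0; k < 5; k++) {
            double t = run_tiled(candidates[k]);
            printf("tile %3d: %.4f s\n", candidates[k], t);
            if (t < best_t) { best_t = t; best = candidates[k]; }
        }
        printf("selected tile size: %d\n", best);
        return 0;
    }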


Proceedings of the 2014 LLVM Compiler Infrastructure in HPC | 2014

Coordinating GPU threads for OpenMP 4.0 in LLVM

Carlo Bertolli; Samuel F. Antao; Alexandre E. Eichenberger; Kevin O'Brien; Zehra Sura; Arpith C. Jacob; Tong Chen; Olivier Sallenave

GPU devices are becoming critical building blocks of high-performance platforms for both performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP have been extended to support offloading code to such devices, and OpenMP compilers are faced with implementing device-targeting constructs efficiently. One main issue in implementing OpenMP on a GPU is supporting sequential and parallel regions efficiently, as GPUs are optimized only for highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads on an NVIDIA GPU that is both efficient and easily integrated into a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and compared their performance. We show that a scheme based on dynamic parallelism performs poorly compared to the inspector-executor schemes we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.
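
A host-side sketch of the coordination problem, assuming an OpenMP thread team standing in for a GPU thread block: one thread executes the sequential region while the others wait at a barrier, then all threads share the parallel region. This mimics the control-flow issue; it is not the paper's CUDA inspector-executor code.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double x[256] = {0};
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            /* "Sequential region": only the master thread runs it. */
            if (tid == 0) x[0] = 1.0;
            #pragma omp barrier
            /* "Parallel region": every thread takes a share of iterations. */
            #pragma omp for
            for (int i = 1; i < 256; i++) x[i] = x[0] + i;
        }
        printf("x[255] = %g\n", x[255]);
        return 0;
    }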


International Conference on Parallel Architectures and Compilation Techniques | 2009

Exploiting Parallelism with Dependence-Aware Scheduling

Xiaotong Zhuang; Alexandre E. Eichenberger; Yangchun Luo; Kevin O'Brien; Kathryn M. O'Brien

It is well known that a large fraction of applications cannot be parallelized at compile time due to unpredictable data dependences, such as indirect memory accesses and/or memory accesses guarded by data-dependent conditional statements. A significant body of prior work attempts to parallelize such applications using runtime data-dependence analysis and scheduling. Performance is highly sensitive to the ratio of the dependence-analysis overhead to the actual amount of parallelism available in the code; we have found that the overheads are often high and the available parallelism often low when evaluating applications on a modern multicore processor. We propose a novel software-based approach, dependence-aware scheduling, to parallelize loops with unknown data dependences. Unlike prior work, our main goal is to reduce the negative impact of dependence computation, so that when there is no opportunity for speedup the code still runs without much slowdown; when there is an opportunity, dependence-aware scheduling can yield very impressive speedups. Our results indicate that dependence-aware scheduling greatly improves performance, with speedups of up to 4x, for a number of computation-intensive applications. Furthermore, the results show negligible slowdown in a stress test where parallelism is continuously detected but never exploited.
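
A minimal inspector-executor sketch of the idea, assuming a single indirect-write loop: a cheap runtime pass checks whether any two iterations write the same element, and the loop runs in parallel only when none do; the conflict test here is deliberately simpler than the paper's dependence computation.

    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        double a[N]; int idx[N]; char seen[N] = {0}; int conflict = 0;
        for (int i = 0; i < N; i++) { a[i] = i; idx[i] = (i * 17) % N; }

        /* Inspector: detect whether two iterations write the same slot. */
        for (int i = 0; i < N && !conflict; i++) {
            if (seen[idx[i]]) conflict = 1;
            seen[idx[i]] = 1;
        }

        if (!conflict) {
            /* Executor: proven independent, so run in parallel. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++) a[idx[i]] += 1.0;
        } else {
            /* Fallback: run sequentially, paying only the inspector cost. */
            for (int i = 0; i < N; i++) a[idx[i]] += 1.0;
        }
        printf("conflict=%d a[0]=%g\n", conflict, a[0]);
        return 0;
    }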


Computing Frontiers | 2015

Data access optimization in a processing-in-memory system

Zehra Sura; Arpith C. Jacob; Tong Chen; Bryan S. Rosenburg; Olivier Sallenave; Carlo Bertolli; Samuel F. Antao; José R. Brunheroto; Yoonho Park; Kevin O'Brien; Ravi Nair

The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated into a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers, so performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data showing how this approach improves the performance of a set of representative benchmarks important in high-performance computing applications, achieving high performance together with power efficiency on the AMC architecture.
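
A scalar sketch of one latency-hiding tactic a compiler can apply on a cache-less, in-order lane: the load for iteration i+1 is issued before the compute that consumes iteration i's data, so memory time overlaps ALU time. The AMC compiler would do this with vector registers; plain doubles stand in here.

    #include <stdio.h>

    static void scale(const double *in, double *out, int n, double s) {
        double cur = in[0];
        for (int i = 0; i < n - 1; i++) {
            double nxt = in[i + 1];   /* issue the next load early */
            out[i] = cur * s;         /* compute with the already-loaded value */
            cur = nxt;
        }
        out[n - 1] = cur * s;
    }

    int main(void) {
        double in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
        scale(in, out, 8, 10.0);
        printf("%g %g\n", out[0], out[7]);   /* prints: 10 80 */
        return 0;
    }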


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Performance analysis of OpenMP on a GPU using a CORAL proxy application

Gheorghe-Teodor Bercea; Carlo Bertolli; Samuel F. Antao; Arpith C. Jacob; Alexandre E. Eichenberger; Tong Chen; Zehra Sura; Hyojin Sung; Georgios Rokos; David Appelhans; Kevin O'Brien

OpenMP provides high-level parallel abstractions for programming heterogeneous systems based on accelerator technology. An active area of research is characterizing the performance that can be expected from even the simplest combinations of directives, and how they compare to versions manually implemented and tuned for a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For the analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite. NVIDIA provides CUDA as the native programming model for its GPUs, so we compare the performance of an OpenMP 4.0 version of LULESH, obtained from a pre-existing OpenMP implementation, with a functionally equivalent CUDA implementation. Alongside the performance analysis we present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the performance characteristics of our application, we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained with our OpenMP compilation toolchain show performance within 10% of native CUDA C/C++ for application kernels with low register counts.
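
For reference, a minimal runnable example of the combined OpenMP 4.0 offloading directive whose code synthesis the paper analyzes; with no device available, a conforming implementation falls back to host execution. The kernel itself is an arbitrary illustration, not taken from LULESH.

    #include <stdio.h>

    int main(void) {
        const int n = 1 << 20;
        static double x[1 << 20];
        /* Combined construct: offload, create teams, distribute the
           iteration space, and parallelize within each team. */
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] = 2.0 * i;
        printf("x[10] = %g\n", x[10]);
        return 0;
    }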


International Workshop on OpenMP | 2007

Supporting OpenMP on Cell

Kevin O'Brien; Kathryn M. O'Brien; Zehra Sura; Tong Chen; Tao Zhang

The Cell processor is a heterogeneous multi-core processor with one Power Processing Engine (PPE) core and eight Synergistic Processing Engine (SPE) cores. Each SPE has a small (256 KB), directly accessible local memory and can access system memory through DMA operations. Cell programming is complicated both by the need to explicitly manage DMA data transfers for SPE computation and by the multiple layers of parallelism provided by the architecture: heterogeneous cores, multiple SPE cores, multithreading, SIMD units, and multiple instruction issue. There is a significant amount of ongoing research into programming models and tools that attempt to make it easy to exploit the computational power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. OpenMP is a widely used API for parallel programming; supporting it is attractive because programmers can continue using a familiar programming model and existing code can be reused. We base our work on IBM's XL compiler, which already supports OpenMP on AIX multi-processor systems built with Power processors. We developed new components in the XL compiler and a new runtime library for Cell OpenMP that uses the Cell SDK libraries to target specific features of the new hardware platform. To describe the design of our Cell OpenMP implementation, we focus on three major issues in our system: 1) how to use the heterogeneous cores and synchronization support in the Cell to optimize OpenMP threads; 2) how to generate thread code targeting the different instruction sets of the PPE and SPE from a compiler that takes single-source input; and 3) how to implement the OpenMP memory model on the Cell memory system. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. We also observe detailed runtime event sequences using the visualization tool Paraver, and we use the insight into actual thread and synchronization behavior to direct further optimizations.
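
A small sketch of the memory-model issue (point 3), using the standard OpenMP flush-based producer-consumer pattern: on Cell, each flush of shared data performed on an SPE must be mapped by the runtime to DMA write-backs and refreshes between the SPE's local store and system memory.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int ready = 0, value = 0;
        #pragma omp parallel num_threads(2) shared(ready, value)
        {
            if (omp_get_thread_num() == 0) {   /* producer */
                value = 42;
                #pragma omp flush(value)       /* on an SPE: DMA local copy back */
                ready = 1;
                #pragma omp flush(ready)
            } else {                           /* consumer */
                int r = 0;
                while (!r) {
                    #pragma omp flush(ready)   /* on an SPE: refresh local copy */
                    r = ready;
                }
                #pragma omp flush(value)
                printf("value = %d\n", value);
            }
        }
        return 0;
    }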
