
Publications


Featured research published by John Kevin Patrick O'Brien.


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.


Languages and Compilers for Parallel Computing | 2006

Optimizing the use of static buffers for DMA on a CELL chip

Tong Chen; Zehra Sura; Kathryn M. O'Brien; John Kevin Patrick O'Brien

The CELL architecture has one Power Processor Element (PPE) core, and eight Synergistic Processor Element (SPE) cores that have a distinct instruction set architecture of their own. The PPE core accesses memory via a traditional caching mechanism, but each SPE core can only access memory via a small 256 KB software-controlled local store. The PPE cache and SPE local stores are connected to each other and main memory via a high bandwidth bus. Software is responsible for all data transfers to and from the SPE local stores. To hide the high latency of DMA transfers, data may be prefetched into SPE local stores using loop blocking transformations and static buffers. We find that the performance of an application can vary depending on the size of the buffers used, and whether a single-, double-, or triple-buffer scheme is used. Constrained by the limited space available for data buffers in the SPE local store, we want to choose the optimal buffering scheme for a given space budget. Also, we want to be able to determine the optimal buffer size for a given scheme, such that using a larger buffer size results in negligible performance improvement. We develop a model to automatically infer these parameters for static buffering, taking into account the DMA latency and transfer rates, and the amount of computation in the application loop being targeted. We test the accuracy of our prediction model using a research prototype compiler developed on top of the IBM XL compiler infrastructure.


International Conference on Supercomputing | 2009

DBDB: Optimizing DMA transfer for the Cell BE architecture

Tao Liu; Haibo Lin; Tong Chen; John Kevin Patrick O'Brien; Ling Shao

In heterogeneous multi-core systems, such as the Cell BE or certain embedded systems, the accelerator core has its own fast local memory without hardware-supported coherence. It is software's responsibility to dynamically transfer the working set when the total data set is too large to fit in the local memory. The data can be transferred through a software-controlled cache which maintains correctness and exploits reuse among references, especially when complicated aliasing or data dependence exists. However, the software-controlled cache introduces the extra overhead of cache lookup. In this paper we present the design and implementation of a Direct Blocking Data Buffer (DBDB) which combines compiler analysis and runtime management to optimize local memory utilization. We use compile-time analysis to identify regular references in a loop body, block the innermost loop according to the access patterns and available local memory space, insert DMA operations for the blocked loop, and substitute references to local buffers. The runtime is responsible for allocating local memory for DBDB, especially for disambiguating aliased memory accesses which could not be resolved at compile time. We further optimize noncontiguous references by taking advantage of the DMA-list feature provided by the Cell BE. A practical performance model is presented to guide the DMA transfer scheme selection among single-DMA, multi-DMA and DMA-list. We have implemented DBDB in the IBM XL C/C++ for Multicore Acceleration for Linux, and have conducted experiments with selected test cases from the NAS OpenMP and SPEC benchmarks. The results show that our method performs well compared with the traditional software cache approach. We have observed speedups of up to 5.3x, and of 4x on average.


International Conference on Parallel Architectures and Compilation Techniques | 2010

DMATiler: Revisiting loop tiling for direct memory access

Haibo Lin; Tao Liu; Huoding Li; Tong Chen; Lakshminarayanan Renganarayana; John Kevin Patrick O'Brien; Ling Shao

In this paper we present the design and implementation of a DMATiler which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache-model-based loop tiling optimizations, the compiler approximates runtime cache misses as the number of distinct cache lines touched by a loop nest. In contrast, the DMATiler has full control of the addresses, sizes, and sequences of data transfers. DMATiler uses a simplified DMA performance model to formulate the cost model for DMA-tiled loop nests, then solves it using a custom gradient descent algorithm with heuristics guided by DMA characteristics. Given a loop nest, DMATiler uses loop interchange to make the loop order friendlier for data movement. Moreover, DMATiler applies compressed data buffers and advanced DMA commands to further optimize data transfers. We have implemented the DMATiler in the IBM XL C/C++ for Multicore Acceleration for Linux, and have conducted experiments with a set of loop nest benchmarks. The results show that DMATiler is much more efficient than software-controlled cache (average speedup of 9.8x) and single-level loop blocking (average speedup of 6.2x) on the Cell BE processor.


IBM Systems Journal | 2006

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Alexandre E. Eichenberger; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; Janice C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; P. Zhao; Michael Karl Gschwind; Roch Georges Archambault; Y. Gao; R. Koo


Archive | 1995

Executing speculative parallel instruction threads with forking and inter-thread communication

Pradeep Dubey; Charles Barton; Chiao-Mei Chuang; Linh Lam; John Kevin Patrick O'Brien; Kathryn M. O'Brien


Archive | 2002

Method and system for transparent dynamic optimization in a multiprocessing environment

Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener


Archive | 2002

Method and system for efficient emulation of multiprocessor memory consistency

Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener


Archive | 2004

Method and system for exploiting parallelism on a heterogeneous multiprocessor computer system

John Kevin Patrick O'Brien; Kathryn M. O'Brien


Archive | 2007

System and Method for Domain Stretching for an Advanced Dual-Representation Polyhedral Loop Transformation Framework

Alexandre E. Eichenberger; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Nicolas T. Vasilache
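Several of the abstracts above (the static-buffers paper and DMATiler) revolve around the same idea: model a DMA transfer as a fixed setup latency plus size over bandwidth, then choose a buffering scheme and buffer size that best hides transfer time behind compute. A minimal sketch of that kind of cost model follows; the formulas, units, and all constants are illustrative assumptions, not the papers' actual models.

```python
def transfer_time(nbytes, latency=1.0, rate=8.0):
    """Toy DMA cost: fixed setup latency plus size / bandwidth.
    Constants and units are made up for illustration."""
    return latency + nbytes / rate

def loop_time(total_bytes, buf_bytes, compute_per_byte=0.25, scheme="double"):
    """Estimated time to stream total_bytes through a local-store buffer.

    Single buffering serializes each block's transfer with the compute on
    it; double buffering overlaps the next block's transfer with the
    current block's compute, so the steady-state cost per block is
    max(transfer, compute) and only the first transfer is fully exposed."""
    blocks = total_bytes / buf_bytes
    t = transfer_time(buf_bytes)          # time to move one block
    c = buf_bytes * compute_per_byte      # time to compute on one block
    if scheme == "single":
        return blocks * (t + c)
    return t + blocks * max(t, c)         # double buffering

print(loop_time(4096, 64, scheme="single"))  # 1600.0
print(loop_time(4096, 64, scheme="double"))  # 1033.0
```

Sweeping `buf_bytes` in a model like this exhibits the diminishing returns the static-buffers paper exploits: once per-block compute dominates per-block transfer, enlarging the buffer yields negligible improvement, so the smallest adequate buffer can be chosen within the local-store budget.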
