Is this you? Create Your Porfile

Joseph B. Manzano

Pacific Northwest National Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Joseph B. Manzano is active.

Explore More

Publication

Featured researches published by Joseph B. Manzano.

international conference on parallel and distributed systems | 2008

A Quantitative Study of the On-Chip Network and Memory Hierarchy Design for Many-Core Processor

Xu Wang; Ge Gan; Joseph B. Manzano; Dongrui Fan; Shuxu Guo

In this paper, we will study the on-chip network and memory hierarchy design of the Godson-T - a homogeneous many-core processor. Godson-T has 64 cores (with private L1 cache), and 16 global L2 cache banks. All these on-chip units are connected by a 2D 8 × 8 mesh network. Our study reveals that:(a) Global on-chip L2 cache can effectively alleviate the memory pressure caused by the data-thirsty on-chip computing engines. However, its potential is still limited by both the off-chip and the in-chip bandwidth, especially when increasing the number of active threads.(b) On-chip traffic congestion is largely caused by the intensive memory access requests issued from the on-chipcores. Therefore, the design of the on-chip network must consider the available performance of the datapath that connects the processor to the main memory. (c) In theory, different applications have different communication patterns (Berkeleys view). However, the applications runtime communication pattern is only determined by the design of the underlying memory hierarchy and on-chip interconnection. These conclusions are generally applicable to a wide variety of many-core processors with similar design.

european conference on parallel processing | 2010

A study of a software cache implementation of the OpenMP memory model for multicore and manycore architectures

Chen Chen; Joseph B. Manzano; Ge Gan; Guang R. Gao; Vivek Sarkar

This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM CELL architecture in particular. In this paper, we propose an instantiation of the OpenMP memory model with the following advantages: (1) The proposed instantiation prohibits undefined values that may cause problems of safety, security, programming and debugging. (2) The proposed instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable. (3) The proposed instantiation avoids the ambiguity of the original memory model definition proposed on the OpenMP Specification 3.0. We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation results in nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.

international workshop on openmp | 2009

Tile Reduction: The First Step towards Tile Aware Parallelization in OpenMP

Ge Gan; Xu Wang; Joseph B. Manzano; Guang R. Gao

Tiling is widely used by compilers and programmer to optimize scientific and engineering code for better performance. Many parallel programming languages support tile/tiling directly through first-class language constructs or library routines. However, the current OpenMP programming language is tile oblivious , although it is the de facto standard for writing parallel programs on shared memory systems. In this paper, we introduce tile aware parallelization into OpenMP. We propose tile reduction , an OpenMP tile aware parallelization technique that allows reduction to be performed on multi-dimensional arrays. The paper has three contributions: (a) it is the first paper that proposes and discusses tile aware parallelization in OpenMP. We argue that, it is not only necessary but also possible to have tile aware parallelization in OpenMP; (b) the paper introduces the methods used to implement tile reduction, including the required OpenMP API extension and the associated code generation techniques; (c) we have applied tile reduction on a set of benchmarks. The experimental results show that tile reduction can make parallelization more natural and flexible. It not only can expose more parallelism in a program, but also can improve its data locality.

european conference on parallel processing | 2009

Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor

Ge Gan; Xu Wang; Joseph B. Manzano; Guang R. Gao

Programming a multicore processor is difficult. It is even more difficult if the processor has software-managed memory hierarchy, e.g. the IBM Cyclops-64 (C64). A widely accepted parallel programming solution for multicore processor is OpenMP. Currently, all OpenMP directives are only used to decompose computation code (such as loop iterations, tasks, code sections, etc.). None of them can be used to control data movement, which is crucial for the C64 performance. In this paper, we propose a technique called tile percolation . This method provides the programmer with a set of OpenMP pragma directives. The programmer can use these directives to annotate their program to specify where and how to perform data movement. The compiler will then generate the required code accordingly. Our method is a semi-automatic code generation approach intended to simplify a programmers work. The paper provides (a) an exploration of the possibility of developing pragma directives for semi-automatic data movement code generation in OpenMP; (b) an introduction of techniques used to implement tile percolation including the programming API, the code generation in compiler, and the required runtime support routines; (c) and an evaluation of tile percolation with a set of benchmarks. Our experimental results show that tile percolation can make the OpenMP programs run on the C64 chip more efficiently.

symposium on code generation and optimization | 2015

Locality aware concurrent start for stencil applications

Sunil Shrestha; Guang R. Gao; Joseph B. Manzano; Andres Marquez; John Feo

Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many core designs being the norm, these optimization techniques might not be able to fully exploit locality (both spatial and temporal) on multiple levels of the memory hierarchy without compromising parallelism. It is no longer true that the machine can be seen as a homogeneous collection of nodes with caches, main memory and an interconnect network. New architectural designs exhibit complex grouping of nodes, cores, threads, caches and memory connected by an ever evolving network-on-chip design. These new designs may benefit greatly from carefully crafted schedules and groupings that encourage parallel actors (i.e. threads, cores or nodes) to be aware of the computational history of other actors in close proximity. In this paper, we provide an efficient tiling technique that allows hierarchical concurrent start for memory hierarchy aware tile groups. Each execution schedule and tile shape exploit the available parallelism, load balance and locality present in the given applications. We demonstrate our technique on the Intel Xeon Phi architecture with selected and representative stencil kernels. We show improvement ranging from 5.58% to 31.17% over existing state-of-the-art techniques.

languages and compilers for parallel computing | 2014

Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading

Sunil Shrestha; Joseph B. Manzano; Andres Marquez; John Feo; Guang R. Gao

In this paper, we have developed a novel methodology that takes into consideration multithreaded many-core designs to better utilize memory/processing resources and improve memory residence on tileable applications. It takes advantage of polyhedral analysis and transformation in the form of PLUTO [6], combined with a highly optimized fine grain tile runtime to exploit parallelism at all levels. The main contributions of this paper include the introduction of multi-hierarchical tiling techniques that increases intra tile parallelism; and a data-flow inspired runtime library that allows the expression of parallel tiles with an efficient synchronization registry. Our current implementation shows performance improvements on an Intel Xeon Phi board up to 32.25 % against instances produced by state-of-the-art compiler frameworks for selected stencil applications.

IEEE Transactions on Parallel and Distributed Systems | 2012

Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer

Oreste Villa; Antonino Tumeo; Simone Secchi; Joseph B. Manzano

Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2, and XMT, appear to address irregular application requirements better than commodity clusters. However, the research on massively multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy, and customization. At the same time, Shared Memory MultiProcessors (SMPs) with multicore processors have become an attractive platform to simulate large-scale systems. This paper introduces a cycle-level simulator of the massively multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques implemented to obtain high-simulation speed while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at runtime and includes a parametric network and memory model that takes into account contention and hot spotting. On a modern 48-core SMP host, the proposed infrastructure simulates a large set of irregular applications 500 to 2,000 times slower than real time when compared to a 128-processor XMT, with an accuracy error under 10 percent. Emulation is only from 25 to 200 times slower than real time. The paper also presents a case study, where the simulation infrastructure is used to identify bottlenecks in the current XMT architecture and to estimate the performance scaling of a possible multicore design with next generation memory and network interconnect.

international workshop on openmp | 2007

Toward an Automatic Code Layout Methodology

Joseph B. Manzano; Ziang Hu; Yi Jiang; Ge Gan; Hyo-jung Song; Jung-Gyu Park

This paper presents a study on an automatic code layout methodology for multi core architectures with explicit memory hierarchies. Code layout techniques are employed to run large programs on such systems. This study shows the effects of different buffer schemes and replacement policies on a set of benchmarks. These schemes are considered to be more flexible than current approaches. Moreover, these schemes can be used as a foundation to build frameworks for high level parallel programming models in such multi core architectures. The current work has been implemented in IBMs Cell Broadband Engine, but it can be extended to similar architectures.

computing frontiers | 2015

Optimizing irregular applications for energy and performance on the Tilera many-core architecture

Daniel G. Chavarría-Miranda; Ajay Panyala; Mahantesh Halappanavar; Joseph B. Manzano; Antonino Tumeo

Optimizing applications simultaneously for energy and performance is a complex problem. High performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy efficient, many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications -- the Louvain method for community detection (Grappolo), and high-performance conjugate gradient (HPCCG) -- on the Tilera many-core system. We have significantly extended MITs OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. Using auto-tuning, we demonstrate whole-node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.

application-specific systems, architectures, and processors | 2015

Power and performance trade-offs for Space Time Adaptive Processing

Nitin A. Gawande; Joseph B. Manzano; Antonino Tumeo; Nathan R. Tallent; Darren J. Kerbyson; Adolfy Hoisie

Power efficiency - performance relative to power - is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAPs computationally intensive kernels across the two hardware testbeds. We discuss an efficient parallel implementation for the Haswell CPU architecture. We also show the impact and trade-offs of GPU optimization techniques. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. Finally, we show that a balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.

Explore More