Michael Boyer | Researchain

Publication

Featured research published by Michael Boyer.

IEEE International Symposium on Workload Characterization | 2009

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Sang-Ha Lee; Kevin Skadron

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

Journal of Parallel and Distributed Computing | 2008

A performance study of general-purpose applications on graphics processors using CUDA

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Kevin Skadron

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.

IEEE International Symposium on Workload Characterization | 2010

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

Shuai Che; Jeremy W. Sheaffer; Michael Boyer; Lukasz G. Szafaryn; Liang Wang; Kevin Skadron

The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with Parsec to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and Parsec are complementary, capturing different aspects of certain performance metrics.

International Parallel and Distributed Processing Symposium | 2009

Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors

Michael Boyer; David Tarjan; Scott T. Acton; Kevin Skadron

The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application—detection and tracking of white blood cells in video microscopy—can be accelerated by 200× using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU.

Design Automation Conference | 2008

Federation: repurposing scalar cores for out-of-order instruction issue

David Tarjan; Michael Boyer; Kevin Skadron

Future SoCs will contain multiple cores. For workloads with significant parallelism, prior work has shown the benefit of many small, multi-threaded, scalar cores. For workloads that require better single-thread performance, a dedicated, larger core can help but comes at a large opportunity cost in the number of scalar cores that could be provisioned instead. This paper proposes a way to repurpose a pair of scalar cores into a 2-way out-of-order issue core with minimal area overhead. Federating scalar cores in this way nevertheless achieves comparable performance to a dedicated out-of-order core and dissipates less power as well.

Computing Frontiers | 2013

Load balancing in a changing world: dealing with heterogeneity and performance variability

Michael Boyer; Kevin Skadron; Shuai Che; Nuwan Jayasena

Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL™ applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional.
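
To make the idea of run-time work partitioning concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes one simple policy: measure each device's recent throughput and split the remaining work in proportion to it. The function name `proportional_split` and the `rates` parameter are hypothetical.

```python
def proportional_split(total_items, rates):
    """Split total_items across devices in proportion to measured rates.

    rates[i] is the most recently observed throughput of device i
    (items per second). This proportional policy is an assumption made
    for illustration, not the paper's method.
    """
    total_rate = sum(rates)
    if total_rate <= 0:
        raise ValueError("at least one device must have a positive rate")
    # Integer share for each device, rounded down.
    shares = [int(total_items * r / total_rate) for r in rates]
    # Hand the rounding remainder to the fastest device.
    fastest = max(range(len(rates)), key=lambda i: rates[i])
    shares[fastest] += total_items - sum(shares)
    return shares


if __name__ == "__main__":
    print(proportional_split(100, [3.0, 1.0]))  # GPU three times faster than CPU
    print(proportional_split(10, [1.0, 0.0]))   # second device non-functional
```

Calling the function again with fresh measurements after each batch lets the split follow changes in device performance, and a device that reports a rate of zero simply receives no work.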

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Improving GPU Performance Prediction with Data Transfer Modeling

Michael Boyer; Jiayuan Meng; Kalyan Kumaran

Accelerators such as graphics processors (GPUs) have become increasingly popular for high performance scientific computing. Often, much effort is invested in creating and optimizing GPU code without any guaranteed performance benefit. To reduce this risk, performance models can be used to project a kernel's GPU performance potential before it is ported. However, raw GPU execution time is not the only consideration. The overhead of transferring data between the CPU and the GPU is also an important factor; for some applications, this overhead may even erase the performance benefits of GPU acceleration. To address this challenge, we propose a GPU performance modeling framework that predicts both kernel execution time and data transfer time. Our extensions to an existing GPU performance model include a data usage analyzer for a sequence of GPU kernels, to determine the amount of data that needs to be transferred, and a performance model of the PCIe bus, to determine how long the data transfer will take. We have tested our framework using a set of applications running on a production machine at Argonne National Laboratory. On average, our model predicts the data transfer overhead with an error of only 8%, and the inclusion of data transfer time reduces the error in the predicted GPU speedup from 255% to 9%.
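
The effect of transfer time on predicted speedup can be shown with a small Python sketch. It is not the framework from the paper. The linear latency-plus-bandwidth transfer model is an assumption, as are the function names `transfer_time` and `predicted_speedup`, and the numbers in the example are made up.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate the time of one CPU-GPU transfer with a linear model.

    Assumed model: a fixed per-transfer latency plus bytes / bandwidth.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + num_bytes / bandwidth_bytes_per_s


def predicted_speedup(cpu_time_s, kernel_time_s, transfer_time_s=0.0):
    """Predicted GPU speedup over the CPU, optionally counting transfers."""
    gpu_total = kernel_time_s + transfer_time_s
    if gpu_total <= 0:
        raise ValueError("total GPU time must be positive")
    return cpu_time_s / gpu_total


if __name__ == "__main__":
    # Illustrative inputs: 8 GB moved over a link sustaining 8 GB/s.
    t_xfer = transfer_time(8_000_000_000, 0.0, 8e9)
    print(predicted_speedup(10.0, 1.0))          # kernel time only
    print(predicted_speedup(10.0, 1.0, t_xfer))  # kernel plus transfer
```

With these made-up inputs, ignoring the transfer predicts twice the speedup obtained when it is counted, which is the kind of gap the abstract describes.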

ACM Transactions on Architecture and Code Optimization | 2010

Federation: Boosting per-thread performance of throughput-oriented manycore architectures

Michael Boyer; David Tarjan; Kevin Skadron

Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This maximizes throughput, but only with enough threads to keep hardware utilized. For applications or phases with more limited parallelism, we describe creating an out-of-order processor on-the-fly, by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.

Archive | 2011