Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Dana Schaa is active.

Publication


Featured research published by Dana Schaa.


International Conference on Parallel Architectures and Compilation Techniques | 2012

Multi2Sim: a simulation framework for CPU-GPU computing

Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli

Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.


IEEE Transactions on Parallel and Distributed Systems | 2011

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang; Dana Schaa; Perhaad Mistry; David R. Kaeli

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.


International Parallel and Distributed Processing Symposium | 2009

Exploring the multiple-GPU design space

Dana Schaa; David R. Kaeli

Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general-purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.


Architectural Support for Programming Languages and Operating Systems | 2012

Enabling task-level scheduling on heterogeneous platforms

Enqiang Sun; Dana Schaa; Richard J. Bagley; Norman Rubin; David R. Kaeli

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic task scheduling and data consistency when multiple devices appear in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for developing this extension is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible scheduling schemes. To demonstrate the value and utility of this extension, we have utilized an advanced OpenCL-based imaging toolkit called clSURF. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Using a state-of-the-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup of up to 72.4%. Our extension also achieves scalable performance gains on multiple heterogeneous GPU devices. The performance trade-offs of using the host CPU as an accelerator are also evaluated.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Data transformations enabling loop vectorization on multithreaded data parallel architectures

Byunghyun Jang; Perhaad Mistry; Dana Schaa; Rodrigo Dominguez; David R. Kaeli

Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.


IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2011

Accelerating an Imaging Spectroscopy Algorithm for Submerged Marine Environments Using Graphics Processing Units

James A. Goodman; David R. Kaeli; Dana Schaa

Remote sensing is utilized across a wide array of disciplines, including resource management, disaster relief planning, environmental assessment, and climate change impact analysis. The data volume and processing requirements associated with remote sensing are rapidly expanding as a result of the increasing number of satellite and airborne sensors, greater data accessibility, and expanded utilization of data intensive technologies such as imaging spectroscopy. However, due to the limited ability of current computing systems to gracefully scale with application requirements, particularly in the desktop level market, large amounts of data are currently underutilized or never explored. Computing limitations thus constrain our ability to efficiently and accurately address key science questions using remote sensing. The current evolution in general purpose computing on Graphics Processing Units (GPUs), an emerging technology that is redefining the field of high performance computing, facilitates significantly improved computing capabilities for current and future image analysis needs. We demonstrate the advantages of this technology by accelerating an imaging spectroscopy algorithm for submerged marine habitats using GPU computing. Results indicate that considerable improvement in performance can be achieved using a single GPU on a standard desktop computer. This technology has enormous potential for continued growth exploiting high performance computing, and provides the foundation for significantly enhanced remote sensing capabilities.


Heterogeneous Computing with OpenCL | 2013

Introduction to OpenCL

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter introduces OpenCL, the programming fabric that allows the parts of an application to be woven together to execute concurrently. It provides an introduction to the basics of using the OpenCL standard when developing parallel programs, describes the four abstraction models defined in the standard, and presents examples of OpenCL implementations to place the abstractions in context. OpenCL describes execution in terms of fine-grained work-items and can dispatch vast numbers of work-items on architectures with hardware support for fine-grained threading. While it is easy to have concerns about scalability, the hierarchical concurrency model implemented by OpenCL ensures that scalable execution can be achieved even while supporting a large number of work-items. Work-items within a workgroup have a special relationship with one another: they can perform barrier operations to synchronize, and they have access to a shared memory address space.


Architectural Support for Programming Languages and Operating Systems | 2013

Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Perhaad Mistry; Yash Ukidave; Dana Schaa; David R. Kaeli

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. We have seen a growing number of systems incorporating CPUs, Graphics Processors (GPUs), and Accelerated Processing Units (APUs, which combine a CPU and GPU on the same chip). This emerging class of platforms is now being targeted to accelerate applications where the host processor (typically a CPU) and compute device (typically a GPU) cooperate on a computation. In this scenario, the performance of the application is dependent not only on the processing power of the respective heterogeneous processors, but also on the efficient interaction and communication between them. To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that will benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on specific characteristics that can help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages. We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely-coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.


International Ultrasonics Symposium | 2011

Synthetic Aperture Beamformation using the GPU

Jens Hansen; Dana Schaa; Jørgen Arendt Jensen

A synthetic aperture ultrasound beamformer is implemented for a GPU using the OpenCL framework. The implementation supports beamformation of either RF signals or complex baseband signals. Transmit and receive apodization can be either parametric or dynamic using a fixed F-number, a reference, and a direction. Images can be formed using an arbitrary number of emissions and receive channels. Data can be read from Matlab or directly from memory, and the setup can be configured using Matlab. A large number of different setups have been investigated and the frame rate measured. A frame rate of 40 frames per second is obtained for full synthetic aperture imaging using 16 emissions and 64 receive channels for an image size of 512×512 pixels and 4000 complex 32-bit samples recorded at 40 MHz. This amounts to a speedup of more than a factor of 6 compared to a powerful workstation with two quad-core Xeon processors.


Symposium on Computer Architecture and High Performance Computing | 2007

Exploring Novel Parallelization Technologies for 3-D Imaging Applications

Diego Rivera; Dana Schaa; Micha Moffie; David R. Kaeli

Multi-dimensional imaging techniques involve the processing of high-resolution images commonly used in medical, civil, and remote-sensing applications. A barrier commonly encountered in this class of applications is the time required to carry out repetitive operations on large matrices. Partitioning these large datasets can help improve performance, and lends the data to more efficient parallel execution. In this paper we describe our experience exploring two novel parallelization technologies: 1) a graphics processing unit (GPU)-based approach that utilizes 128 cores on a single GPU accelerator card, and 2) a middleware approach for semi-automatic parallelization on a cluster of multiple multi-core processors. We investigate these two platforms and describe their strengths and limitations. In addition, we provide some guidance to the programmer on which platform to use when porting multi-dimensional imaging applications. Using a 3-D application taken from a clinical image reconstruction algorithm, we demonstrate the degree of speedup we can obtain from these two approaches.

Collaboration


Dive into Dana Schaa's collaborations.

Top Co-Authors

Lee Howes
Advanced Micro Devices

Rafael Ubal
Northeastern University

Yash Ukidave
Northeastern University