
Publication


Featured research published by Norman Rubin.


Architectural Support for Programming Languages and Operating Systems | 2012

Enabling task-level scheduling on heterogeneous platforms

Enqiang Sun; Dana Schaa; Richard J. Bagley; Norman Rubin; David R. Kaeli

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic task scheduling and data consistency when multiple devices appear in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for developing this extension is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible scheduling schemes. To demonstrate the value and utility of this extension, we have utilized an advanced OpenCL-based imaging toolkit called clSURF. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Using a state-of-the-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup of up to 72.4%. Our extension also achieves scalable performance gains on multiple heterogeneous GPU devices. The performance trade-offs of using the host CPU as an accelerator are also evaluated.
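The core scheduling idea, placing each task on whichever device can start it soonest, can be illustrated with a toy greedy scheduler. This is a hypothetical sketch, not the extension's actual API: the `schedule` function, device names, and speed factors below are invented for illustration.

```python
import heapq

def schedule(tasks, devices):
    """Greedily place each (name, cost) task on the device that frees
    up first; a device's speed factor scales how fast it finishes cost.
    Hypothetical model, not the paper's task queueing extension."""
    heap = [(0.0, name) for name in sorted(devices)]  # (time free, device)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in tasks:
        free_at, dev = heapq.heappop(heap)
        finish = free_at + cost / devices[dev]
        assignment[task] = (dev, finish)
        heapq.heappush(heap, (finish, dev))
    makespan = max(finish for _, finish in assignment.values())
    return assignment, makespan
```

A real implementation would also track which buffers each device holds, since data consistency across devices is the other half of the problem the extension addresses.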


General Purpose Processing on Graphics Processing Units | 2011

Analyzing program flow within a many-kernel OpenCL application

Perhaad Mistry; Chris Gregg; Norman Rubin; David R. Kaeli; Kim M. Hazelwood

Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities to a range of applications. Typical applications possess multiple components that can be parallelized; developers need to be equipped with proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) Computer Vision algorithm written in OpenCL. Our profiling framework is developed using built-in OpenCL API function calls, without the need for an external profiler. We show that we can identify performance bottlenecks and issues present in individual components on different hardware platforms. We demonstrate that run-time profiling, as defined by the OpenCL specification, can provide an application developer with a fine-grained look at performance, and that this information can be used to tailor performance improvements for specific platforms.
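The approach of collecting per-kernel timings through the API itself, rather than an external profiler, can be sketched as a small aggregator. `KernelProfiler` and its methods are invented for illustration; in real OpenCL one would read the CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END timestamps from event objects.

```python
from collections import defaultdict

class KernelProfiler:
    """Aggregates per-kernel timings and ranks hotspots.
    Hypothetical sketch; real data would come from OpenCL events."""
    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, kernel, seconds):
        # one sample per kernel launch
        self.samples[kernel].append(seconds)

    def hotspots(self):
        # total time per kernel, largest first
        totals = {k: sum(v) for k, v in self.samples.items()}
        return sorted(totals.items(), key=lambda kv: -kv[1])
```

Ranking components by total time is exactly the kind of fine-grained view the paper uses to find bottlenecks in the SURF pipeline.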


General Purpose Processing on Graphics Processing Units | 2011

A new method for GPU based irregular reductions and its application to k-means clustering

Balaji Dhanasekaran; Norman Rubin

A frequently used method of clustering is a technique called k-means clustering. The k-means algorithm consists of two steps: a map step, which is simple to execute on a GPU, and a reduce step, which is more problematic. Previous researchers have used a hybrid approach in which the map step is computed on the GPU and the reduce step is performed on the CPU. In this work, we present a new algorithm for irregular reductions and apply it to k-means such that the GPU executes both the map and reduce steps. We provide experimental comparisons using OpenCL. On an ATI Radeon® HD 5870, our scheme is 3.2 times faster than the hybrid scheme for k = 10, on average 1.5 times faster for k = 100, and on average equal for k = 400 (best speedup was 3.5 times). In addition, we compare the GPU code with the standard OpenMP benchmark, MineBench. In that implementation, both the map and reduce steps are computed on the CPU. For large data sizes, the new GPU scheme shows great promise, with performance up to 35 times faster than MineBench on a four core Intel i7 CPU.
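The map/reduce split the abstract describes can be seen in one sequential k-means iteration; the reduce is "irregular" because cluster sizes are data-dependent. This sketch only shows the structure of the two steps, not the paper's GPU reduction algorithm.

```python
def kmeans_step(points, centroids):
    """One k-means iteration. Map: nearest-centroid assignment,
    independent per point. Reduce: per-cluster sums, irregular
    because cluster sizes depend on the data."""
    k = len(centroids)
    # map step: label each point with its nearest centroid
    assign = [
        min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
        for pt in points
    ]
    # reduce step: accumulate per-cluster sums and counts
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for pt, c in zip(points, assign):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += pt[d]
    new = [
        tuple(s / counts[c] for s in sums[c]) if counts[c] else tuple(centroids[c])
        for c in range(k)
    ]
    return assign, new
```

The serial accumulation loop is the part that is hard on a GPU: thousands of threads updating a handful of cluster accumulators contend for the same memory locations, which is why earlier work pushed this step back to the CPU.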


International Parallel and Distributed Processing Symposium | 2014

A Case for a Flexible Scalar Unit in SIMT Architecture

Yi Yang; Ping Xiang; Michael Mantor; Norman Rubin; Lisa R. Hsu; Qunfeng Dong; Huiyang Zhou

The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. However, because of SIMT-style processing, an instruction will be executed in every thread even if the operands are identical for all the threads. To overcome this inefficiency, AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, both the SIMT unit and the scalar unit share a single SIMT-style instruction stream. Depending on its type, an instruction is issued to either a scalar or a SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program to be executed by the scalar unit is referred to as a scalar program and its purpose is to assist SIMT-unit execution. The scalar programs are either generated from SIMT programs automatically by the compiler or manually developed by expert developers. We make a case for our proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that significant performance gains can be achieved using our proposed approaches compared to state-of-the-art SIMT-style processing.
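The GCN-style issue decision the abstract starts from, sending an instruction to the scalar unit when every lane would compute the same thing, can be modeled in a few lines. The `issue` function and its operand representation are hypothetical, for illustration only.

```python
def issue(lanes):
    """Model the scalar-vs-vector issue decision: if every SIMT lane
    sees identical operands, one scalar execution suffices and the
    result can be broadcast. Hypothetical model of the GCN idea."""
    first = lanes[0]
    return "scalar" if all(ops == first for ops in lanes) else "vector"
```

The paper's contribution goes further than this shared-stream model: its flexible scalar unit can also run a *separate* scalar program (for prefetching or divergence elimination) rather than only filtering the SIMT stream.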


International Symposium on Performance Analysis of Systems and Software | 2013

Characterizing scalar opportunities in GPGPU applications

Zhongliang Chen; David R. Kaeli; Norman Rubin

General Purpose computing with Graphics Processing Units (GPGPU) has gained widespread adoption in both the high performance and general purpose communities. In most GPU computation, execution exploits a Single Instruction Multiple Data (SIMD) model. However, GPU execution typically pays little attention to whether the data operated upon by the SIMD units is the same or different. When SIMD computation operates on multiple copies of the same data, redundant computations are generated. This provides an opportunity to improve efficiency by broadcasting the results of a single computation to multiple outputs. To better serve those operations, modern GPUs are armed with scalar units. SIMD instructions that operate on the same input operands can then be directed to execute on scalar units, requiring only a single copy of the data and leaving the data-parallel SIMD units available to execute non-scalar operations. In this paper, we first characterize a number of CUDA programs taken from the NVIDIA SDK to quantify the potential for scalar execution. We observe that 38% of static SIMD instructions are recognized to operate on the same data by the compiler, and their dynamic occurrences account for 34% of the total dynamic instruction execution. We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture. Our results show that scalar units are utilized 51% of the time during execution, though their use places additional pressure on the interconnect and memory.
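The dynamic measurement the paper reports (the share of executed instructions whose operands are lane-uniform) amounts to a simple count over an instruction trace. This sketch assumes a hypothetical trace format, one tuple of per-lane operand values per dynamic instruction; it is not the paper's characterization infrastructure.

```python
def scalar_fraction(trace):
    """trace: list of per-instruction operand tuples, one entry per
    SIMD lane. Returns the fraction of dynamic instructions whose
    operands are identical in every lane (scalar-unit candidates)."""
    uniform = sum(1 for lanes in trace if len(set(lanes)) == 1)
    return uniform / len(trace)
```

A compiler-level (static) count, like the paper's 38% figure, would instead classify instructions by provable uniformity, e.g. values derived only from kernel arguments and block-level indices.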


Architectural Support for Programming Languages and Operating Systems | 2013

Accelerating simulation of agent-based models on heterogeneous architectures

Jin Wang; Norman Rubin; Haicheng Wu; Sudhakar Yalamanchili

The wide usage of GPGPU programming models and compiler techniques enables the optimization of data-parallel programs on commodity GPUs. However, mapping GPGPU applications running on discrete parts to emerging integrated heterogeneous architectures, such as the AMD Fusion APU and Intel Sandy/Ivy Bridge, with the CPU and the GPU on the same die, has not been well studied. Classic time-step simulation applications represented by agent-based models have an intrinsic parallel structure that is a good fit for GPGPU architectures. However, when mapping these applications directly to the integrated GPUs, performance may degrade due to fewer compute units and lower clock speeds. This paper proposes an optimization to the GPGPU implementation of the agent-based model and illustrates it with a traffic simulation example. The optimization adapts the algorithm by moving part of the workload to the CPU to leverage the integrated architecture and the on-chip memory bus, which is faster than the PCIe bus that connects the discrete GPU and the host. Experiments on a discrete AMD Radeon GPU and an AMD Fusion APU demonstrate that the optimization can achieve 1.08--2.71x performance speedup on the integrated architecture over the discrete platform.
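The decision of how much work to shift to the CPU can be framed as balancing the two processors so they finish together. This is a toy load-balancing model under stated assumptions (linear cost per agent, a single transfer penalty), not the paper's actual partitioning method; all names and rates are hypothetical.

```python
def best_split(n_agents, cpu_rate, gpu_rate, transfer_cost):
    """Pick how many agents to simulate on the CPU so the slower side
    of the CPU/GPU split is minimized. On an integrated APU the
    transfer_cost term is near zero, which shifts the optimum."""
    best = (float("inf"), 0)  # (time, agents on CPU)
    for n_cpu in range(n_agents + 1):
        t_cpu = n_cpu / cpu_rate
        t_gpu = (n_agents - n_cpu) / gpu_rate + transfer_cost
        best = min(best, (max(t_cpu, t_gpu), n_cpu))
    return best
```

With transfer_cost large (a discrete GPU over PCIe), the model keeps more work on one side; with it near zero (the APU case), a finer split pays off, which matches the intuition behind the paper's optimization.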


International Conference on Computer Graphics and Interactive Techniques | 2010

ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs

Budirijanto Purnomo; Norman Rubin; Mike Houston

Modern GPUs have been shown to be highly efficient machines for data-parallel applications such as graphics, image, video processing, or physical simulation applications. For example, a single ATI Radeon™ HD 5870 GPU has a theoretical peak of 2.72 teraflops (10^12 floating-point operations per second) with a video memory bandwidth of 153.6 GB/s. While it is not difficult to port CPU algorithms to run on GPUs, it is extremely challenging to optimize the algorithms to achieve teraflops performance on GPUs. Only a select few expert engineers with the application domain expertise, a deep understanding of the modern GPU architecture, and an intimate knowledge of shader compiler optimization can program GPUs close to their optimal capabilities. Many developers settle for severalfold improvements rather than one or several orders of magnitude of acceleration compared to their optimized CPU implementations.
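The two headline figures in the abstract imply how arithmetic-heavy a kernel must be before the compute peak, rather than memory bandwidth, limits it. The abstract does not discuss this, but a simple roofline-style calculation relates the numbers:

```python
def ridge_point(peak_gflops, bandwidth_gb_s):
    """FLOPs per byte a kernel needs before it stops being
    memory-bound (simple roofline model, added for illustration)."""
    return peak_gflops / bandwidth_gb_s

# HD 5870 figures quoted in the abstract: 2720 GFLOP/s, 153.6 GB/s
intensity = ridge_point(2720.0, 153.6)  # roughly 17.7 FLOPs per byte
```

A kernel doing fewer than about 18 floating-point operations per byte fetched cannot reach the 2.72 TFLOP peak on this part, which is one reason profiling tools like the one described matter for optimization.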


Architectural Support for Programming Languages and Operating Systems | 2014

ParallelJS: An Execution Framework for JavaScript on Heterogeneous Systems

Jin Wang; Norman Rubin; Sudhakar Yalamanchili

JavaScript has been recognized as one of the most widely used scripting languages. Optimizations of JavaScript engines on mainstream web browsers enable efficient execution of JavaScript programs on CPUs. However, running JavaScript applications on emerging heterogeneous architectures that feature massively parallel hardware such as GPUs has not been well studied. This paper proposes a framework for flexible mapping of JavaScript onto heterogeneous systems that have both CPUs and GPUs. The framework includes a frontend compiler, a construct library and a runtime system. JavaScript programs written with high-level constructs are compiled to GPU binary code and scheduled to GPUs by the runtime. Experiments show that the proposed framework achieves up to 26.8x speedup executing JavaScript applications on parallel GPUs over a mainstream web browser that runs on CPUs.
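The "high-level construct" idea, where the program describes a data-parallel operation and the runtime decides where it runs, can be modeled in a few lines. This is a conceptual sketch in Python, not ParallelJS itself; the `ParMap` class and backend argument are invented, and only a sequential CPU fallback is modeled.

```python
class ParMap:
    """Models a ParallelJS-style construct: the program states the
    per-element operation; a runtime chooses the execution target.
    Hypothetical sketch; GPU code generation is what the real
    framework's compiler and runtime provide."""
    def __init__(self, fn):
        self.fn = fn

    def run(self, data, backend="cpu"):
        if backend != "cpu":
            raise NotImplementedError("only the CPU fallback is modeled")
        return [self.fn(x) for x in data]
```

Separating the construct (what to compute per element) from the schedule (where to run it) is what lets the same JavaScript source fall back to the CPU when no GPU is available.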


Archive | 2001

Method for dynamically identifying pseudo-invariant instructions and their most common output values on frequently executing program paths

Richard J. Bagley; Dean M. Deaver; Chris Reeve; Norman Rubin


Archive | 2011

Systems and Methods for Improving Divergent Conditional Branches

Mark Leather; Norman Rubin; Brian Emberling; Michael Mantor

Collaboration


Dive into Norman Rubin's collaborations.

Top Co-Authors
Jin Wang

Georgia Institute of Technology


Sudhakar Yalamanchili

Georgia Institute of Technology
