Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Norm Rubin is active.

Publications


Featured research published by Norm Rubin.


International Conference on Parallel Architectures and Compilation Techniques | 2012

Shared memory multiplexing: a novel way to improve GPGPU throughput

Yi Yang; Ping Xiang; Mike Mantor; Norm Rubin; Huiyang Zhou

On-chip shared memory (a.k.a. local data share) is a critical resource to many GPGPU applications. In current GPUs, the shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released when the thread block is completed. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this paper, we propose software and/or hardware approaches to multiplex the shared memory among multiple thread blocks. Our proposed approaches are based on our observation that the current shared memory management reserves shared memory too conservatively, for the entire lifetime of a thread block. If the shared memory is allocated only when it is actually used and freed immediately after, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our proposed software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16KB shared memory, and 1.26X on GTX480 with 48KB shared memory). We also propose hardware support for shared memory multiplexing, which requires only minor changes to existing hardware and enables significant performance improvements (an average of 1.53X) with very little change in GPGPU code.
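The observation behind the paper, that shared memory is typically live for only part of a thread block's execution, can be illustrated with a minimal CUDA sketch. The kernel below is hypothetical and is not the paper's source-to-source transformation; it only shows two phases of a kernel reusing one shared buffer instead of reserving two for the block's entire lifetime.

```cuda
// Hypothetical sketch: two compute phases share one scratch buffer rather
// than each reserving its own for the whole lifetime of the thread block.
// Assumes n is a multiple of TILE so every thread reaches __syncthreads().
#define TILE 256

__global__ void two_phase_kernel(const float *in, float *out, int n)
{
    __shared__ float scratch[TILE];        // single reservation, reused twice

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Phase 1: stage data and combine with a neighbor's value.
    scratch[tid] = in[gid];
    __syncthreads();
    float phase1 = scratch[tid] + scratch[(tid + 1) % TILE];
    __syncthreads();                       // phase 1 no longer needs scratch

    // Phase 2: the same buffer holds a different working set.
    scratch[tid] = phase1 * 0.5f;
    __syncthreads();
    out[gid] = scratch[tid] + scratch[(tid + TILE - 1) % TILE];
}
```

The paper's software approaches automate this kind of reuse with a source-to-source compiler, and its hardware scheme allocates and releases shared memory dynamically so that more thread blocks fit on an SM.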


International Symposium on Computer Architecture | 2015

Dynamic thread block launch: a lightweight execution mechanism to support irregular applications on GPUs

Jin Wang; Norm Rubin; Albert Sidelnik; Sudhakar Yalamanchili

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of effectively harnessing the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) that extends the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
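For context, the device-side launch mechanism that DTBL builds on and is compared against is CUDA Dynamic Parallelism. The fragment below is a hypothetical sketch of that baseline, not of DTBL itself: the parent kernel launches a full nested kernel for each piece of dynamically discovered work, whereas DTBL would spawn lightweight thread blocks aggregated into an existing kernel.

```cuda
// Hypothetical CUDA Dynamic Parallelism baseline (requires -rdc=true and
// linking against cudadevrt): the parent launches a nested kernel per work
// item, which is the per-launch overhead DTBL is designed to avoid.
__global__ void expand_neighbors(const int *adj, int begin, int degree, int *next)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < degree)
        next[begin + i] = adj[begin + i];   // placeholder per-neighbor work
}

__global__ void bfs_parent(const int *adj, const int *offset,
                           const int *degree, int count, int *next)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < count && degree[v] > 0) {
        int threads = 128;
        int blocks  = (degree[v] + threads - 1) / threads;
        // Nested, device-side kernel launch for dynamically discovered work;
        // DTBL would instead spawn thread blocks coalesced into an existing
        // kernel, avoiding per-kernel launch overhead.
        expand_neighbors<<<blocks, threads>>>(adj, offset[v], degree[v], next);
    }
}
```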


International Conference on Supercomputing | 2013

Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Ping Xiang; Yi Yang; Mike Mantor; Norm Rubin; Lisa R. Hsu; Huiyang Zhou

State-of-the-art graphics processing units (GPUs) employ single-instruction multiple-data (SIMD) style execution to achieve both high computational throughput and energy efficiency. As previous works have shown, there exists significant computational redundancy in SIMD execution, where different execution lanes operate on the same operand values. Such value locality is referred to as uniform vectors. In this paper, we first show that besides redundancy within a uniform vector, different vectors can also have identical values. Then, we propose detailed architecture designs to exploit both types of redundancy. For redundancy within a uniform vector, we propose to either extend the vector register file with token bits or add a separate small scalar register file to eliminate redundant computations as well as redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, proposed originally for CPU architectures, to detect and eliminate redundancy. The elimination of redundant computations and data storage leads to both significant energy savings and performance improvement. Furthermore, we propose to leverage such redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors. Our detailed evaluation shows that our proposed design has low hardware overhead and achieves performance gains of up to 23.9% (12.0% on average), along with energy savings of up to 24.8% (12.6% on average), as well as 21.1% and 14.1% protection coverage for ALUs and register files, respectively.
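The uniform vectors the paper exploits arise whenever every lane of a SIMD unit holds the same operand value. The hypothetical kernel below makes the pattern visible at the source level: the operands that depend only on the block index are identical across all lanes of a warp, so a scalar register or token bit could represent them once instead of per lane.

```cuda
// Hypothetical kernel illustrating uniform vectors: `row` and `scale` are
// identical for every thread in the block, so the per-lane copies in the
// vector register file are redundant; `col` and the loads/stores are not.
__global__ void scale_rows(const float *in, float *out,
                           const float *row_scale, int width)
{
    int   row   = blockIdx.x;          // uniform across the block
    float scale = row_scale[row];      // uniform: same value in every lane

    int col = threadIdx.x;             // non-uniform: differs per lane
    if (col < width)
        out[row * width + col] = in[row * width + col] * scale;
}
```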


International Symposium on Computer Architecture | 2016

LaPerm: locality aware scheduler for dynamic parallelism on GPUs

Jin Wang; Norm Rubin; Albert Sidelnik; Sudhakar Yalamanchili

Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during dynamic nested kernel and thread block launches cannot be fully leveraged using current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when dynamic parallelism is introduced, since they are oblivious to the hierarchical reference locality. We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of child TBs, ii) bind them to the streaming multiprocessors (SMXs) occupied by their parent TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications executed on a cycle-level simulator employing dynamic parallelism demonstrate that LaPerm achieves an average of 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
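The parent-child reference locality that LaPerm targets appears because child work launched from a thread block usually touches data its parent just produced. The hypothetical dynamic-parallelism fragment below shows that data-reuse pattern; the scheduling itself is a hardware decision, so nothing in the code selects an SMX, which is exactly what LaPerm adds.

```cuda
// Hypothetical parent/child kernels: the child reads the segment its parent
// block just wrote. Scheduling the child's thread blocks on the parent's SMX
// (LaPerm's policy) lets those accesses hit in that SMX's caches.
__global__ void refine_segment(float *data, int begin, int end)
{
    int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end)
        data[i] *= 0.5f;               // reuses data the parent produced
}

__global__ void parent_pass(float *data, const int *seg_begin,
                            const int *seg_end, int nsegs)
{
    int s = blockIdx.x;                // one segment per parent thread block
    if (s >= nsegs) return;

    int begin = seg_begin[s], end = seg_end[s];
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        data[i] += 1.0f;               // parent produces the segment
    __syncthreads();

    if (threadIdx.x == 0) {
        int threads = 128;
        int blocks  = (end - begin + threads - 1) / threads;
        refine_segment<<<blocks, threads>>>(data, begin, end);
    }
}
```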


International Conference on Parallel Architectures and Compilation Techniques | 2008

GPU evolution: will graphics morph into compute?

Norm Rubin

In the last several years GPU devices have started to evolve into supercomputers. New, non-graphics features are rapidly appearing along with new, more general programming languages. One reason for the quick pace of change is that games and hardware evolve together: hardware vendors review the most popular games, looking for places to add hardware, while game developers review new hardware, looking for places to add more realism. Today, we see both GPU devices and games moving from a model of "looks real" to one of "acts real". One consequence of "acts real" is that evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs. We will review the difference between a CPU and a GPU. Then we will describe hardware changes added to the current generation of AMD graphics processors, including the introduction of traditional compute operations such as double precision, scatter/gather, and local memory. Along with new features, we have added new metrics like performance/watt and performance/dollar. The current AMD GPU processor delivers 9 gigaflops/watt and 5 gigaflops/dollar. For the last two generations, each AMD GPU has provided double the performance/watt of the prior machine. We believe the software community needs to become more aware of and appreciate these metrics. Because this has been a kind of co-evolution and not a process of radical change, current GPU devices have retained a number of odd-sounding transitional features, including fixed functions like memory systems that can do filtering, depth buffers, a rasterizer, and the like. Today, each of these remains because it is important for graphics performance. Software on GPU devices also shows transitional features. As AI, physics, and virtual reality start to become important, development frameworks have started to shift. Graphics APIs have added compute shaders. Finally, there has been a set of transitional programs implemented by graphics programmers whose only real connection with graphics is that the result is rendered. One early example is Toy Shop, which contains a weak physical simulation of rain on a window (it looks great, but the random number generator would not pass any kind of test). A more recent and better-acting program is March of the Froblins, an AI program related to robotic path calculations. This program both simulates large crowds of independent creatures and shows how massively parallel compute can benefit character-centric entertainment.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture

Ping Xiang; Yi Yang; Mike Mantor; Norm Rubin; Huiyang Zhou

Many-core architectures such as graphics processing units (GPUs) rely on thread-level parallelism (TLP) to overcome pipeline hazards. Consequently, each core in a many-core processor employs a relatively simple in-order pipeline with limited capability to exploit instruction-level parallelism (ILP). In this paper, we study the impact of ILP on throughput-oriented many-core architecture, including data bypassing, scoreboarding, and branch prediction. We show that these ILP techniques significantly reduce the performance dependency on TLP. This is especially useful for applications whose resource usage limits the hardware's ability to run a high number of threads concurrently. Furthermore, ILP techniques reduce the demand on on-chip resources to support high TLP. Given the workload-dependent impact of ILP, we propose a heterogeneous GPGPU architecture, consisting of both cores designed for high TLP and cores customized with ILP techniques. Our results show that our heterogeneous GPU architecture achieves high throughput as well as high energy and area efficiency compared to homogeneous designs.
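The ILP techniques the paper evaluates pay off when a single thread has independent instructions that the pipeline can overlap instead of stalling until another thread is switched in. At the source level the same property is often exposed with independent accumulators, as in this hypothetical kernel.

```cuda
// Hypothetical kernel with per-thread ILP: the four accumulators carry no
// dependences on one another, so a core with data bypassing and scoreboarding
// can keep several of these multiply-adds in flight per thread rather than
// relying solely on TLP to hide latency. Assumes n is a multiple of 4.
__global__ void dot_ilp(const float *a, const float *b, float *partial, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 4 * tid; i + 3 < n; i += 4 * stride) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    partial[tid] = s0 + s1 + s2 + s3;
}
```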


Symposium on Code Generation and Optimization | 2008

Issues and challenges in compiling for graphics processors

Norm Rubin

Graphics has been one of the best success stories of parallel processing. Using a unique combination of specialized hardware and a specialized programming model, game developers routinely write high-performance code using millions of threads. Each generation of graphics processors (GPUs) delivers higher performance and is more programmable than the last. Unlike CPUs, these processors are designed from the beginning to run highly parallel programs and fit a different programming model. We would claim that a GPU is much more like a traditional supercomputer than a desktop processor. The programming model for graphics, called the graphics pipeline, is quite different from any of the CPU models. We will include a quick overview of this model. We believe that the graphics programming model for utilizing the GPU will continue to be radically different from the CPU programming model for some time to come. The key feature that makes games perform well (at least to compiler writers) is that games are shipped as byte code and JIT compiled each time the game is started. The main focus of this talk is the AMD-provided JIT compiler. We will describe some of the design decisions made in the JIT compiler and give some statistics on how well they worked. For instance, we will explain what the compiler does with 256 thousand registers and why it would like more. Finally, we will highlight some of the open research problems raised by graphics hardware.


Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages | 2018

Diesel: DSL for linear algebra and neural net computations on GPUs

Venmugil Elango; Norm Rubin; Mahesh Ravishankar; Hariharan Sandanagobalane; Vinod Grover

We present a domain-specific language compiler, Diesel, for basic linear algebra and neural network computations, that accepts input expressions in an intuitive form and generates high-performing code for GPUs. The current trend is to represent a neural network as a computation DAG, where each node in the DAG corresponds to a single operation such as matrix-matrix multiplication, and map the individual operations to hand-tuned library functions provided by standard libraries such as cuBLAS and cuDNN. While this method takes advantage of readily available optimized library code to achieve good performance for individual operations, it is not possible to optimize across operations. In contrast, given a computation composed of several operations, Diesel generates a set of efficient device functions, where the code is optimized for the computation as a whole, using polyhedral compilation techniques. In addition, there are cases where the code needs to be specialized for specific problem sizes to achieve optimal performance. While standard libraries are written for parametric problem sizes (where problem sizes are provided at runtime), Diesel can accept problem sizes at compile time and generate specialized code. Experimental results show that the performance achieved by Diesel-generated code for individual operations is comparable to the highly tuned versions provided by standard libraries, while for composite computations, Diesel outperforms manually written versions.
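Diesel's DSL syntax is not reproduced here; instead, the hypothetical kernel below illustrates the kind of cross-operation optimization the abstract describes. Where a library-per-node approach would call a GEMM routine, write out the intermediate matrix, and then run separate bias and activation kernels, a fused kernel keeps the intermediate value in registers.

```cuda
// Hypothetical, deliberately naive fused kernel for C = relu(A * B + bias).
// Diesel's generated code comes from polyhedral compilation and is tiled and
// specialized; this sketch only shows why fusing across DAG nodes avoids
// materializing and re-reading the intermediate A*B matrix.
__global__ void fused_gemm_bias_relu(const float *A, const float *B,
                                     const float *bias, float *C,
                                     int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];

    acc += bias[col];                            // bias add fused in registers
    C[row * N + col] = acc > 0.f ? acc : 0.f;    // ReLU fused as well
}
```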


IEEE International Symposium on Workload Characterization | 2017

Moka: Model-based concurrent kernel analysis

Leiming Yu; Xun Gong; Yifan Sun; Qianqian Fang; Norm Rubin; David R. Kaeli

GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). However, the application throughput provided by CKE is subject to a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order, and inter-kernel dependencies. Minor changes in any of these factors can have a large impact on the effectiveness of CKE. In this paper, we present Moka, an empirical model for tuning concurrent kernel performance. Moka allows us to accurately predict the resulting performance and scalability of multi-kernel applications when using CKE. We consider both static and dynamic workload characteristics that impact the utility of CKE, and leverage these metrics to drive kernel scheduling decisions on NVIDIA GPUs. The underlying data transfer pattern and GPU resource contention are analyzed in detail. Our model is able to accurately predict the performance ceiling of concurrent kernel execution. We validate our model using several real-world applications that have multiple kernels that can run concurrently, and evaluate CKE performance on an NVIDIA Maxwell GPU. Our model is able to predict the performance of CKE applications accurately, providing estimates that differ by less than 12% from actual runtime performance. Using these estimates, we can quickly find the best CKE strategy for our applications to achieve improved application throughput. We believe we have developed a useful tool to aid application programmers in accelerating their applications using CKE.
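From the programmer's side, CKE is reached through CUDA streams: independent kernels submitted to different streams are eligible to overlap, and Moka models how much overlap (and therefore throughput) such launches actually deliver. The kernels and launch sizes below are hypothetical; only the stream API calls are standard CUDA.

```cuda
// Minimal CKE sketch: two independent kernels in separate streams may execute
// concurrently if the GPU has spare resources. Whether they actually overlap,
// and by how much, is what an empirical model such as Moka tries to predict.
__global__ void kernel_a(float *x, int n) { /* independent work on x */ }
__global__ void kernel_b(float *y, int n) { /* independent work on y */ }

void launch_concurrently(float *d_x, float *d_y, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same device, different streams: eligible for concurrent execution.
    kernel_a<<<(n + 255) / 256, 256, 0, s1>>>(d_x, n);
    kernel_b<<<(n + 255) / 256, 256, 0, s2>>>(d_y, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```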


International Conference on Parallel Architectures and Compilation Techniques | 2012

Many-thread aware instruction-level parallelism: architecting shader cores for GPU computing

Ping Xiang; Yi Yang; Mike Mantor; Norm Rubin; Huiyang Zhou

Collaboration


Dive into Norm Rubin's collaborations.

Top Co-Authors

Huiyang Zhou | North Carolina State University
Mike Mantor | Advanced Micro Devices
Ping Xiang | North Carolina State University
Yi Yang | Princeton University
Jin Wang | Georgia Institute of Technology
Sudhakar Yalamanchili | Georgia Institute of Technology
Yifan Sun | Northeastern University