
Publication


Featured research published by David B. Kirk.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Shane Ryoo; Christopher I. Rodrigues; Sara S. Baghsorkhi; Sam S. Stone; David B. Kirk; Wen-mei W. Hwu

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and 1.16X to 431X in total application time.
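
As a hedged illustration of two of the strategies named above (coalesced global memory access and on-chip memory reuse), the following CUDA sketch shows a conventional shared-memory tiled matrix multiply. It is not code from the paper; it assumes the matrix dimension n is a multiple of TILE and a launch of dim3 grid(n/TILE, n/TILE), block(TILE, TILE). Larger tiles raise per-thread and per-block resource usage and can lower the number of simultaneously active threads, which is exactly the balance the paper discusses.

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // on-chip staging for a tile of A
    __shared__ float Bs[TILE][TILE];   // on-chip staging for a tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Adjacent threads load adjacent addresses, so these global
        // reads coalesce into wide memory transactions.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each loaded element is reused TILE times from shared memory,
        // cutting global bandwidth demand by a factor of TILE.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}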


International Symposium on Memory Management | 2007

NVIDIA CUDA software and GPU parallel computing architecture

David B. Kirk

In the past, graphics processors were special-purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating-point processors. This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing. The architecture is a scalable, highly parallel design that delivers high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.
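
As a minimal sketch of the CUDA programming model described in the talk (not code from the talk itself), the vector-add program below launches a grid of thread blocks that the hardware schedules across however many multiprocessors a given GPU provides; only standard CUDA runtime calls are used.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // The grid scales with the problem size; the runtime and hardware
    // distribute blocks across the available multiprocessors.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}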


Communications of the ACM | 2010

Understanding throughput-oriented architectures

Michael Garland; David B. Kirk

For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs.
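
The contrast can be made concrete with a small hedged sketch (not from the article): the CPU version below processes elements one at a time on a latency-optimized core, while the CUDA version spreads the same SAXPY work over thousands of threads so that, while some warps wait on memory, others compute.

// Latency-oriented: a single thread walks the data serially.
void saxpy_cpu(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Throughput-oriented: a grid-stride loop lets thousands of threads
// each handle a few elements, hiding memory latency with parallelism.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}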


International Conference on Computer Graphics and Interactive Techniques | 1998

Unsolved problems and opportunities for high-quality, high-performance 3D graphics on a PC platform

David B. Kirk

In the late 1990's, graphics hardware is experiencing a dramatic board-to-chip integration reminiscent of the minicomputer-to-microprocessor revolution of the 1980's. Today, mass-market PCs are beginning to match the 3D polygon and pixel rendering of a 1992 Silicon Graphics Reality Engine™ system. The extreme pace of technology evolution in the PC market is such that within 1 or 2 years the performance of a mainstream PC will be very close to the highest-performance 3D workstations. At that time, quality and performance demands will dictate serious changes in PC architecture as well as changes in the rendering pipeline and algorithms. This paper will discuss several potential areas of change.

A GENERAL PROBLEM STATEMENT
The biggest focus of 3D graphics applications on the PC is interactive entertainment, or games. This workload is extremely dynamic, with continuous updating of geometry, textures, animation, lighting, and shading. Although in other applications such as Computer-Aided Design (CAD), models may be static and retained-mode or display-list APIs may be used, it is common in games that geometry and textures change regularly. A good operating assumption is that everything changes every frame. The assumption of pervasive change puts a large burden on both the bandwidth and calculation capabilities of the graphics pipeline.

GEOMETRY AND PIXEL THROUGHPUT
As a baseline, we'll start with some data and cycle counting of a reasonable workload for an interactive application. PC graphics hardware is capable of this throughput. As an example, this is a bandwidth analysis of a 400 MHz Intel Pentium II™ PC with an NVIDIA RIVA TNT™ graphics processor. This analysis does not derive from a specific application, but is simply a counting exercise. Many applications push one or more of these limits, but few programs stress all axes. For a typical application to achieve 1M triangles/second, 100M 32-bit pixels/second, and 2 textures/pixel requires:

1M triangles * 3 vertices/triangle * 32 bytes/vertex = 100 MB; triangle data crosses the bus 3-5 times (read, transformed, and written by the CPU, then read by the graphics processor), so simply copying triangle data requires 300-500 MB/second on the PC buses.
100M pixels * 8 bytes/pixel (32-bit RGBA, 32-bit Z/stencil) = 800 MB; with 50% overhead for read-modify-write, this requires 1.2 GB/second.
2 textures/pixel * 4 texels/texture * 2 bytes/texel at 100M pixels/second = 1.6 GB; a texture cache can create up to 4X reuse efficiency, so this requires 400 MB/second.

Assumptions here include: 32-byte vertices are Direct3D™ TLVertices (X,Y,Z,R,G,B,A,F,SR,SG,SB,W); triangle setup is done on the graphics processor; bilinear texture filtering; 16-bit texels are R5G6B5; 50% of pixels are written after a Z-buffer read/compare. Transferring triangle vertex data to the graphics processor from the CPU is commonly the bottleneck. This is different from typical workstations or the PCs of just 1 year ago, when transform and lighting calculation, fill rate, or texture rate were the limiting factors.

GEOMETRY REPRESENTATION
As pixel shading, texturing, and fill rates rise, the most constrained bottleneck in the system will increasingly become the creation and transfer of geometry information. The data required to represent a triangle comprises the bulk of system bus traffic in an aggressive 3D application.
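
The counting exercise above is plain arithmetic; the following short C program (a sketch, using only the rates and assumptions stated in the text) recomputes the three per-second bandwidth figures.

#include <stdio.h>

int main(void)
{
    /* Rates stated in the bandwidth analysis above (per second of rendering). */
    double tris = 1e6, verts_per_tri = 3, bytes_per_vertex = 32;
    double pixels = 100e6, bytes_per_pixel = 8;          /* 32-bit RGBA + 32-bit Z/stencil */
    double textures_per_pixel = 2, texels_per_texture = 4, bytes_per_texel = 2;

    double geom = tris * verts_per_tri * bytes_per_vertex;   /* 96 MB, rounded to 100 MB in the text */
    double fill = pixels * bytes_per_pixel * 1.5;             /* +50% read-modify-write overhead */
    double tex  = pixels * textures_per_pixel * texels_per_texture
                * bytes_per_texel / 4.0;                      /* 4X texture-cache reuse */

    printf("geometry: %.0f MB/s; bus traffic %.0f-%.0f MB/s (3-5 crossings)\n",
           geom / 1e6, geom * 3 / 1e6, geom * 5 / 1e6);
    printf("fill:     %.1f GB/s\n", fill / 1e9);
    printf("texture:  %.0f MB/s\n", tex / 1e6);
    return 0;
}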


IEEE Transactions on Parallel and Distributed Systems | 2015

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Javier Cabezas; Isaac Gelado; John E. Stone; Nacho Navarro; David B. Kirk; Wen-mei W. Hwu

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems and adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on a real runtime and hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA, double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
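
For context, the sketch below shows the kind of CUDA 4.0-era peer-access data exchange that the paper says the hardware and driver support enables; it uses only generic CUDA runtime calls and is not the HPE interface itself. It assumes a halo exchange between two GPUs, with send0/recv0 resident on GPU 0 and send1/recv1 on GPU 1; when peer access is unavailable, the peer copies fall back to staging through host memory (where pinned buffers help).

#include <cuda_runtime.h>

void exchange_halos(const float *send0, float *recv0,   // buffers on GPU 0
                    const float *send1, float *recv1,   // buffers on GPU 1
                    size_t bytes)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    cudaStream_t s0, s1;

    cudaSetDevice(0);
    cudaStreamCreate(&s0);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 address GPU 1 memory
    // GPU 0 pushes its boundary data into GPU 1's receive buffer; with peer
    // access enabled this is a direct peer DMA over the interconnect.
    cudaMemcpyPeerAsync(recv1, 1, send0, 0, bytes, s0);

    cudaSetDevice(1);
    cudaStreamCreate(&s1);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);
    cudaMemcpyPeerAsync(recv0, 0, send1, 1, bytes, s1);

    // Both directions proceed concurrently; wait for both to finish.
    cudaSetDevice(0);
    cudaStreamSynchronize(s0);
    cudaStreamDestroy(s0);
    cudaSetDevice(1);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s1);
}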


International Conference on Computer Graphics and Interactive Techniques | 2002

When will ray-tracing replace rasterization?

Kurt Akeley; David B. Kirk; Larry D. Seiler; Philipp Slusallek; Brad Grantham

Ray-tracing produces images of stunning quality but is difficult to make interactive. Rasterization is fast, but making realistic images with it requires splicing many different algorithms together. Both GPU and CPU hardware grow faster each year. Increased GPU performance facilitates new techniques for interactive realism, including high polygon counts, multipass rendering, and texture-intensive techniques such as bump mapping and shadows. On the other hand, increased CPU performance and dedicated ray-tracing hardware push the potential frame rate of ray-tracing ever higher.


International Conference on Computer Graphics and Interactive Techniques | 2003

Graphics architectures: the dawn of cinematic computing

David B. Kirk

A few short years ago, single-chip PC 3D graphics solutions arrived on the market at performance levels that rivaled professional workstations with multi-chip graphics pipelines. Since then, graphics performance has grown at a rate approaching doubling every 6 months, far exceeding Moore's Law. There is evidence that this geometric performance growth is not only possible, but inevitable. The reason lies in the way that graphics architectures have evolved, and the fact that this evolution has taken a very different path than CPUs. As GPUs become more flexible, powerful, and programmable, their architecture is well suited to embrace the parallelism that is inherent in graphics, shading, and other hard computational problems. Today's graphics processors are able to render cinematic-quality images interactively for games, professional applications, and authoring of content. How will cinematic computing change the field of graphics?


Programming Massively Parallel Processors (Third Edition): A Hands-on Approach | 2017

Chapter 21 – Conclusion and outlook

David B. Kirk; Wen-mei W. Hwu

This chapter summarizes the main parts of the book. It then concludes the book by offering an outlook on how parallel programming will continue to contribute to new innovations in science and technology.


Programming Massively Parallel Processors (Third Edition): A Hands-on Approach | 2017

Parallel patterns: convolution: An introduction to stencil computation

David B. Kirk; Wen-mei W. Hwu

This chapter presents convolution as an important parallel computation pattern. While convolution is used in many applications such as computer vision and video processing, it also represents a general pattern that forms the basis of many parallel algorithms. We start with the concept of convolution. We then present a basic parallel convolution algorithm whose execution speed is limited by DRAM bandwidth for accessing both the input and mask elements. We then introduce constant memory and a simple modification to the kernel and host code that practically eliminates DRAM accesses for the mask elements. This is followed by an input tiling kernel that eliminates most of the DRAM accesses for the input elements. We show that the code can be simplified with data caching in more recent devices. We then move on to a 2D convolution kernel, along with an analysis of the effectiveness of tiling as a function of tile size for 1D and 2D convolution.
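
In the spirit of the chapter (though not its exact code), the sketch below shows a basic 1D convolution kernel with the mask placed in constant memory: mask reads are served by the constant cache and broadcast to all threads, while the input elements still come from global memory, which the chapter then addresses with tiling.

#define MAX_MASK_WIDTH 7

// Convolution mask in constant memory: read-only, cached, broadcast to threads.
__constant__ float d_mask[MAX_MASK_WIDTH];

__global__ void conv1d_basic(const float *in, float *out, int mask_width, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    int start = i - mask_width / 2;
    for (int j = 0; j < mask_width; ++j) {
        int idx = start + j;
        if (idx >= 0 && idx < n)          // ghost elements treated as zero
            acc += in[idx] * d_mask[j];
    }
    out[i] = acc;
}

// Host side, before the launch: copy the mask into constant memory, e.g.
// cudaMemcpyToSymbol(d_mask, h_mask, mask_width * sizeof(float));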


Programming Massively Parallel Processors (Third Edition): A Hands-on Approach | 2017

Chapter 14 – Application case study—non-Cartesian magnetic resonance imaging: An introduction to statistical estimation methods

David B. Kirk; Wen-mei W. Hwu

This chapter presents an application study on using CUDA C and GPU computing to accelerate an iterative solver for reconstruction of an MRI image from non-Cartesian scan data. It covers the process of identifying the appropriate type of parallelism, loop transformations, mapping data into constant memory, mapping data into registers, data layout transformations, using special hardware instructions, and experimental tuning. It also demonstrates a process of validating the design choices with domain-specific criteria.
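
To make a few of those techniques concrete, here is a heavily simplified sketch (hypothetical names, schematic arithmetic rather than the study's exact formulation): a chunk of k-space sample data sits in constant memory, each thread accumulates one voxel's contribution in registers, and the trigonometric terms use hardware special-function intrinsics. CHUNK is an assumed per-launch sample count, and m is assumed to be at most CHUNK.

#define CHUNK 1024   // hypothetical number of k-space samples per kernel launch

// Sample trajectory and data for the current chunk, in constant memory so all
// threads read them through the constant cache instead of DRAM.
__constant__ float kx_c[CHUNK], ky_c[CHUNK], kz_c[CHUNK];
__constant__ float rmu_c[CHUNK], imu_c[CHUNK];

__global__ void fhd_sketch(const float *x, const float *y, const float *z,
                           float *r_out, float *i_out, int n_vox, int m)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one voxel per thread
    if (v >= n_vox) return;

    // Voxel coordinates and accumulators stay in registers for the whole loop.
    float xv = x[v], yv = y[v], zv = z[v];
    float r_acc = r_out[v], i_acc = i_out[v];

    for (int s = 0; s < m; ++s) {
        float arg = 6.2831853f * (kx_c[s] * xv + ky_c[s] * yv + kz_c[s] * zv);
        float c  = __cosf(arg);                      // special-function-unit intrinsics
        float sn = __sinf(arg);
        r_acc += rmu_c[s] * c - imu_c[s] * sn;       // schematic complex accumulation
        i_acc += imu_c[s] * c + rmu_c[s] * sn;
    }
    r_out[v] = r_acc;
    i_out[v] = i_acc;
}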
