Brucek Khailany | Researchain

Publication

Featured research published by Brucek Khailany.

international symposium on microarchitecture | 2001

Imagine: media processing with streams

Brucek Khailany; William J. Dally; Ujval J. Kapasi; Peter R. Mattson; John D. Owens; Brian Towles; Andrew Chang; Scott Rixner

The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.

international symposium on microarchitecture | 2011

GPUs and the Future of Parallel Computing

Stephen W. Keckler; William J. Dally; Brucek Khailany; Michael Garland; David B. Glasco

This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

IEEE Computer | 2003

Programmable stream processors

Ujval J. Kapasi; Scott Rixner; William J. Dally; Brucek Khailany; Jung Ho Ahn; Peter R. Mattson; John D. Owens

The demand for flexibility in media processing motivates the use of programmable processors. Stream processing bridges the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications. The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. The performance of the Imagine stream processor on these media applications is presented.
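The streams-and-kernels organization described above can be sketched in a few lines of Python. This is only an illustration of the structure, not Imagine's actual programming interface: the kernel names (`scale_kernel`, `clamp_kernel`), the pixel data, and the use of generators are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def scale_kernel(pixels: Iterable[int], gain: int) -> Iterator[int]:
    """Hypothetical kernel: applies the same operation to every stream element."""
    for p in pixels:
        yield p * gain


def clamp_kernel(pixels: Iterable[int], lo: int, hi: int) -> Iterator[int]:
    """Hypothetical kernel: clamps every stream element into [lo, hi]."""
    for p in pixels:
        yield max(lo, min(hi, p))


def run_pipeline(source: Iterable[int]) -> List[int]:
    # The output stream of one kernel is the input stream of the next, so
    # intermediate values pass directly from producer to consumer.
    scaled = scale_kernel(source, gain=2)
    clamped = clamp_kernel(scaled, lo=0, hi=255)
    return list(clamped)


result = run_pipeline([10, 100, 200, -5])
```

Each kernel touches only its own stream elements, which is the locality the abstract refers to, and every element is processed independently, which is the concurrency a stream processor can exploit.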

high performance computer architecture | 2000

Register organization for media processing

Scott Rixner; William J. Dally; Brucek Khailany; Peter R. Mattson; Ujval J. Kapasi; John D. Owens

Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis and image understanding, require arithmetic rates of up to 10^11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay and power of the arithmetic units. In this paper, we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level-parallel and memory-hierarchy axes, and by optimizing the hierarchical register organization for operation on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay and power dissipation of a media processor by factors of 195, 230 and 430 respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.

international symposium on microarchitecture | 1998

A bandwidth-efficient architecture for media processing

Scott Rixner; William J. Dally; Ujval J. Kapasi; Brucek Khailany; Abelardo López-Lagunas; Peter R. Mattson; John D. Owens

Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21 respectively. This bandwidth efficiency enables a single-chip Imagine processor to achieve a peak performance of 16.2 GFLOPS (single-precision floating point) and sustained performance of up to 8.5 GFLOPS on media processing kernels.

international conference on computer design | 2002

The Imagine Stream Processor

Ujval J. Kapasi; William J. Dally; Scott Rixner; John D. Owens; Brucek Khailany

The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enables it to achieve such high arithmetic rates. Imagine executes applications that have been mapped to the stream programming model. The stream model decomposes applications into a set of computation kernels that operate on data streams. This mapping exposes the inherent locality and parallelism in the application, and Imagine exploits the locality and parallelism to provide a scalable architecture that supports 48 ALUs on a single chip. This paper presents the Imagine architecture and programming model in the first half and explores the scalability of the Imagine architecture in the second half.

international symposium on computer architecture | 2004

Evaluating the Imagine Stream Architecture

Jung Ho Ahn; William J. Dally; Brucek Khailany; Ujval J. Kapasi; Abhishek Das

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine (Kapasi et al., 2002) is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance. Of the remainder, on average 36.4% of time is lost due to load imbalance between arithmetic units in the VLIW clusters and limited instruction-level parallelism within kernel inner loops, 10.6% is due to kernel startup and shutdown overhead because of short stream lengths, 7.6% is due to memory stalls, and the rest is due to insufficient host processor bandwidth. Further analysis included in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
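A few derived figures follow directly from the numbers quoted in this abstract. The short calculation below is a sketch only: the variable names are hypothetical, and comparing the QR result with the 7.96 GFLOPS micro-benchmark figure assumes both were measured at the same operating point, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract above.
simd_clusters = 8
vliw_width_per_cluster = 6
measured_peak_gflops = 7.96      # attained on synthetic micro-benchmarks
qr_sustained_gflops = 4.81       # QR decomposition at 200 MHz
qr_power_watts = 7.42

# Eight SIMD clusters with a 6-wide VLIW each give the 48 arithmetic units.
arithmetic_units = simd_clusters * vliw_width_per_cluster

# Energy efficiency on QR decomposition, in GFLOPS per watt.
gflops_per_watt = qr_sustained_gflops / qr_power_watts

# Fraction of the micro-benchmark peak that QR sustains (assumption: same
# operating point for both measurements).
fraction_of_measured_peak = qr_sustained_gflops / measured_peak_gflops

print(f"{arithmetic_units} units, {gflops_per_watt:.2f} GFLOPS/W, "
      f"{fraction_of_measured_peak:.0%} of measured peak")
```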

international symposium on microarchitecture | 2000

Efficient conditional operations for data-parallel architectures

Ujval J. Kapasi; William J. Dally; Scott Rixner; Peter R. Mattson; John D. Owens; Brucek Khailany

Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallel architectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams.
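The idea of turning data-dependent control into data-dependent data movement can be illustrated with a small Python sketch. It is not the mechanism from the paper: the helper `conditional_split`, the even/odd predicate, and the two arithmetic kernels are assumptions chosen to keep the example short.

```python
from typing import Callable, Iterable, List, Tuple


def branchy(stream: Iterable[int]) -> List[int]:
    """Baseline: every element takes a data-dependent branch inside the kernel."""
    return [x // 2 if x % 2 == 0 else 3 * x + 1 for x in stream]


def conditional_split(
    stream: Iterable[int], predicate: Callable[[int], bool]
) -> Tuple[List[int], List[int]]:
    """Route each element to one of two output streams based on its value."""
    taken: List[int] = []
    not_taken: List[int] = []
    for x in stream:
        (taken if predicate(x) else not_taken).append(x)
    return taken, not_taken


def with_conditional_streams(stream: Iterable[int]) -> Tuple[List[int], List[int]]:
    # The branch becomes data movement; each kernel below is then uniform
    # and applies the same operation to every element of its input stream.
    evens, odds = conditional_split(stream, lambda x: x % 2 == 0)
    halved = [x // 2 for x in evens]
    tripled = [3 * x + 1 for x in odds]
    return halved, tripled


data = [1, 2, 3, 4, 5, 6]
halved, tripled = with_conditional_streams(data)
```

The two versions produce the same set of results, but the split version changes the element order, so any consumer that depends on the original ordering would need to account for that.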

ieee international conference on high performance computing data and analytics | 2011

CudaDMA: optimizing GPU memory bandwidth via warp specialization

Michael Bauer; Henry Cook; Brucek Khailany

As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37× on representative synthetic micro-benchmarks, and 1.15×-3.2× on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
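The producer-consumer structure behind warp specialization can be mimicked on a CPU with two Python threads. This is an analogy only and does not use or reproduce the CudaDMA API: the function `run_specialized`, the queue-based handshake, and the buffer count are all assumptions standing in for DMA warps, compute warps, and on-chip buffers.

```python
import queue
import threading
from typing import List


def run_specialized(data: List[int], tile_size: int, num_buffers: int = 2) -> List[int]:
    """One 'DMA' thread stages tiles into buffers; one 'compute' thread consumes them."""
    free_buffers: queue.Queue = queue.Queue()   # buffers ready to be filled
    full_buffers: queue.Queue = queue.Queue()   # buffers ready to be consumed
    for _ in range(num_buffers):
        free_buffers.put([0] * tile_size)
    partial_sums: List[int] = []

    def dma_worker() -> None:
        # Only moves data: copies each tile from 'off-chip' data into a buffer.
        for start in range(0, len(data), tile_size):
            buf = free_buffers.get()            # wait until a buffer is free
            tile = data[start:start + tile_size]
            buf[:len(tile)] = tile
            full_buffers.put((buf, len(tile)))  # signal the consumer
        full_buffers.put(None)                  # no more tiles

    def compute_worker() -> None:
        # Only computes: never touches the source array directly.
        while True:
            item = full_buffers.get()           # wait until a tile is staged
            if item is None:
                break
            buf, count = item
            partial_sums.append(sum(buf[:count]))
            free_buffers.put(buf)               # hand the buffer back

    workers = [threading.Thread(target=dma_worker),
               threading.Thread(target=compute_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return partial_sums


sums = run_specialized(list(range(10)), tile_size=4)
```

With two buffers, the data-movement thread can stage the next tile while the compute thread is still working on the current one, which is the overlap the abstract attributes to separate DMA warps.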

international symposium on microarchitecture | 2012

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

Mark Gebhart; Stephen W. Keckler; Brucek Khailany; Ronny Krashinsky; William J. Dally

Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.
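The per-application partitioning described above can be sketched as a small allocation function. Everything concrete here is an assumption for illustration: the abstract gives no capacities, and the policy of handing all leftover storage to the cache is simply one plausible choice, not necessarily the paper's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    registers_kb: int
    scratchpad_kb: int
    cache_kb: int


def partition_unified(total_kb: int, registers_kb: int, scratchpad_kb: int) -> Partition:
    """Split one pool of local storage according to an application's needs.

    Assumed policy: registers and scratchpad get what the application asks
    for, and whatever remains is used as cache.
    """
    needed = registers_kb + scratchpad_kb
    if needed > total_kb:
        raise ValueError(f"application needs {needed} KB but only {total_kb} KB exist")
    return Partition(registers_kb, scratchpad_kb, total_kb - needed)


# Hypothetical 384 KB pool configured for two applications with different needs.
register_heavy = partition_unified(384, registers_kb=256, scratchpad_kb=64)
cache_heavy = partition_unified(384, registers_kb=64, scratchpad_kb=0)
```

A hard-partitioned design corresponds to fixing one `Partition` at design time for every application, whereas the unified approach chooses it per application.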
