Justin Hensley

University of North Carolina at Chapel Hill

Publication

Featured research published by Justin Hensley.

IEEE Visualization | 2001

PixelFlex: a reconfigurable multi-projector display system

Ruigang Yang; David Gotz; Justin Hensley; Herman Towles; Michael S. Brown

This paper presents PixelFlex - a spatially reconfigurable multi-projector display system. The PixelFlex system is composed of ceiling-mounted projectors, each with computer-controlled pan, tilt, zoom and focus; and a camera for closed-loop calibration. Working collectively, these controllable projectors function as a single logical display capable of being easily modified into a variety of spatial formats of differing pixel density, size and shape. New layouts are automatically calibrated within minutes to generate the accurate warping and blending functions needed to produce seamless imagery across planar display surfaces, thus giving the user the flexibility to quickly create, save and restore multiple screen configurations. Overall, PixelFlex provides a new level of automatic reconfigurability and usage, departing from the static, one-size-fits-all design of traditional large-format displays. As a front-projection system, PixelFlex can be installed in most environments with space constraints and requires little or no post-installation mechanical maintenance because of the closed-loop calibration.

Computer Graphics Forum | 2005

Fast Summed-Area Table Generation and its Applications

Justin Hensley; Thorsten Scheuermann; Greg Coombe; Montek Singh; Anselmo Lastra

We introduce a technique to rapidly generate summed-area tables using graphics hardware. Summed-area tables, originally introduced by Crow, provide a way to filter arbitrarily large rectangular regions of an image in a constant amount of time. Our algorithm for generating summed-area tables, similar to a technique used in scientific computing called recursive doubling, allows the generation of a summed-area table in O(log n) time. We also describe a technique to mitigate the precision requirements of summed-area tables. The ability to calculate and use summed-area tables at interactive rates enables numerous interesting rendering effects. We present several possible applications. First, the use of summed-area tables allows real-time rendering of interactive, glossy environmental reflections. Second, we present glossy planar reflections with varying blurriness dependent on a reflected object's distance to the reflector. Third, we show a technique that uses a summed-area table to render glossy transparent objects. The final application demonstrates an interactive depth-of-field effect using summed-area tables.
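The recursive-doubling idea mentioned in this abstract can be illustrated with a minimal CPU sketch in Python. This is not the paper's GPU implementation: the function names are hypothetical, and each list comprehension merely stands in for one data-parallel render pass that reads only the previous pass's output. The point is that a row of n values needs about log2(n) passes rather than n sequential additions.

```python
def prefix_sum_recursive_doubling(values):
    """Inclusive prefix sum in ceil(log2(n)) passes (hypothetical helper)."""
    out = list(values)
    n = len(out)
    stride = 1
    while stride < n:
        # One "pass": every element adds the value `stride` positions to its
        # left, reading only from the previous pass's result.
        out = [out[i] + (out[i - stride] if i >= stride else 0) for i in range(n)]
        stride *= 2
    return out


def summed_area_table(image):
    """Build a summed-area table: prefix-sum the rows, then the columns."""
    rows = [prefix_sum_recursive_doubling(r) for r in image]        # horizontal phase
    cols = [prefix_sum_recursive_doubling(c) for c in zip(*rows)]   # vertical phase
    return [list(r) for r in zip(*cols)]                            # transpose back


if __name__ == "__main__":
    for row in summed_area_table([[1, 2, 3], [4, 5, 6], [7, 8, 9]]):
        print(row)
```

Each entry of the result holds the sum of all image values above and to the left of it, inclusive.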

International Symposium on Microarchitecture | 1999

Exploiting ILP in page-based intelligent memory

Mark Oskin; Justin Hensley; Diana Keen; Frederic T. Chong; Matthew K. Farrens; Aneet Chopra

This study compares the speed, area, and power of different implementations of Active Pages, an intelligent memory system which helps bridge the growing gap between processor and memory performance by associating simple functions with each page of data. Previous investigations have shown up to 1000X speedups using a block of reconfigurable logic to implement these functions next to each subarray on a DRAM chip. In this study, we show that instruction-level parallelism, not hardware specialization, is the key to the previous success with reconfigurable logic. In order to demonstrate this fact, an Active Page implementation based upon a simplified VLIW processor was developed. Unlike conventional VLIW processors, power and area constraints lead to a design which has a small number of pipeline stages. Our results demonstrate that a four-wide VLIW processor attains comparable performance to that of pure FPGA logic but requires significantly less area and power.

IEEE Transactions on Computers | 2003

Cache coherence in intelligent memory systems

Diana Keen; Mark Oskin; Justin Hensley; Frederic T. Chong

The Active Pages model of intelligent memory can speed up data-intensive applications by up to two to three orders of magnitude over conventional systems. A fundamental problem with intelligent memory, however, arises when data cached by the processor is modified by logic in the memory. The Active Page model inherently limits sharing, keeping coherence tractable, but exacerbates saturation problems. We first present a hybrid snoopy/directory protocol for use in Active Pages. Limited sharing allows for a low-latency, low-bandwidth hybrid protocol. A transparent remapping mechanism is added for efficient caching. On smaller data sizes, explicit flushing and hardware coherence exhibit similar performance, but hardware coherence is easier to program and uses less bandwidth. Finally, we examine SMP multiprocessor systems to mitigate saturation effects. As the number of threads increases, the bandwidth needs increase, making hardware coherence even more attractive.

International Conference on Computer Design | 2004

An area- and energy-efficient asynchronous Booth multiplier for mobile devices

Justin Hensley; Anselmo Lastra; Montek Singh

The recent explosion in the number of handheld multimedia devices has created a need for energy-efficient computation due to limited battery lifetimes. We focus on multiplication, which is needed in several application domains, e.g., 3D graphics, signal processing, and cryptography. We introduce an asynchronous implementation of a plain Booth multiplier (i.e., radix-2), which is both area- and energy-efficient, and therefore suitable for mobile applications. This paper makes the following contributions. First, a novel counterflow organization is introduced, in which the data bits flow in one direction, and the Booth commands piggyback on the acknowledgments flowing in the opposite direction. Second, the arithmetic and shifter units are merged together to obtain significant improvements in area, energy, and speed. Third, our design performs overlapped execution of multiple iterations of the Booth algorithm. Finally, the design is quite modular, which allows scaling to arbitrary operand widths, without gate resizing or cycle time overheads. SPICE simulations in a 0.18 μm TSMC process at 1.8 V indicate promising performance: the multiplier takes 1.08 ns per Booth iteration, regardless of the operand widths, thereby demonstrating the scalability of our approach. In addition, the multiplier is fully functional at reduced supply voltages (e.g., 1.0 V), and thus capable of dynamically trading off performance for energy efficiency.
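For readers unfamiliar with the plain (radix-2) Booth algorithm that this design implements in hardware, the following Python sketch shows the arithmetic only. It does not model the asynchronous counterflow pipeline, and the function name and `width` parameter are assumptions introduced for illustration. One iteration examines one multiplier bit together with its lower neighbour and issues an add, a subtract, or no operation.

```python
def booth_multiply_radix2(multiplicand, multiplier, width):
    """Multiply signed integers using radix-2 Booth recoding (illustrative).

    `multiplier` is interpreted as a `width`-bit two's-complement value.
    """
    q = multiplier & ((1 << width) - 1)   # two's-complement bit pattern
    product = 0
    prev_bit = 0                          # implicit bit to the right of bit 0
    for i in range(width):                # one Booth iteration per bit
        bit = (q >> i) & 1
        if (bit, prev_bit) == (0, 1):     # end of a run of ones: add
            product += multiplicand << i
        elif (bit, prev_bit) == (1, 0):   # start of a run of ones: subtract
            product -= multiplicand << i
        prev_bit = bit
    return product


if __name__ == "__main__":
    print(booth_multiply_radix2(7, -3, 8))
```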

International Conference on Computer Graphics and Interactive Techniques | 2005

Interactive summed-area table generation for glossy environmental reflections

Justin Hensley; Thorsten Scheuermann; Montek Singh; Anselmo Lastra

There are many applications in computer graphics where spatially varying filters are useful. One example is the rendering of glossy reflections. Unlike perfectly reflective materials, which only require a single radiance sample in the direction of the reflection vector, glossy materials require integration over a solid angle. Blurring by filtering the reflected image with a support dependent on the surface's BRDF can approximate this effect. This is currently done by pre-filtering off-line, which limits the technique to static environments. Crow [1984] introduced summed-area tables to enable more general texture filtering than was possible with mip maps. Once generated, a summed-area table provides a means to evaluate a spatially varying box filter with a constant number of texture reads.
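The constant number of reads noted at the end of this abstract can be sketched as follows in Python. The table is built sequentially here for brevity, the function names are hypothetical, and rectangle bounds are assumed to be inclusive pixel coordinates; a shader would perform the same four lookups as texture reads.

```python
def build_sat(image):
    """Sequentially build a summed-area table for a 2D list of numbers."""
    height, width = len(image), len(image[0])
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        running = 0                       # sum of the current row so far
        for x in range(width):
            running += image[y][x]
            sat[y][x] = running + (sat[y - 1][x] if y > 0 else 0)
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum over the inclusive rectangle [x0..x1] x [y0..y1] with four reads."""
    def read(x, y):
        # Reads outside the table (index -1) count as zero.
        return sat[y][x] if x >= 0 and y >= 0 else 0
    return read(x1, y1) - read(x0 - 1, y1) - read(x1, y0 - 1) + read(x0 - 1, y0 - 1)


def box_average(sat, x0, y0, x1, y1):
    """Box-filtered value: the rectangle sum divided by its area."""
    area = (x1 - x0 + 1) * (y1 - y0 + 1)
    return box_sum(sat, x0, y0, x1, y1) / area


if __name__ == "__main__":
    table = build_sat([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(box_average(table, 1, 1, 2, 2))
```

Because the cost of `box_sum` does not depend on the rectangle size, the filter width can vary per pixel, which is what makes the spatially varying blur affordable.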

IEEE International Symposium on Asynchronous Circuits and Systems | 2005

A scalable counterflow-pipelined asynchronous radix-4 Booth multiplier

Justin Hensley; Anselmo Lastra; Montek Singh

This paper introduces an asynchronous radix-4 Booth multiplier architecture, which is scalable to arbitrary operand lengths while maintaining a constant cycle time per Booth iteration. It has several novel features, including: (i) a novel counterflow organization, in which the data bits flow in one direction and the Booth commands piggyback on the acknowledgments flowing in the opposite direction; (ii) overlapped execution of multiple iterations of the Booth algorithm; and (iii) design modularity and bit-level pipelining, which enable the multiplier to be scaled to arbitrary operand widths without requiring gate resizing or cycle time overheads. SPICE simulations in a 0.18 μm TSMC CMOS process at 1.8 V indicate promising performance: the multiplier takes 640-650 ps per Booth iteration, regardless of the operand widths, thereby demonstrating the scalability of our approach. For 16-bit operands, this performance corresponds to nearly 200 Mops/s throughput. Furthermore, the multiplier is fully functional at reduced supply voltages (e.g., 1.5 V and 1.0 V), and thus capable of dynamically trading off performance for energy efficiency.
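As a rough illustration of what a radix-4 Booth iteration computes, the Python sketch below recodes the multiplier two bits at a time into digits in {-2, -1, 0, 1, 2}, so a 16-bit operand needs 8 iterations. It covers the arithmetic only, not the counterflow pipeline or its timing, and the function name and `width` parameter are assumptions made for this example.

```python
def booth_multiply_radix4(multiplicand, multiplier, width):
    """Multiply signed integers using radix-4 Booth recoding (illustrative).

    `multiplier` is interpreted as a `width`-bit two's-complement value;
    odd widths are sign-extended by one bit so the bits pair up evenly.
    """
    w = width + (width % 2)
    q = multiplier & ((1 << w) - 1)       # two's-complement bit pattern
    product = 0
    prev_bit = 0                          # implicit bit to the right of bit 0
    for i in range(0, w, 2):              # one iteration per bit pair
        b0 = (q >> i) & 1
        b1 = (q >> (i + 1)) & 1
        digit = prev_bit + b0 - 2 * b1    # Booth digit in {-2, -1, 0, 1, 2}
        product += (digit * multiplicand) << i
        prev_bit = b1
    return product


if __name__ == "__main__":
    print(booth_multiply_radix4(7, -3, 8))
```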

Parallel Processing Letters | 2000

Algorithmic Complexity with Page-Based Intelligent Memory

Mark Oskin; Lucian Vlad Lita; Frederic T. Chong; Justin Hensley; Diana Keen

High DRAM densities will make intelligent memory chips a commodity in the next five years [1] [2]. This paper focuses upon a promising model of computation in intelligent memory, Active Pages [3], where computation is associated with each page of memory. Computational hardware scales linearly and inexpensively with data size in this model, reducing the order of many algorithms. This scaling can, for example, reduce linear-time algorithms to O(√n). When page-based intelligent memory chips become available as commodity parts, they will change the way programmers select and utilize algorithms. In this paper, we analyze the asymptotic performance of several common algorithms as problem sizes scale. We also derive the optimal page size, as a function of problem size, for each algorithm running with intelligent memory. Finally, we validate these analyses with simulation results.

International Conference on Computer Design | 2000

Reducing cost and tolerating defects in page-based intelligent memory

Mark Oskin; Diana Keen; Justin Hensley; Lucian Vlad Lita; Frederic T. Chong

Active Pages is a page-based model of intelligent memory specifically designed to support virtualized hardware resources. Previous work has shown substantial performance benefits from offloading data-intensive tasks to a memory system that implements Active Pages. With a simple VLIW processor embedded near each page on DRAM, Active Page memory systems achieve up to 1000X speedups over conventional memory systems. In this study, we examine Active Page memories that share, or multiplex, embedded VLIW processors across multiple physical Active Pages. We explore the trade-off between individual page-processor performance and page-level multiplexing. We find that hardware costs of computational logic can be reduced from 31% of DRAM chip area to 12%, through multiplexing, without significant loss in performance. Furthermore, manufacturing defects that disable up to 50% of the page processors can be tolerated through efficient resource allocation and associative multiplexing.

Parallel Processing Letters | 2002

Operating Systems Techniques for Parallel Computation in Intelligent Memory

Mark Oskin; Diana Keen; Justin Hensley; Lucian Vlad Lita; Frederic T. Chong

Advances in DRAM density have led to several proposals to perform computation in memory [1] [2] [3]. Active Pages is a page-based model of intelligent memory that can exploit large amounts of parallel computation in data-intensive applications. With a simple VLIW processor embedded near each page on DRAM, Active Page memory systems achieve up to 1000X speedups over conventional memory systems [4]. Active Pages are specifically designed to support virtualized hardware resources. In this study, we examine operating system techniques that allow Active Page memories to share, or multiplex, embedded VLIW processors across multiple physical Active Pages. We explore the trade-off between individual page-processor performance and page-level multiplexing. We find that hardware costs of computational logic can be reduced from 31% of DRAM chip area to 12%, through multiplexing, without significant loss in performance. Furthermore, manufacturing defects that disable up to 50% of the page processors can be tolerated through efficient resource allocation and associative multiplexing.
