Erik Brunvand | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Erik Brunvand is active.

Explore More

Publication

Featured researches published by Erik Brunvand.

high-performance computer architecture | 1999

Impulse: building a smarter memory controller

John B. Carter; Wilson C. Hsieh; Leigh Stoller; Mark R. Swanson; Lixin Zhang; Erik Brunvand; Al Davis; Chen-Chi Kuo; Ravindra Kuramkote; Michael A. Parker; Lambert Schaelicke; Terry Tateyama

Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided memory-bound applications of commercial importance, such as database and multimedia programs.

international conference on computer aided design | 1989

Translating concurrent programs into delay-insensitive circuits

Erik Brunvand; Robert F. Sproull

A method is presented for automatically translating a concurrent program into an asynchronous circuit. The translation procedure involves a simple syntax-directed translation from program constructs into initial asynchronous circuits. The resulting circuits are improved with correctness-preserving circuit-to-circuit transformations similar to peephole optimization in conventional compilers. Because these steps can be proved to be correct, the programmer is guaranteed that any specification met by the program will also be met by the circuit. A system has been constructed to perform the translation automatically. A brief description of the method is given, followed by two examples of programs translated into circuits.<<ETX>>

hawaii international conference on system sciences | 1993

The NSR processor

Erik Brunvand

The NSR processor is a general-purpose computer structured as a collection of self-timed blocks. These blocks operate concurrently and cooperate by communicating with other blocks using self-timed communication protocols. The blocks that make up the NSR processor correspond to standard synchronous pipeline stages such as instruction fetch, instruction decode, execute, memory interface and register file, but each operates concurrently as a separate self-timed process. In addition to being internally self-timed, the units are decoupled through self-timed first-in first-out (FIFO) queues between each of the units which allows a high degree of overlap in instruction execution. Branches, jumps, and memory accesses are also decoupled through the use of additional FIFO queues which can hide the execution latency of these instructions. A prototype implementation of the NSR processor has been constructed using Actel FPGAs (field programmable gate arrays).<<ETX>>

2006 IEEE Symposium on Interactive Ray Tracing | 2006

Estimating Performance of a Ray-Tracing ASIC Design

Sven Woop; Erik Brunvand; Philipp Slusallek

Recursive ray tracing is a powerful rendering technique used to compute realistic images by simulating the global light transport in a scene. Algorithmic improvements and FPGA-based hardware implementations of ray tracing have demonstrated realtime performance but hardware that achieves performance levels comparable to commodity rasterization graphics chips is still not available. This paper describes the architecture and ASIC implementations of the DRPU design (dynamic ray processing unit) that closes this performance gap. The DRPU supports fully programmable shading and most kinds of dynamic scenes and thus provides similar capabilities as current GPUs. It achieves high efficiency due to SIMD processing of floating point vectors, massive multithreading, synchronous execution of packets of threads, and careful management of caches for scene data. To support dynamic scenes B-KD trees are used as spatial index structures that are processed by a custom traversal and intersection unit and modified by an update processor on scene changes. The DRPU architecture is specified as a high-level structural description in a functional language and mapped to both FPGA and ASIC implementations. Our FPGA prototype clocked at 66 MHz achieves higher ray tracing performance than CPU-based ray tracers even on a modern multi-GHz CPU. We provide performance results for two 130 nm ASIC versions and estimate what performance would be using a 90 nm CMOS process. For a 90nm version with a 196 mm2 die we conservatively estimate clock rates of 400 MHz and ray tracing performance of 80 to 290 fps at 1024times768 resolution in our test scenes. This estimated performance is 70 times faster than what is achievable with standard multi-GHz desktop CPUs

field programmable gate arrays | 1993

Using FPGAs to implement self-timed systems

Erik Brunvand

Asynchronous or self-timed systems that do not rely on a global clock to keep system components synchronized can offer significant advantages over traditional clocked circuits in a variety of applications. In order to ease the complexity of this style of design, however, suitable self-timed circuit primitives must be available to the system designer. This article describes a technique for building self-timed circuits and systems using a library of circuit primitives implemented using Actel field programmable gate arrays (FPGAs). The library modules use a two-phase transition signaling protocol for control signals and a bundled protocol for data signals. A first-in first-out (FIFO) buffer and a simple routing chip are presented as examples of building self-timed circuits using FPGAs.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2009

TRaX: A Multicore Hardware Architecture for Real-Time Ray Tracing

Josef B. Spjut; Andrew E. Kensler; Daniel Kopta; Erik Brunvand

Threaded Ray eXecution (TRaX) is a highly parallel multithreaded multicore processor architecture designed for real-time ray tracing. The TRaX architecture consists of a set of thread processors that include commonly used functional units (FUs) for each thread and that share larger FUs through a programmable interconnect. The memory system takes advantage of the applications read-only access to the scene database and write-only access to the frame buffer output to provide efficient data delivery with a relatively simple memory system. One specific motivation behind TRaX is to accelerate single-ray performance instead of relying on ray packets in single-instruction-multiple-data mode to boost throughput, which can fail as packets become incoherent with respect to the objects in the scene database. In this paper, we describe the TRaX architecture and our performance results compared to other architectures used for ray tracing. Simulated results indicate that a multicore version of the TRaX architecture running at a modest speed of 500 MHz provides real-time ray-traced images for scenes of a complexity found in video games. We also measure performance as secondary rays become less coherent and find that TRaX exhibits only minor slowdown in this case while packet-based ray tracers show more significant slowdown.

international conference on computer design | 1994

Performance analysis and optimization of asynchronous circuits

Prabhakar Kudva; Ganesh Gopalakrishnan; Erik Brunvand; Venkatesh Akella

Asynchronous/self-timed circuits are beginning to attract renewed attention as a promising means of dealing with the complexity of modern VLSI designs. Very few analysis techniques or tools are available for estimating their performance. We adapt the theory of generalized timed Petri-nets (GTPN) for analyzing and comparing asynchronous circuits ranging from purely control-oriented circuits to those with data dependent control. Experiments with the GTPN analyzer are found to track the observed performance of actual asynchronous circuits, thereby offering empirical evidence towards the soundness of the modeling approach.<<ETX>>

interactive 3d graphics and games | 2012

Fast, effective BVH updates for animated scenes

Daniel Kopta; Thiago Ize; Josef B. Spjut; Erik Brunvand; Al Davis; Andrew E. Kensler

Bounding volume hierarchies (BVHs) are a popular acceleration structure choice for animated scenes rendered with ray tracing. This is due to the relative simplicity of refitting bounding volumes around moving geometry. However, the quality of such a refitted tree can degrade rapidly if objects in the scene deform or rearrange significantly as the animation progresses, resulting in dramatic increases in rendering times and a commensurate reduction in the frame rate. The BVH could be rebuilt on every frame, but this could take significant time. We present a method to efficiently extend refitting for animated scenes with tree rotations, a technique previously proposed for off-line improvement of BVH quality for static scenes. Tree rotations are local restructuring operations which can mitigate the effects that moving primitives have on BVH quality by rearranging nodes in the tree during each refit rather than triggering a full rebuild. The result is a fast, lightweight, incremental update algorithm that requires negligible memory, has minor update times, parallelizes easily, avoids significant degradation in tree quality or the need for rebuilding, and maintains fast rendering times. We show that our method approaches or exceeds the frame rates of other techniques and is consistently among the best options regardless of the animated scene.

international conference on computer design | 2010

Efficient MIMD architectures for high-performance ray tracing

Daniel Kopta; Josef B. Spjut; Erik Brunvand; Al Davis

Ray tracing efficiently models complex illumination effects to improve visual realism in computer graphics. Typical modern GPUs use wide SIMD processing, and have achieved impressive performance for a variety of graphics processing including ray tracing. However, SIMD efficiency can be reduced due to the divergent branching and memory access patterns that are common in ray tracing codes. This paper explores an alternative approach using MIMD processing cores custom-designed for ray tracing. By relaxing the requirement that instruction paths be synchronized as in SIMD, caches and less frequently used area expensive functional units may be more effectively shared. Heavy resource sharing provides significant area savings while still maintaining a high MIMD issue rate from our numerous light-weight cores. This paper explores the design space of this architecture and compares performance to the best reported results for a GPU ray tracer and a parallel ray tracer using general purpose cores. We show an overall performance that is six to ten times higher in a similar die area.

international symposium on advanced research in asynchronous circuits and systems | 1996

Fred: an architecture for a self-timed decoupled computer

William F. Richardson; Erik Brunvand

Decoupled computer architectures provide an effective means of exploiting instruction level parallelism. Self-timed micropipeline systems are inherently decoupled due to the elastic nature of the basic FIFO structure, and may be ideally suited for constructing decoupled computer architectures. Fred is a self-timed decoupled, pipelined computer architecture based on micropipelines. We present the architecture of Fred, with specific details on a micropipelined implementation that includes support for multiple functional units and out-of-order instruction completion due to the self-timed decoupling.

Explore More