Publications


Featured research published by Wilson W. L. Fung.


International Symposium on Performance Analysis of Systems and Software | 2009

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda; George L. Yuan; Wilson W. L. Fung; Henry Wong; Tor M. Aamodt

Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
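
The memory request coalescing behavior characterized above is visible from software with a pair of kernels. The sketch below is a hypothetical microbenchmark of our own, not one of the paper's twelve workloads: both kernels copy the same data, but adjacent threads in the first touch adjacent addresses (coalescible into wide transactions), while adjacent threads in the second touch addresses far apart.

```cuda
#include <cuda_runtime.h>

// Adjacent threads read/write adjacent words: requests from one warp
// coalesce into a few wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Same copy via a permuted index: adjacent threads touch words n/stride
// apart, so each lane tends to generate its own memory transaction.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i % stride) * (n / stride) + (i / stride);  // requires stride | n
        out[j] = in[j];
    }
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    const int block = 256;
    copy_coalesced<<<(n + block - 1) / block, block>>>(in, out, n);
    copy_strided<<<(n + block - 1) / block, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```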


International Symposium on Microarchitecture | 2007

Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Wilson W. L. Fung; Ivan Sham; George L. Yuan; Tor M. Aamodt

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead incurred by control hardware. Scalar threads are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One approach is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. With this approach, the occurrence of diverging branch outcomes for different processing elements significantly degrades performance. In this paper, we explore mechanisms for more efficient SIMD branch execution on GPUs. We show that a realistic hardware implementation that dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes improves performance by an average of 20.7% for an estimated area increase of 4.7%.
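
For reference, the divergence this mechanism targets is easy to provoke from software. The toy kernel below (ours, not from the paper) splits each 32-thread warp at a data-dependent branch; under conventional stack-based reconvergence the two paths execute serially with part of the SIMD lanes masked off, which is exactly the idle-lane time dynamic warp formation reclaims.

```cuda
// Data-dependent branch: with mixed odd/even inputs, every warp diverges
// and the two paths are serialized on a stack-based SIMD pipeline.
__global__ void divergent_kernel(const int* data, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (data[i] & 1)
        out[i] = 3 * data[i] + 1;   // taken lanes execute here...
    else
        out[i] = data[i] / 2;       // ...then the remaining lanes execute here
}
```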


High-Performance Computer Architecture | 2011

Thread block compaction for efficient SIMT control flow

Wilson W. L. Fung; Tor M. Aamodt

Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple-data “cores” to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) by NVIDIA. While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, this approach incurs decreased efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% versus dynamic warp formation on a set of CUDA applications that suffer significantly from control flow divergence.
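
The nested, data-dependent control flow that hurts a per-warp stack looks like the hypothetical kernel below (our illustration, not a benchmark from the paper): the inner branch fragments already-diverged warps further, while other warps in the same thread block may hold threads following the same path that a block-wide stack can compact together.

```cuda
// Two levels of data-dependent branching: each level can fragment a warp
// further under per-warp stack reconvergence, yet threads across the whole
// thread block that end up on the same path could run in compacted warps.
__global__ void classify(const float* x, int* bucket, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) {
        if (x[i] > 1.0f)          // nested divergence
            bucket[i] = 2;
        else
            bucket[i] = 1;
    } else {
        bucket[i] = 0;
    }
}
```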


High-Performance Computer Architecture | 2013

Cache coherence for GPU architectures

Inderpreet Singh; Arrvindh Shriraman; Wilson W. L. Fung; Mike O'Connor; Tor M. Aamodt

While scalable coherence has been extensively studied in the context of general-purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing GPU applications. Moreover, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC) [34, 54], explored the use of time-based approaches in CMP coherence protocols. This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present an implementation of TC, called TC-Weak, which eliminates LCC's trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also find that write-through protocols outperform a write-back protocol on a GPU, as the latter suffers from increased traffic due to unnecessary refills of write-once data.
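
A minimal sketch of the time-based idea follows, with structure and names of our own choosing rather than the paper's hardware: each L1 block carries a lease taken from the globally synchronized counter, and a lookup past the lease is treated as a miss, so stale copies vanish without any invalidation messages.

```cuda
// Simulator-style sketch (hypothetical types and fields): time-based
// self-invalidation in place of explicit invalidation traffic.
struct CacheBlock {
    unsigned long long tag;
    unsigned long long lease_expiry;  // globally synchronized time at which this copy expires
    bool valid;
};

// A hit requires a live lease; an expired block silently self-invalidates,
// which is what removes invalidation messages and the protocol races they create.
bool l1_hit(const CacheBlock& b, unsigned long long tag, unsigned long long now) {
    return b.valid && b.tag == tag && now < b.lease_expiry;
}
```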


International Symposium on Microarchitecture | 2011

Hardware transactional memory for GPU architectures

Wilson W. L. Fung; Inderpreet Singh; Andrew Brownsword; Tor M. Aamodt

Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively small set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations on single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128× faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
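
The abstract's starting point, fine-grained locking built from word-sized atomics, looks like the CUDA sketch below (ours; the paper's TM programming interface is not reproduced here). Under TM, the same critical section would instead be bracketed by transaction begin/commit markers, and KILO TM would commit non-conflicting updates in parallel via value-based validation.

```cuda
// Per-bucket spin lock built from atomicCAS. The done-flag idiom matters
// under SIMT: a bare spin loop in which one lane of a warp holds the lock
// that other lanes of the same warp are waiting on can hang in lockstep.
__device__ void locked_add(int* locks, float* table, int key, float val) {
    bool done = false;
    while (!done) {
        if (atomicCAS(&locks[key], 0, 1) == 0) {  // acquire
            table[key] += val;                    // critical section
            __threadfence();                      // publish the update
            atomicExch(&locks[key], 0);           // release
            done = true;
        }
    }
}
```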


ACM Transactions on Architecture and Code Optimization | 2009

Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

Wilson W. L. Fung; Ivan Sham; George L. Yuan; Tor M. Aamodt

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture, and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%.
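
The regrouping step itself can be sketched in a few lines of host code (our simplification, with hypothetical names; real hardware must additionally respect constraints such as register-file lane placement, which are omitted here): threads waiting at the same next PC are packed into fresh warps regardless of which warp they started in.

```cuda
#include <algorithm>
#include <map>
#include <vector>

// Host-side sketch of dynamic warp formation's core step: bucket ready
// threads by their next PC, then pack each bucket into full warps.
std::vector<std::vector<int>> form_warps(const std::vector<unsigned>& next_pc,
                                         size_t warp_size) {
    std::map<unsigned, std::vector<int>> by_pc;  // next PC -> ready thread IDs
    for (size_t tid = 0; tid < next_pc.size(); ++tid)
        by_pc[next_pc[tid]].push_back((int)tid);

    std::vector<std::vector<int>> warps;
    for (const auto& entry : by_pc) {
        const std::vector<int>& tids = entry.second;
        for (size_t i = 0; i < tids.size(); i += warp_size) {
            size_t end = std::min(i + warp_size, tids.size());
            warps.emplace_back(tids.begin() + i, tids.begin() + end);
        }
    }
    return warps;
}
```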


International Symposium on Performance Analysis of Systems and Software | 2010

Visualizing complex dynamics in many-core accelerator architectures

Aaron Ariel; Wilson W. L. Fung; Andrew E. Turner; Tor M. Aamodt

While many-core accelerator architectures, such as todays Graphics Processing Units (GPUs), offer orders of magnitude more raw computing power than contemporary CPUs, their massive parallelism often produces complex dynamic behaviors even with the simplest applications. Using a fixed set of hardware or simulator performance counters to quantify behavior over a large interval of time such as an entire application execution run or program phase may not capture this behavior. Software and/or hardware designers may consequently miss out on opportunities to optimize for better performance. Similarly, significant effort may be expended to find metrics that explain anomalous behavior in architecture design studies. Moreover, the increasing complexity of applications developed for todays GPU has created additional difficulties for software developers when attempting to identify bottlenecks of an application for optimization. This paper presents a novel GPU performance visualization tool, AerialVision, to address these two problems. It interfaces with the GPGPU-Sim simulator to capture and visualize the dynamic behavior of a GPU architecture throughout an application run. Similar to existing performance analysis tools for CPUs, it can annotate individual lines of source code with performance statistics to simplify the bottleneck identification process. To provide further insight, AerialVision introduces a novel methodology to relate pathological dynamic architectural behaviors resulting in performance loss with the part of the source code that is responsible. By rapidly providing insight into complex dynamic behavior, AerialVision enables research on improving many-core accelerator architectures and will help ensure applications written for these architectures reach their full performance potential.
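
Line-level annotation of the kind described above amounts to attributing sampled statistics back to source locations. The sketch below is a generic rendering of that bookkeeping under our own names, not AerialVision's actual implementation: each simulator sample carries the source location its PC maps to, and a chosen metric is totaled per line for display beside the source text.

```cuda
#include <map>
#include <string>
#include <utility>

// (file, line number) -> accumulated metric, e.g., stall cycles.
using LineKey = std::pair<std::string, int>;

// Attribute one sample to the source line its PC resolves to; the totals
// are later printed next to each line of the kernel source.
void add_sample(std::map<LineKey, long long>& per_line,
                const std::string& file, int line, long long stall_cycles) {
    per_line[{file, line}] += stall_cycles;
}
```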


International Symposium on Microarchitecture | 2013

Energy efficient GPU transactional memory via space-time optimizations

Wilson W. L. Fung; Tor M. Aamodt

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer. One major, non-trivial effort/risk is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks via fine-grained data synchronization. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. One hardware proposal, Kilo TM, can scale to 1000s of concurrent transactions. However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs. In this paper, we analyze the performance and energy efficiency of Kilo TM and propose two enhancements: (1) Warp-level transaction management allows transactions within a warp to be managed as a group. This aggregates protocol messages to reduce communication overhead and captures spatial locality from multiple transactions to increase memory subsystem utilization. (2) Temporal conflict detection uses globally synchronized timers to detect conflicts in read-only transactions with low overhead. Our evaluation shows that combining the two enhancements can improve the overall performance and energy efficiency of Kilo TM by 65% and 34%, respectively. Kilo TM with these two enhancements achieves 66% of the performance of fine-grained locking with 34% energy overhead.
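
Temporal conflict detection for read-only transactions can be sketched as below (our paraphrase, with invented names and metadata layout): given globally synchronized timers, a transaction that began at time T and only read can commit if no word in its read set was written at or after T.

```cuda
// Hypothetical per-word metadata: last write time, driven by the globally
// synchronized timer the paper relies on.
struct WordMeta { unsigned long long last_write_time; };

// A read-only transaction that began at start_time commits cheaply if
// everything it read predates it; otherwise it must abort and retry.
bool readonly_commit_ok(const WordMeta* meta, const int* read_set,
                        int n_reads, unsigned long long start_time) {
    for (int k = 0; k < n_reads; ++k)
        if (meta[read_set[k]].last_write_time >= start_time)
            return false;  // a word changed during the transaction window
    return true;
}
```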


Architectural Support for Programming Languages and Operating Systems | 2013

GPUDet: a deterministic GPU architecture

Hadi Jooybar; Wilson W. L. Fung; Mike O'Connor; Joseph Devietti; Tor M. Aamodt

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only a 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
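
The nondeterminism being addressed shows up even in a one-line kernel. In the sketch below (ours), the order in which threads win the atomicAdd is not fixed across runs, so slot assignments, and anything derived from them, can change from execution to execution on the same input; GPUDet's goal is to make such outcomes repeatable.

```cuda
// Each thread claims the next free slot. The GPU makes no ordering
// guarantee among concurrent atomics, so slot[i] differs run to run.
__global__ void claim_slots(int* counter, int* slot, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        slot[i] = atomicAdd(counter, 1);
}
```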


Medical Image Computing and Computer-Assisted Intervention | 2003

PUPIL: Programmable Ultrasound Platform and Interface Library

Robert Rohling; Wilson W. L. Fung; Pedram Lajevardi

A new programmable ultrasound machine and software interface are described. The software interface takes advantage of the open architecture of the new machine to provide real-time access to the digital image formation pipeline and control of the acquisition parameters. The first application of the system seeks to enhance the visibility of a needle in an image-guided procedure. The enhancement algorithm detects the needle in an ultrasound image and automatically steers the ultrasound beam in the perpendicular direction. This direction produces the strongest echoes and raises the contrast of the needle in subsequent images. The results show improved visibility of a needle in both phantoms and real tissue. The results also demonstrate the flexibility and performance of the system and its suitability for a wide range of research.
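
The steering rule in the enhancement algorithm reduces to simple geometry. The host-side sketch below is our own toy rendering of that step (needle detection itself is omitted, and the angle conventions and steerable range are assumptions): given the detected needle angle in the image, the beam is aimed perpendicular to the needle axis so that the strong specular echoes return toward the transducer.

```cuda
// Toy steering computation: angles in degrees, measured in the image plane
// (conventions assumed for illustration only).
float perpendicular_steer_deg(float needle_angle_deg) {
    float steer = needle_angle_deg - 90.0f;  // normal to the needle axis
    while (steer <= -90.0f) steer += 180.0f; // fold into an assumed steerable range
    while (steer >   90.0f) steer -= 180.0f;
    return steer;
}
```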

Collaboration


Dive into Wilson W. L. Fung's collaborations. Top co-authors and their affiliations:

Tor M. Aamodt, University of British Columbia
Inderpreet Singh, University of British Columbia
George L. Yuan, University of British Columbia
Ivan Sham, University of British Columbia
Aaron Ariel, University of British Columbia
Ali Bakhoda, University of British Columbia
Andrew E. Turner, University of British Columbia