Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Andrew A. Davidson is active.

Publications


Featured research published by Andrew A. Davidson.


International Parallel and Distributed Processing Symposium | 2014

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths

Andrew A. Davidson; Sean Baxter; Michael Garland; John D. Owens

Finding the shortest paths from a single source to all other vertices is a fundamental method used in a variety of higher-level graph algorithms. We present three parallel-friendly and work-efficient methods to solve this Single-Source Shortest Paths (SSSP) problem: Workfront Sweep, Near-Far, and Bucketing. These methods choose different approaches to balance the trade-off between saving work and organizational overhead. In practice, all of these methods do much less work than traditional Bellman-Ford methods, while adding only a modest amount of extra work over serial methods. These methods are designed to have a sufficient parallel workload to fill modern massively parallel machines, and select reorganizational schemes that map well to these architectures. We show that in general our Near-Far method has the highest performance on modern GPUs, outperforming other parallel methods. We also explore a variety of parallel load-balanced graph-traversal strategies and apply them to our SSSP solver. Our work-saving methods always outperform a traditional GPU Bellman-Ford implementation, achieving rates up to 14x higher on low-degree graphs and 340x higher on scale-free graphs. We also see significant speedups (20-60x) when compared against a serial implementation on graphs with sufficiently high degree.
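The Near-Far strategy above can be illustrated with a small sequential sketch: vertices whose tentative distance falls below a moving threshold go in a "near" pile and are relaxed immediately, while the rest wait in a "far" pile until the threshold advances. This is an assumption-laden Python illustration of the worklist idea, not the authors' GPU code; the function name and the `delta` increment are mine.

```python
# Sequential sketch of the Near-Far worklist idea for SSSP.
# graph: {u: [(v, weight), ...]}; delta: threshold increment (hypothetical knob).
def near_far_sssp(graph, source, delta):
    dist = {u: float('inf') for u in graph}
    dist[source] = 0.0
    threshold = delta
    near, far = [source], []
    while near or far:
        while near:                      # relax everything currently "near"
            u = near.pop()
            for v, w in graph[u]:
                nd = dist[u] + w
                if nd < dist[v]:
                    dist[v] = nd
                    (near if nd < threshold else far).append(v)
        threshold += delta               # advance the threshold ...
        near = [v for v in far if dist[v] < threshold]   # ... and re-split
        far = [v for v in far if dist[v] >= threshold]
    return dist

g = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: []}
d = near_far_sssp(g, 0, 2.0)
```

Duplicate entries in the piles are harmless for correctness; the GPU versions additionally deduplicate and load-balance this work.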


International Conference on Computer Graphics and Interactive Techniques | 2011

Efficient maximal Poisson-disk sampling

Mohamed S. Ebeida; Andrew A. Davidson; Anjul Patney; Patrick M. Knupp; Scott A. Mitchell; John D. Owens

We solve the problem of generating a uniform Poisson-disk sampling that is both maximal and unbiased over bounded non-convex domains. To our knowledge this is the first provably correct algorithm with time and space dependent only on the number of points produced. Our method has two phases, both based on classical dart-throwing. The first phase uses a background grid of square cells to rapidly create an unbiased, near-maximal covering of the domain. The second phase completes the maximal covering by calculating the connected components of the remaining uncovered voids, and by using their geometry to efficiently place unbiased samples that cover them. The second phase converges quickly, overcoming a common difficulty in dart-throwing methods. The deterministic memory is O(n) and the expected running time is O(n log n), where n is the output size, the number of points in the final sample. Our serial implementation verifies that the log n dependence is minor, and nearly O(n) performance for both time and memory is achieved in practice. We also present a parallel implementation on GPUs to demonstrate the parallel-friendly nature of our method, which achieves 2.4x the performance of our serial version.
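The first phase's grid-accelerated dart throwing can be sketched as follows. This is a minimal illustration of classical dart throwing with a background grid (the paper's phase two, which completes maximality by meshing the uncovered voids, is omitted); the function name and attempt budget are mine. With cell side r/sqrt(2), each cell holds at most one sample, so a conflict check only needs the 5x5 cell neighborhood.

```python
import math, random

# Grid-accelerated dart throwing: accept a random candidate only if no
# previously accepted sample lies within distance r of it.
def dart_throw(r, width, height, attempts=20000, seed=1):
    random.seed(seed)
    cell = r / math.sqrt(2)                       # at most one sample per cell
    cols, rows = int(width / cell) + 1, int(height / cell) + 1
    grid = [[None] * cols for _ in range(rows)]
    samples = []
    for _ in range(attempts):
        x, y = random.uniform(0, width), random.uniform(0, height)
        ci, cj = int(x / cell), int(y / cell)
        ok = True
        for j in range(max(0, cj - 2), min(rows, cj + 3)):
            for i in range(max(0, ci - 2), min(cols, ci + 3)):
                p = grid[j][i]
                if p and (p[0] - x) ** 2 + (p[1] - y) ** 2 < r * r:
                    ok = False
        if ok:
            grid[cj][ci] = (x, y)
            samples.append((x, y))
    return samples

pts = dart_throw(0.1, 1.0, 1.0)
```

As the abstract notes, pure dart throwing like this converges slowly near maximality; the paper's second phase targets the remaining voids directly to finish the covering.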


Computer Graphics Forum | 2012

A Simple Algorithm for Maximal Poisson-Disk Sampling in High Dimensions

Mohamed S. Ebeida; Scott A. Mitchell; Anjul Patney; Andrew A. Davidson; John D. Owens

We provide a simple algorithm and data structures for d-dimensional unbiased maximal Poisson-disk sampling. We use an order of magnitude less memory and time than the alternatives. Our results become more favorable as the dimension increases. This allows us to produce bigger samplings. Domains may be non-convex with holes. The generated point cloud is maximal up to round-off error. The serial algorithm is provably bias-free. For an output sampling of size n in fixed dimension d, we use a linear memory budget and empirical Θ(n) runtime. No known methods scale well with dimension, due to the "curse of dimensionality." The serial algorithm is practical in dimensions up to 5, and has been demonstrated in 6d. We have efficient GPU implementations in 2d and 3d. The algorithm proceeds through a finite sequence of uniform grids. The grids guide the dart throwing and track the remaining disk-free area. The top-level grid provides an efficient way to test if a candidate dart is disk-free. Our uniform grids are like quadtrees, except we delay splits and refine all leaves at once. Since the quadtree is flat, it can be represented using very little memory: we just need the indices of the active leaves and a global level. It is also very simple to sample from leaves with uniform probability.
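The "flat quadtree" bookkeeping described above can be sketched in a few lines: store only the integer indices of the active leaves plus one global level, and refine every active leaf at once. This 2-d Python illustration (function name mine) shows the representation, not the paper's sampling logic.

```python
# Flat-quadtree refinement: each leaf (i, j) at `level` splits into four
# children at level + 1; only active-leaf indices and the level are kept.
def refine(active, level):
    children = []
    for (i, j) in active:
        for di in (0, 1):
            for dj in (0, 1):
                children.append((2 * i + di, 2 * j + dj))
    return children, level + 1

leaves, lvl = [(0, 0)], 0
leaves, lvl = refine(leaves, lvl)   # 4 leaves at level 1
leaves, lvl = refine(leaves, lvl)   # 16 leaves at level 2
```

In the actual algorithm a leaf is dropped before refinement when it is already covered by a disk, so the active set tracks only the remaining disk-free area; sampling a uniform point is then just picking a leaf index uniformly and a point inside it.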


International Parallel and Distributed Processing Symposium | 2011

An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

Andrew A. Davidson; Yao Zhang; John D. Owens

We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently due to the limited size of on-chip shared memory. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage character of our method, together with various workloads and GPUs of different capabilities, calls for an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves performance by up to 5x, saves 17% and 32% of execution time on average over static and dynamic tuning respectively, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6-11x.
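The small on-chip subsystems that the splitting strategy produces can each be solved with a direct tridiagonal solver. As a reference point, here is the classic Thomas algorithm (serial Gaussian elimination specialized to tridiagonal systems); this is an illustrative baseline, not the authors' GPU kernels, and the function name is mine.

```python
# Thomas algorithm: forward elimination then back substitution.
# a: sub-diagonal (a[0] unused), b: main diagonal,
# c: super-diagonal (c[-1] unused), d: right-hand side.
def thomas(a, b, c, d):
    n = len(b)
    cp, dp = c[:], d[:]
    cp[0] /= b[0]
    dp[0] /= b[0]
    for i in range(1, n):                      # forward sweep
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = dp[:]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] -= cp[i] * x[i + 1]
    return x

# 2x + y = 3;  x + 2y + z = 4;  y + 2z = 3   ->  x = y = z = 1
x = thomas([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [3.0, 4.0, 3.0])
```

Thomas is work-efficient but inherently serial per system, which is why GPU methods like the one above combine it with parallel reduction schemes and tune the switch points between stages.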


Parallel Computing | 2010

Toward techniques for auto-tuning GPU algorithms

Andrew A. Davidson; John D. Owens

We introduce a variety of techniques for auto-tuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of the hardware architecture and attempt to select near-optimal parameters. We work toward a general framework for creating auto-tuned data-parallel algorithms, applying these techniques to common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm exhibits, and then on reducing the tuning model based on these observed patterns.
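One way to see why reducing the tuning model matters: if parameters can be treated as roughly independent, they can be tuned one at a time instead of exhaustively over every combination, shrinking the search from the product to the sum of the candidate counts. This toy Python sketch (function names and the cost model are mine, purely illustrative) shows that coordinate-style search.

```python
# Tune parameters one at a time, holding the others fixed, instead of
# trying every combination in the full configuration space.
def tune(cost, space):
    # space: {param_name: [candidate values]}; cost: config dict -> float
    config = {p: vals[0] for p, vals in space.items()}
    for p, vals in space.items():
        config[p] = min(vals, key=lambda v: cost({**config, p: v}))
    return config

# Toy stand-in for a kernel timing run: pretend 128 threads and a
# tile size of 4 are optimal (hypothetical numbers).
cost = lambda c: abs(c['threads'] - 128) + abs(c['tile'] - 4)
best = tune(cost, {'threads': [32, 64, 128, 256], 'tile': [1, 2, 4, 8]})
```

Here the search evaluates 8 configurations instead of 16; with more parameters the savings compound, which is the point of decoupling.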


Computer-Aided Design | 2011

Efficient and good Delaunay meshes from random points

Mohamed S. Ebeida; Scott A. Mitchell; Andrew A. Davidson; Anjul Patney; Patrick M. Knupp; John D. Owens

We present a Conforming Delaunay Triangulation (CDT) algorithm based on maximal Poisson-disk sampling. Points are unbiased, meaning the probability of introducing a vertex in a disk-free subregion is proportional to its area, except in a neighborhood of the domain boundary. In contrast, Delaunay refinement CDT algorithms place points dependent on the geometry of empty circles in intermediate triangulations, usually near the circle centers. Unconstrained angles in our mesh are between 30° and 120°, matching some biased CDT methods. Points are placed on the boundary using a one-dimensional maximal Poisson-disk sampling. Any triangulation method producing angles bounded away from 0° and 180° must have some bias near the domain boundary to avoid placing vertices infinitesimally close to the boundary. Random meshes are preferred for some simulations, such as fracture simulations where cracks must follow mesh edges, because deterministic meshes may introduce non-physical phenomena. An ensemble of random meshes aids simulation validation. Poisson-disk triangulations also avoid some graphics rendering artifacts, and have the blue-noise property. We mesh two-dimensional domains that may be non-convex with holes, required points, and multiple regions in contact. Our algorithm is also fast and uses little memory. We have recently developed a method for generating a maximal Poisson distribution of n output points, where n = Θ(Area/r²) and r is the sampling radius. It takes O(n) memory and O(n log n) expected time; in practice the time is nearly linear. This, or a similar subroutine, generates our random points. Except for this subroutine, we provably use O(n) time and space. The subroutine gives the location of points in a square background mesh. Given this, the neighborhood of each point can be meshed independently in constant time. These features facilitate parallel and GPU implementations.
Our implementation works well in practice, as illustrated by several examples and a comparison to Triangle. Highlights: a conforming Delaunay triangulation algorithm based on maximal Poisson-disk sampling; angles between 30° and 120°; two-dimensional non-convex domains with holes and planar straight-line graphs; O(n) space and O(n log n) expected time, efficient in practice; background squares ensure all computations are local.
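The Delaunay property underlying a CDT is the empty-circumcircle condition, which is checked with a standard determinant predicate: for a counter-clockwise triangle (a, b, c), point d lies inside the circumcircle iff the 3x3 determinant below is positive. This is a textbook sketch (in exact arithmetic; robust implementations use adaptive-precision predicates), not code from the paper.

```python
# In-circle predicate: rows are (px - dx, py - dy, |p - d|^2) for p in a, b, c.
# Positive determinant => d is inside the circumcircle of CCW triangle (a, b, c).
def in_circle(a, b, c, d):
    m = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        m.append((dx, dy, dx * dx + dy * dy))
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])) > 0

# Unit right triangle (CCW); its circumcircle is centered at (0.5, 0.5).
inside = in_circle((0, 0), (1, 0), (0, 1), (0.5, 0.5))
outside = in_circle((0, 0), (1, 0), (0, 1), (2, 2))
```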


High Performance Graphics | 2012

High-quality parallel depth-of-field using line samples

Stanley Tzeng; Anjul Patney; Andrew A. Davidson; Mohamed S. Ebeida; Scott A. Mitchell; John D. Owens

We present a parallel method for rendering high-quality depth-of-field effects using continuous-domain line samples, and demonstrate its high performance on commodity GPUs. Our method runs at interactive rates and has very low noise. Our exploration of the problem carefully considers implementation alternatives, and transforms an originally unbounded storage requirement to a small fixed requirement using heuristics to maintain quality. We also propose a novel blur-dependent level-of-detail scheme that helps accelerate rendering without undesirable artifacts. Our method consistently runs 4 to 5x faster than an equivalent point sampler with better image quality. Our method draws parallels to related work in rendering multi-fragment effects.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2016

GPU multisplit

Saman Ashkiani; Andrew A. Davidson; Ulrich Meyer; John D. Owens

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
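The multisplit primitive itself is easy to state sequentially: count per-bucket sizes, prefix-sum them into offsets, then scatter each key to the next free slot of its bucket, preserving input order within a bucket. This CPU sketch (function names mine) illustrates the semantics; the paper's contribution is making each of these passes fast and hierarchical on the GPU.

```python
# Stable multisplit via counting + exclusive prefix sum + scatter.
# bucket_of: programmer-supplied function mapping a key to a bucket id.
def multisplit(keys, num_buckets, bucket_of):
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1
    offsets, total = [], 0
    for c in counts:                 # exclusive prefix sum of bucket sizes
        offsets.append(total)
        total += c
    out = [None] * len(keys)
    for k in keys:                   # stable scatter into bucket slots
        b = bucket_of(k)
        out[offsets[b]] = k
        offsets[b] += 1
    return out

res = multisplit([5, 2, 8, 1, 7, 4], 2, lambda k: k % 2)  # evens, then odds
```

Note that a sort would also produce contiguous buckets, but it additionally orders keys within each bucket, which is exactly the wasted work the abstract refers to.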


ACM Transactions on Graphics | 2014

k-d Darts: Sampling by k-dimensional flat searches

Mohamed S. Ebeida; Anjul Patney; Scott A. Mitchell; Keith R. Dalbey; Andrew A. Davidson; John D. Owens

We formalize sampling a function using k-d darts. A k-d dart is a set of independent, mutually orthogonal, k-dimensional hyperplanes called k-d flats. A dart has d choose k flats, aligned with the coordinate axes for efficiency. We show k-d darts are useful for exploring a function's properties, such as estimating its integral or finding an exemplar above a threshold. We describe a recipe for converting some algorithms from point sampling to k-d dart sampling, provided the function can be evaluated along a k-d flat. We demonstrate that k-d darts are more efficient than point-wise samples in high dimensions, depending on the characteristics of the domain: for example, when the subregion of interest has small volume and evaluating the function along a flat is not too expensive. We present three concrete applications using line darts (1-d darts): relaxed maximal Poisson-disk sampling, high-quality rasterization of depth-of-field blur, and estimation of the probability of failure from a response surface for uncertainty quantification. Line darts achieve the same output fidelity as point sampling in less time. For Poisson-disk sampling, we use less memory, enabling the generation of larger point distributions in higher dimensions. Higher-dimensional darts provide greater accuracy for a particular volume-estimation problem.
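A line dart (1-d dart) can be illustrated with a toy volume estimate: instead of testing random points against the unit disk, throw random axis-aligned lines and accumulate each line's exact chord length through the disk. This Python sketch (function name and parameters mine) shows the idea for a shape whose per-line integral is known in closed form; in the paper's applications the function is evaluated along the flat rather than analytically.

```python
import random, math

# Estimate the area of the unit disk inside [-1, 1]^2 with vertical line darts:
# each dart at abscissa x contributes its chord length 2 * sqrt(1 - x^2).
def disk_area_line_darts(n, seed=7):
    random.seed(seed)
    total = 0.0
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)                        # throw a line dart
        total += 2.0 * math.sqrt(max(0.0, 1.0 - x * x))      # exact chord length
    return (total / n) * 2.0                                 # mean chord * width

est = disk_area_line_darts(200000)   # should approach pi
```

Each dart captures a full 1-d slice of information, so the estimator's variance per sample is lower than that of a point-in-disk indicator, which is the efficiency argument made above.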


GPU Computing Gems Jade Edition | 2012

A Hybrid Method for Solving Tridiagonal Systems on the GPU

Yao Zhang; Jonathan Cohen; Andrew A. Davidson; John D. Owens

Tridiagonal linear systems are important in many problems in numerical analysis and computational fluid dynamics, as well as in computer graphics applications in video games and computer-animated films. Typical applications require solving hundreds or thousands of tridiagonal systems, which takes the majority of the total computation time. Fast parallel solvers are critical to large scientific simulations, interactive computation of special effects in films, and real-time applications in video games. This chapter describes the performance of multiple tridiagonal algorithms on a graphics processing unit (GPU). It presents a novel hybrid algorithm that combines a work-efficient algorithm with a step-efficient algorithm in a way well suited to the GPU architecture. The hybrid solver achieves 8x and 2x speedups in single and double precision, respectively, over a multithreaded, highly optimized CPU solver, and a 2x-2.3x speedup over a basic GPU solver. Future work includes handling non-power-of-two system sizes, effectively supporting system sizes larger than 1024, and designing solutions that can partially take advantage of shared memory even when the entire system cannot fit into it.
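A common work-efficient building block for parallel tridiagonal solvers of this kind is cyclic reduction: each step eliminates the odd-indexed unknowns, halving the system, and a backward phase substitutes the values back in. The inner loops at each stride are independent, which is what maps to GPU threads. This is a sequential Python sketch of textbook cyclic reduction for systems of size 2^k - 1 (function name mine), not the chapter's hybrid solver.

```python
# Cyclic reduction for a tridiagonal system of size n = 2**k - 1.
# a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal, d: RHS.
def cyclic_reduction(a, b, c, d):
    n = len(b)
    a, b, c, d = a[:], b[:], c[:], d[:]
    s = 1
    while 2 * s <= n:                      # forward phase: eliminate odd levels
        for i in range(2 * s - 1, n, 2 * s):   # these updates are independent
            al = -a[i] / b[i - s]
            ga = -c[i] / b[i + s] if i + s < n else 0.0
            b[i] += al * c[i - s] + (ga * a[i + s] if i + s < n else 0.0)
            d[i] += al * d[i - s] + (ga * d[i + s] if i + s < n else 0.0)
            a[i] = al * a[i - s]
            c[i] = ga * c[i + s] if i + s < n else 0.0
        s *= 2
    x = [0.0] * n
    while s >= 1:                          # backward phase: substitute
        for i in range(s - 1, n, 2 * s):
            xl = x[i - s] if i - s >= 0 else 0.0
            xr = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i]
        s //= 2
    return x

# 2x + y = 3;  x + 2y + z = 4;  y + 2z = 3   ->  x = y = z = 1
x = cyclic_reduction([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [3.0, 4.0, 3.0])
```

Cyclic reduction does O(n) work in O(log n) steps but leaves threads idle in late steps; step-efficient variants like parallel cyclic reduction keep all threads busy at the cost of more work, and the hybrid described above switches between the two regimes.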

Collaboration


Dive into Andrew A. Davidson's collaborations.

Top Co-Authors

John D. Owens, University of California
Anjul Patney, University of California
Mohamed S. Ebeida, Carnegie Mellon University
Scott A. Mitchell, Sandia National Laboratories
Patrick M. Knupp, Sandia National Laboratories
Saman Ashkiani, University of California
Yao Zhang, University of California
Ulrich Meyer, Goethe University Frankfurt
Andy Riffel, University of California