Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David S. Wise is active.

Publication


Featured research published by David S. Wise.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Jeremy D. Frens; David S. Wise

An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against manufacturers' hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.
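
As a rough illustration of the idea (a minimal sketch, not the authors' code), the following C fragment multiplies square matrices by recursing on quadrants; the row-major layout, the power-of-two order n, and the scalar base case are simplifying assumptions introduced here.

#include <stddef.h>

/* Sketch of recursive C += A*B by quadrant decomposition.
   Assumptions (not from the paper): row-major storage with leading
   dimension ld, and order n a power of two, recursing to a 1x1 base case. */
static void mm_rec(size_t n, size_t ld,
                   const double *A, const double *B, double *C)
{
    if (n == 1) {                        /* base case: scalar multiply-add */
        *C += *A * *B;
        return;
    }
    size_t h = n / 2;
    for (size_t i = 0; i < 2; i++)       /* quadrant row of C    */
        for (size_t j = 0; j < 2; j++)   /* quadrant column of C */
            for (size_t k = 0; k < 2; k++)
                mm_rec(h, ld,
                       A + (i * h) * ld + k * h,   /* quadrant A(i,k) */
                       B + (k * h) * ld + j * h,   /* quadrant B(k,j) */
                       C + (i * h) * ld + j * h);  /* quadrant C(i,j) */
}

At the top level the four (i, j) quadrant updates of C touch disjoint regions of memory, which is the natural bifurcation into independent block-oriented processes mentioned above; a practical version would also unfold the base case into a larger, hand-tuned block rather than a single scalar.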


Archive | 1981

Compact Layouts of Banyan/FFT Networks

David S. Wise

A two-layer layout is presented for the crossover pattern that appears as the FFT signal-flow graph and in many switching networks such as the banyan, delta, or omega nets. It is characterized by constant path length and a regular pattern, providing uniform propagation delay and capacitance and ameliorating design problems for VLSI implementation. These are important issues, since path length grows linearly with the number of inputs to such networks, even though switching delay seems to grow only logarithmically.
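
For orientation, the crossover pattern in question can be stated combinatorially: in a banyan/FFT network on N = 2^n terminals, stage s pairs terminal i with terminal i XOR 2^s. The tiny C sketch below (illustrative only, not from the paper) prints these pairs and makes the linear growth of wire span visible.

#include <stdio.h>

/* Print the crossover pairs of each stage of a banyan/FFT network on
   N = 2^n terminals: stage s connects terminal i to i ^ (1 << s).
   The span of those wires doubles at each stage, reaching N/2 rows,
   which is the linear growth the layout has to cope with. */
int main(void)
{
    const unsigned n = 3, N = 1u << n;
    for (unsigned s = 0; s < n; s++) {
        printf("stage %u:", s);
        for (unsigned i = 0; i < N; i++)
            if (i < (i ^ (1u << s)))     /* print each pair once */
                printf("  %u-%u", i, i ^ (1u << s));
        printf("\n");
    }
    return 0;
}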


European Conference on Parallel Processing | 2000

Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free

David S. Wise

Definitions for the uniform representation of d-dimensional matrices serially in Morton order (or Z-order) support both their use with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. In the latter case, d-dimensional arrays are accessed as 2^d-ary trees. This data structure is important because, at once, it relaxes serious problems of locality and latency, and the tree helps schedule multiprocessing. It enables algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. This paper gathers the properties of Morton order and its mappings to other indexings, and outlines compiler support for it. Statistics elsewhere show that the new ordering and block algorithms achieve high flop rates and, indirectly, parallelism without any low-level tuning.
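
A minimal C sketch of ahnentafel (level-order) indexing of the quadtree follows, assuming the common convention that the root is node 1 and child q of node k is numbered 4k + q (the paper's exact conventions may differ). Descending by the leading bits of the row and column indices shows that a leaf's ahnentafel index is just its Morton offset with one leading level bit, which suggests how Morton-order locality comes along for free.

#include <stdint.h>
#include <stdio.h>

/* Ahnentafel (level-order) indexing of a quadtree over a 2^n x 2^n matrix,
   assuming root = 1 and child q of node k numbered 4k + q.  Descending by
   the high-order bits of (i, j) shows the leaf index is 4^n + morton(i, j):
   the Morton offset with a single leading level bit. */
static uint64_t ahnentafel(unsigned n, uint32_t i, uint32_t j)
{
    uint64_t k = 1;                              /* root */
    for (unsigned b = n; b-- > 0; ) {
        unsigned q = (((i >> b) & 1u) << 1) | ((j >> b) & 1u);
        k = 4 * k + q;                           /* step into quadrant q */
    }
    return k;
}

int main(void)
{
    unsigned n = 4;                              /* a 16 x 16 matrix */
    uint32_t i = 5, j = 9;
    uint64_t a = ahnentafel(n, i, j);
    printf("ahnentafel(%u,%u) = %llu\n", i, j, (unsigned long long)a);
    printf("Morton offset     = %llu\n",         /* strip the level bit */
           (unsigned long long)(a - (1ull << (2 * n))));
    return 0;
}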


International Conference on Supercomputing | 2007

Representation-transparent matrix algorithms with scalable performance

David S. Wise; Michael Adams

Positive results from new object-oriented tools for scientific programming are reported. Using template classes, abstractions of matrix representations are available that subsume conventional row-major and column-major order, either Z- or И-Morton order, as well as block-wise combinations of these. Moreover, the design of the Matrix Template Library (MTL) has been independently extended to provide recursators, supporting block-recursive algorithms and supplementing MTL's iterators. Data types modeling both concepts enable the programmer to implement iterative or recursive algorithms (or both) on all of the aforementioned matrix representations at once for a wide family of important scientific operations. We illustrate the unrestricted applicability of our matrix recursator on matrix multiplication. The same generic block-recursive function, unaltered, is instantiated on different triplets of matrix types. Within a base block, either a library multiplication or user-provided, platform-specific code provides excellent performance. We achieve 77% of peak performance using hand-tuned base cases without explicit prefetching. This excellent performance becomes available over a wide family of matrix representations from a single program. The techniques generalize to other applications in linear algebra.
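
The recursators described above are C++ template machinery; as a crude, language-neutral analogue (an assumption for illustration, not the paper's MTL-based API), the sketch below hides the element-address calculation behind a layout function so that one generic routine runs unchanged over row-major or Morton-order storage.

#include <stddef.h>
#include <stdio.h>

/* A layout function maps Cartesian (i, j) to a storage offset, so the same
   generic routine can be instantiated over row-major or Morton-order
   storage.  This is only a crude stand-in for the paper's template-based
   abstractions. */
typedef size_t (*layout_fn)(size_t i, size_t j, size_t n);

static size_t row_major(size_t i, size_t j, size_t n) { return i * n + j; }

static size_t morton(size_t i, size_t j, size_t n)
{
    (void)n;                         /* Morton offsets ignore the order n */
    size_t z = 0;
    for (unsigned b = 0; b < 8 * sizeof(size_t) / 2; b++)
        z |= (((i >> b) & 1u) << (2 * b + 1)) | (((j >> b) & 1u) << (2 * b));
    return z;
}

/* one generic routine, any representation: trace of an n x n matrix */
static double trace(const double *a, size_t n, layout_fn at)
{
    double t = 0.0;
    for (size_t i = 0; i < n; i++)
        t += a[at(i, i, n)];
    return t;
}

int main(void)
{
    double rm[16] = {0}, mo[16] = {0};           /* 4 x 4 identity, twice */
    for (size_t i = 0; i < 4; i++) {
        rm[row_major(i, i, 4)] = 1.0;
        mo[morton(i, i, 4)] = 1.0;
    }
    printf("trace (row-major) = %g, trace (Morton) = %g\n",
           trace(rm, 4, row_major), trace(mo, 4, morton));
    return 0;
}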


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2001

Language support for Morton-order matrices

David S. Wise; Jeremy D. Frens; Yuhong Gu; Gregory A. Alexander

The uniform representation of 2-dimensional arrays serially in Morton order (or Z-order) supports both their iterative scan with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multiprocessing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment. We have built a rudimentary C-to-C translator that implements matrices in Morton order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of dgesv (a linear solver) and all of its supporting routines (including dgemm matrix multiplication) forms a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers. We also present results from a more stylish dgemm algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted dgemm, running at 67% of peak flops. The same code performs similarly on several machines. Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.
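
The iterative scans mentioned above rely on arithmetic over Morton indices. A well-known bit trick, shown below as an illustration (not necessarily what the translator emits), increments the column coordinate of a Morton index directly in its dilated form by forcing the carry through the unused row bits.

#include <stdint.h>
#include <stdio.h>

/* Scanning one row of a Morton-ordered 2-D array with a Cartesian-style
   loop.  With row bits in the odd positions and column bits in the even
   positions (a convention), "j := j + 1" is done directly on the dilated
   column field by forcing the carry through the row bits. */
#define EVEN_MASK 0x55555555u     /* column (j) bits of a 32-bit Morton index */
#define ODD_MASK  0xAAAAAAAAu     /* row (i) bits                             */

static uint32_t inc_col(uint32_t z)
{
    return (((z | ODD_MASK) + 1u) & EVEN_MASK) | (z & ODD_MASK);
}

int main(void)
{
    uint32_t z = 0;                        /* element (0, 0) */
    for (int j = 0; j < 8; j++) {          /* walk along row 0 */
        printf("j = %d -> Morton offset %u\n", j, z);
        z = inc_col(z);
    }
    return 0;
}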


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2003

QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Jeremy D. Frens; David S. Wise

Quadtree matrices using Morton-order storage provide natural blocking at every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for transforming the code. Furthermore, the divide-and-conquer algorithm breaks problems down into independent computations, which can be dispatched in parallel for straightforward parallel processing. Proof of concept is given by an algorithm for QR factorization based on Givens rotations for quadtree matrices in Morton-order storage. The algorithms deliver positive results, competing with and even beating the LAPACK equivalent.
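
For readers unfamiliar with the building block, the sketch below shows the textbook computation of a single Givens rotation in C (standard material, not code from the paper): the rotation zeroes one entry of a 2-vector, and the factorization is assembled from many such rotations applied to quadtree blocks.

#include <math.h>
#include <stdio.h>

/* Textbook Givens rotation: choose c, s so that
     [ c  s ] [ a ]   [ r ]
     [-s  c ] [ b ] = [ 0 ].
   QR factorization is assembled from many such rotations. */
static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) { *c = 1.0; *s = 0.0; return; }
    double r = hypot(a, b);       /* sqrt(a*a + b*b) without overflow */
    *c = a / r;
    *s = b / r;
}

int main(void)
{
    double c, s;
    givens(3.0, 4.0, &c, &s);
    printf("c = %g, s = %g, rotated: (%g, %g)\n",
           c, s, c * 3.0 + s * 4.0, -s * 3.0 + c * 4.0);
    return 0;
}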


Web Science | 2008

Converting to and from Dilated Integers

Rajeev Raman; David S. Wise

Dilated integers form an ordered group of the Cartesian indices into a d-dimensional array represented in Morton order. Efficient implementations of their operations can be found elsewhere. This paper offers efficient casting (type) conversions to and from the ordinary integer representation. As the Morton-order representation for 2D and 3D arrays attracts more users because of its excellent block locality, the efficiency of these conversions becomes important; they are essential for programmers who would use Cartesian indexing there. Two algorithms for each conversion are presented, covering both d = 2 and d = 3. They fall into two families. One family uses a newly compact table lookup, so cache capacity is better preserved. The other generalizes better to all d, using processor-local arithmetic that is newly presented as abstract d-ary and (d - 1)-ary recurrences. Test results for two and three dimensions generally favor the former.
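
The arithmetic family of conversions can be illustrated by the classic shift-and-mask sequences for d = 2, given below as a sketch (the paper derives and benchmarks its own sequences, including table-driven variants and the d = 3 case, which are not reproduced here).

#include <stdint.h>
#include <stdio.h>

/* Shift-and-mask conversion between a 16-bit integer and its d = 2
   dilation (bits spread to the even positions of a 32-bit word), and back. */
static uint32_t dilate2(uint16_t x)
{
    uint32_t t = x;
    t = (t | (t << 8)) & 0x00FF00FFu;
    t = (t | (t << 4)) & 0x0F0F0F0Fu;
    t = (t | (t << 2)) & 0x33333333u;
    t = (t | (t << 1)) & 0x55555555u;
    return t;
}

static uint16_t undilate2(uint32_t t)
{
    t &= 0x55555555u;
    t = (t | (t >> 1)) & 0x33333333u;
    t = (t | (t >> 2)) & 0x0F0F0F0Fu;
    t = (t | (t >> 4)) & 0x00FF00FFu;
    t = (t | (t >> 8)) & 0x0000FFFFu;
    return (uint16_t)t;
}

int main(void)
{
    uint16_t i = 25, j = 300;
    uint32_t z = (dilate2(i) << 1) | dilate2(j);   /* Morton index of (i, j) */
    printf("Morton(%u, %u) = %u; recovered i = %u, j = %u\n",
           i, j, z, undilate2(z >> 1), undilate2(z));
    return 0;
}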


BIT Numerical Mathematics | 1977

The one-bit reference count

David S. Wise; Daniel P. Friedman

Deutsch and Bobrow propose a storage-reclamation scheme for a heap that is a hybrid of garbage collection and reference counting. The point of the hybrid scheme is to keep track of very low reference counts between necessary invocations of garbage collection, so that nodes which are allocated and rather quickly abandoned can be returned to available space, delaying the necessity for garbage collection. We show how such a scheme may be implemented using the mark bit already required in every node by the garbage collector. Between garbage collections, that bit is used to distinguish nodes with a reference count known to be one. A significant feature of our scheme is a small cache of references to nodes whose implemented counts “ought to be higher”, which prevents the loss of logical count information in simple manipulations of uniquely referenced structures.
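
A rough C sketch of the flavor of the scheme follows; the data layout and the cache policy here are illustrative assumptions, not the paper's design.

#include <stdbool.h>
#include <stddef.h>

/* Each heap node carries one bit: false = "reference count known to be one",
   true = "possibly shared" (the same bit the garbage collector uses as its
   mark between collections).  A small cache remembers nodes whose true count
   exceeds one; nodes still marked unique can be reclaimed immediately when
   their last reference is dropped, everything else waits for the collector. */
typedef struct node {
    bool shared;
    struct node *car, *cdr;          /* payload of a heap cell */
} node;

#define CACHE_SIZE 8
static node *multi_cache[CACHE_SIZE];

static void copy_reference(node *p)          /* another reference to p appears */
{
    if (p->shared) return;
    for (size_t k = 0; k < CACHE_SIZE; k++)
        if (multi_cache[k] == p) {           /* already counted twice */
            multi_cache[k] = NULL;
            p->shared = true;                /* cannot track a third: punt */
            return;
        }
    for (size_t k = 0; k < CACHE_SIZE; k++)
        if (multi_cache[k] == NULL) {        /* remember the extra count */
            multi_cache[k] = p;
            return;
        }
    p->shared = true;                        /* cache full: give up counting */
}

static bool drop_reference(node *p)          /* true if p can be freed now */
{
    for (size_t k = 0; k < CACHE_SIZE; k++)
        if (multi_cache[k] == p) {           /* count drops from two to one */
            multi_cache[k] = NULL;
            return false;
        }
    return !p->shared;                       /* unique: reclaim immediately */
}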


Journal of Parallel and Distributed Computing | 1990

Costs of quadtree representation of nondense matrices

David S. Wise; John V. Franco

The quadtree representation of matrices is a uniform representation for both sparse and dense matrices that can facilitate shared manipulation on multiprocessors. This paper presents worst-case and average-case resource requirements for storing and retrieving familiar families of patterned matrices: packed, symmetric, triangular, Toeplitz, and banded. Using this representation, it compares the resource requirements of three kinds of permutation matrices as examples of nondense, unpatterned matrices. Exact values for the shuffle and bit-reversal permutations (as in the fast Fourier transform) and tight bounds on the expected values for purely random permutations are derived. Two different measures, density and sparsity, are proposed from these values. Analysis of quadtree matrix addition relates the density of addends to space bounds on their sum, and their sparsity to time bounds for computing that sum.


International Symposium on Symbolic and Algebraic Computation | 1988

Experiments with Quadtree Representation of Matrices

S. Kamal Abdali; David S. Wise

The quadtree matrix representation has recently been proposed as an alternative to the conventional linear storage of matrices. If all elements of a matrix are zero, the matrix is represented by an empty tree; otherwise it is represented by a tree consisting of four subtrees, each representing, recursively, a quadrant of the matrix. Using this four-way block decomposition, algorithms on quadtrees accelerate on blocks entirely of zeroes and thereby offer improved performance on sparse matrices. This paper reports the results of experiments done with a quadtree matrix package implemented in REDUCE to compare the performance of the quadtree representation with REDUCE's built-in sequential representation of matrices. Tests on addition, multiplication, and inversion of dense, triangular, tridiagonal, and diagonal matrices (both symbolic and numeric) of sizes up to 100×100 show that the quadtree algorithms perform well in a broad range of circumstances, sometimes running orders of magnitude faster than their sequential counterparts.
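
The representation just described, and the shortcut on all-zero blocks, can be sketched in C as follows (illustrative only; the experiments reported in the paper were carried out in REDUCE).

#include <stdlib.h>

/* NULL represents an all-zero block; a leaf holds one scalar; an inner node
   holds four quadrant subtrees.  Addition shortcuts whenever either addend
   block is entirely zero, which is where sparse matrices win. */
typedef struct qt {
    double value;                /* used only at leaves (1 x 1 blocks) */
    struct qt *q[4];             /* NW, NE, SW, SE; all NULL at a leaf */
    int is_leaf;
} qt;

static qt *leaf(double v)
{
    if (v == 0.0) return NULL;                 /* a zero leaf is the empty tree */
    qt *t = calloc(1, sizeof *t);
    t->is_leaf = 1;
    t->value = v;
    return t;
}

static qt *qadd(qt *a, qt *b)                  /* A + B, sharing subtrees */
{
    if (a == NULL) return b;                   /* 0 + B = B */
    if (b == NULL) return a;                   /* A + 0 = A */
    if (a->is_leaf) return leaf(a->value + b->value);
    qt *t = calloc(1, sizeof *t);
    for (int k = 0; k < 4; k++)
        t->q[k] = qadd(a->q[k], b->q[k]);
    return t;
}

A fuller version would also collapse an inner node whose four quadrant sums all turn out empty, so that zero blocks stay represented by empty trees.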

Collaboration


Dive into David S. Wise's collaborations.

Top Co-Authors

Daniel P. Friedman, Indiana University Bloomington
Michael Adams, Indiana University Bloomington
Adwait Joshi, Indiana University Bloomington
Craig Citro, Indiana University Bloomington
Eric M. Ost, Indiana University Bloomington
Michael Rainey, Indiana University Bloomington