Thomas H. Cormen | Researchain

Publication

Featured research published by Thomas H. Cormen.

SIAM Journal on Computing | 1999

Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems

Thomas H. Cormen; Thomas Sundquist; Leonard F. Wisniewski

This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data. The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations. Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.
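The definition of a BMMC permutation in the abstract above can be made concrete with a minimal in-memory sketch. It only illustrates the index mapping y = A x XOR c over GF(2) and is not the I/O-efficient algorithm from the paper. The names `apply_bmmc` and `bit_reversal_matrix`, and the choice to pack each matrix row into an integer bitmask, are assumptions made for this example.

```python
def apply_bmmc(rows, complement, x):
    """Map source index x to target index y = A x XOR c over GF(2).

    rows[i] is row i of the characteristic matrix A packed as an int
    (bit j of rows[i] is A[i][j]). Bit i of A x is the parity of
    rows[i] AND x. The mapping is a permutation only if A is nonsingular.
    """
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y ^ complement


def bit_reversal_matrix(n):
    """Characteristic matrix of the n-bit bit-reversal permutation.

    Row i has a single 1 in column n-1-i, so target bit i equals
    source bit n-1-i.
    """
    return [1 << (n - 1 - i) for i in range(n)]


n = 3
rows = bit_reversal_matrix(n)
targets = [apply_bmmc(rows, 0, x) for x in range(1 << n)]
print(targets)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Passing a nonzero `complement` flips the selected target bits, which is the "complement" part of the name.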

Parallel Computing | 1996

Early Experiences in Evaluating the Parallel Disk Model with the ViC* Implementation

Thomas H. Cormen; Melissa Hirschl

Although several algorithms have been developed for the Parallel Disk Model (PDM), few have been implemented. Consequently, little has been known about the accuracy of the PDM in measuring I/O time and total time to perform an out-of-core computation. This paper analyzes timing results on a uniprocessor with several disks for two PDM algorithms, out-of-core radix sort and BMMC permutations, to determine the strengths and weaknesses of the PDM. The results indicate the following. First, good PDM algorithms are usually not I/O bound. Second, of the four PDM parameters, two (problem size and memory size) are good indicators of I/O time and running time, but the other two (block size and number of disks) are not. Third, because PDM algorithms tend not to be I/O bound, asynchronous I/O effectively hides I/O times. The software interface to the PDM is part of the ViC* run-time library. The interface is a set of wrappers that are designed to be both efficient and portable across several parallel file systems and target machines.

Parallel Computing | 1998

Performing Out-of-Core FFTs on Parallel Disk Systems

Thomas H. Cormen; David M. Nicol

The Fast Fourier Transform (FFT) plays a key role in many areas of computational science and engineering. Although most one-dimensional FFT problems can be solved entirely in main memory, some important classes of applications require out-of-core techniques. For these, use of parallel I/O systems can improve performance considerably. This paper shows how to perform one-dimensional FFTs using a parallel disk system with independent disk accesses. We present both analytical and experimental results for performing out-of-core FFTs in two ways: using traditional virtual memory with demand paging, and using a provably asymptotically optimal algorithm for the Parallel Disk Model (PDM) of Vitter and Shriver. When run on a DEC 2100 server with a large memory and eight parallel disks, the optimal algorithm for the PDM runs up to 144.7 times faster than in-core methods under demand paging. Moreover, even including I/O costs, the normalized times for the optimal PDM algorithm are competitive with, or better than, those for in-core methods even when they run entirely in memory.

Journal of Parallel and Distributed Computing | 1993

Fast permuting on disk arrays

Thomas H. Cormen

We present fast algorithms to accomplish common classes of permutations on parallel disk systems. Vitter and Shriver introduced a parallel I/O model and proved an asymptotically tight bound on the number of parallel I/Os needed to perform a general permutation. They demonstrated, however, that at least one type of permutation (matrix transpose) can be performed with fewer parallel I/Os than the lower bound for general permutations. This paper generalizes the Vitter-Shriver matrix-transpose result, showing that other classes of permutations can be performed with fewer parallel I/Os than the general permutation bound in many cases. We show how to perform bit-permute/complement (BPC) permutations, a class including matrix transpose and many other common permutations, with fewer parallel I/Os than general permutations. We also present a fast algorithm to perform bit-matrix-multiply/complement (BMMC) permutations. The algorithms for these permutations are built from restricted classes of permutations that we define, each requiring only one pass over the data. All the permutation algorithms presented in this paper are deterministic and are easily performed on-line.
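To see why matrix transpose belongs to the BPC class, the following sketch expresses the transpose of a 2^r x 2^c row-major matrix as a permutation of index bits. It is an in-memory illustration only, not the disk algorithm from the paper. The function names `apply_bpc` and `transpose_bit_perm` and the convention that `perm[k]` gives the target position of source bit k are assumptions made for this example.

```python
def apply_bpc(perm, complement, x):
    """Move source bit k to target position perm[k], then XOR with complement."""
    y = 0
    for k, target_pos in enumerate(perm):
        y |= ((x >> k) & 1) << target_pos
    return y ^ complement


def transpose_bit_perm(r, c):
    """Bit permutation that transposes a 2^r x 2^c row-major matrix.

    Source index = i * 2^c + j, target index = j * 2^r + i.
    The low c bits (column j) move up by r positions;
    the high r bits (row i) move down to positions 0..r-1.
    """
    return [k + r for k in range(c)] + [k - c for k in range(c, c + r)]


r, c = 1, 2                      # a 2 x 4 matrix: [[0, 1, 2, 3], [4, 5, 6, 7]]
perm = transpose_bit_perm(r, c)
source = list(range(1 << (r + c)))
transposed = [None] * len(source)
for x, value in enumerate(source):
    transposed[apply_bpc(perm, 0, x)] = value
print(transposed)  # [0, 4, 1, 5, 2, 6, 3, 7], i.e. the 4 x 2 transpose in row-major order
```

Because the mapping only rearranges bits, it needs no complement mask here; a nonzero mask would additionally reverse the order along the chosen dimensions.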

ACM Symposium on Parallel Algorithms and Architectures | 2001

Columnsort lives! An efficient out-of-core sorting program

Geeta Chaudhry; Thomas H. Cormen; Leonard F. Wisniewski

We present the design and implementation of a parallel out-of-core sorting algorithm, which is based on Leighton's columnsort algorithm. We show how to relax some of the steps of the original columnsort algorithm to permit a faster out-of-core implementation. Our algorithm requires only 4 passes over the data, and a 3-pass implementation is possible. Although there is a limit on the number of records that can be sorted (as a function of the memory used per processor), this upper limit need not be a severe restriction, and it increases superlinearly with the per-processor memory. To the best of our knowledge, our implementation is the first out-of-core multiprocessor sorting algorithm whose output is in the order assumed by the Parallel Disk Model. We define several measures of sorting efficiency and demonstrate that our implementation's sorting efficiency is competitive with that of NOW-Sort, a sorting algorithm developed to sort large amounts of data quickly on a cluster of workstations.
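For readers unfamiliar with the algorithm this paper builds on, here is a minimal in-core sketch of Leighton's original eight-step columnsort on an r x s matrix. It is not the relaxed out-of-core variant the abstract describes. The flat column-major list layout, the function names, and the use of infinities as padding are choices made for this example; the standard preconditions (s divides r, r is even, r >= 2(s-1)^2) are asserted.

```python
def sort_columns(flat, r):
    """Sort each length-r column of a column-major flat list."""
    return [v for k in range(0, len(flat), r) for v in sorted(flat[k:k + r])]


def columnsort(values, r, s):
    """Sort r*s values; the result is in column-major (here: flat list) order."""
    n = r * s
    assert len(values) == n
    assert r % 2 == 0 and r % s == 0 and r >= 2 * (s - 1) ** 2

    flat = sort_columns(list(values), r)                    # step 1: sort columns
    nxt = [None] * n                                        # step 2: pick up column-major,
    for p, v in enumerate(flat):                            #         lay down row-major
        nxt[(p % s) * r + p // s] = v
    flat = sort_columns(nxt, r)                             # step 3: sort columns
    flat = [flat[(p % s) * r + p // s] for p in range(n)]   # step 4: inverse of step 2
    flat = sort_columns(flat, r)                            # step 5: sort columns

    half = r // 2                                           # step 6: shift by r/2, padding
    padded = [float("-inf")] * half + flat + [float("inf")] * half
    padded = sort_columns(padded, r)                        # step 7: sort the s+1 columns
    return padded[half:half + n]                            # step 8: undo the shift


if __name__ == "__main__":
    import random
    data = list(range(36))
    random.Random(0).shuffle(data)
    print(columnsort(data, r=12, s=3) == sorted(data))
```

The constraint r >= 2(s-1)^2 is what produces the limit on the number of records mentioned in the abstract: for a fixed column height r, only a bounded number of columns s is allowed.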

Algorithm Engineering and Experimentation | 2002

Getting More from Out-of-Core Columnsort

Geeta Chaudhry; Thomas H. Cormen

We describe two improvements to a previous implementation of out-of-core columnsort, in which data reside on multiple disks. The first improvement replaces asynchronous I/O and communication calls by synchronous calls within a threaded framework. Experimental runs show that this improvement reduces the running time to approximately half of the running time of the previous implementation. The second improvement uses algorithmic and engineering techniques to reduce the number of passes over the data from four to three. Experimental evidence shows that this improvement yields modest performance gains. We expect that the performance gain of this second improvement increases when the relative speed of processing and communication increases with respect to disk I/O speeds. Thus, as processing and communication become faster relative to I/O, this second improvement may yield better results than it currently does.

ACM Symposium on Parallel Algorithms and Architectures | 1993

Asymptotically tight bounds for performing BMMC permutations on parallel disk systems

Thomas H. Cormen; Leonard F. Wisniewski

This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data. The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations. Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.
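The run-time detection idea mentioned at the end of the abstract can be illustrated with a straightforward in-memory check. This is one simple way to test the affine property and is not necessarily the procedure used in the paper, which is concerned with doing so using few parallel I/Os. The function name `detect_bmmc` and the column-packed return format are assumptions made for this example. The check relies on the fact that if the mapping is affine, the complement vector must equal the target of index 0, and column j of the matrix must equal the target of index 2^j XOR the complement.

```python
def detect_bmmc(targets):
    """Return (columns, complement) if targets[x] == A x XOR c over GF(2), else None.

    columns[j] is column j of the candidate matrix A packed as an int.
    """
    size = len(targets)
    if size == 0 or size & (size - 1):          # length must be a power of 2
        return None
    if sorted(targets) != list(range(size)):    # must be a permutation
        return None
    n = size.bit_length() - 1

    complement = targets[0]                     # A * 0 XOR c = c
    columns = [targets[1 << j] ^ complement for j in range(n)]

    # Verify every address against the candidate affine map.
    for x, target in enumerate(targets):
        y = complement
        for j in range(n):
            if (x >> j) & 1:
                y ^= columns[j]
        if y != target:
            return None
    return columns, complement


gray_code = [x ^ (x >> 1) for x in range(8)]
print(detect_bmmc(gray_code))                   # ([1, 3, 6], 0)
print(detect_bmmc([0, 1, 2, 3, 4, 5, 7, 6]))    # None
```

The second call fails because swapping only the last two addresses cannot be expressed as a matrix product plus a complement over GF(2).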

High-Level Parallel Programming Models and Supportive Environments | 1998

ViC*: a compiler for virtual-memory C*

Alex Colvin; Thomas H. Cormen

The paper describes the functionality of ViC*, a compiler for a variant of the data-parallel language C* with support for out-of-core data. The compiler translates C* programs with shapes declared out of core, which describe parallel data stored on disk. The compiler output is an SPMD-style program in standard C with I/O and library calls added to efficiently access out-of-core parallel data. The ViC* compiler also applies several program transformations to improve out-of-core data access.

Archive | 1997

Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming

Thomas H. Cormen

We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm’s I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only Θ(lg² N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.

International Parallel and Distributed Processing Symposium | 2005

Building on a framework: using FG for more flexibility and improved performance in parallel programs

Elena Riccio Davidson; Thomas H. Cormen

We describe new features of FG that are designed to improve performance and extend the range of computations that fit into its framework. FG (short for framework generator) is a programming environment for parallel programs running on clusters. It was originally designed to mitigate latency in accessing data by running a program as a series of asynchronous stages that operate on buffers in a linear pipeline. To improve performance, FG now allows stages to be replicated, either statically by the programmer or dynamically by FG itself. FG also now alters thread priorities to use resources more efficiently; again, this action may be initiated by either the programmer or FG. To extend the range of computations that fit into its framework, FG now incorporates fork-join and DAG structures. Not only do fork-join and DAG structures allow for more programs to be designed for FG, but they also can enable significant performance improvements over linear pipeline structures.
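The linear-pipeline structure described above, in which asynchronous stages pass buffers from one to the next, can be sketched with Python threads and queues. This is a generic illustration of the idea and does not reproduce FG's actual API; `run_pipeline`, the sentinel convention, and the example stages are all hypothetical.

```python
import queue
import threading


def run_pipeline(stages, buffers):
    """Run each stage in its own thread; buffers flow through FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    done = object()  # sentinel marking the end of the buffer stream

    def worker(stage, inbox, outbox):
        while True:
            buf = inbox.get()
            if buf is done:
                outbox.put(done)      # propagate shutdown to the next stage
                return
            outbox.put(stage(buf))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for buf in buffers:               # feed the first stage
        queues[0].put(buf)
    queues[0].put(done)

    results = []
    while True:                       # drain the last stage
        out = queues[-1].get()
        if out is done:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stages: scale, sort, then reduce each buffer.
stages = [lambda b: [2 * v for v in b], sorted, sum]
print(run_pipeline(stages, [[3, 1, 2], [5, 4]]))  # [12, 18]
```

Because every stage runs concurrently, a slow stage (for example, one that waits on disk or network) overlaps with the others. Replicating a stage or using a fork-join structure, as the abstract describes, would require more than one worker per queue and is not shown here.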
