Samuel McCauley
Stony Brook University
Publications
Featured research published by Samuel McCauley.
ACM Symposium on Parallel Algorithms and Architectures | 2016
Michael A. Bender; Erik D. Demaine; Roozbeh Ebrahimi; Jeremy T. Fineman; Rob Johnson; Andrea Lincoln; Jayson Lynch; Samuel McCauley
Memory efficiency and locality have a substantial impact on the performance of programs, particularly when operating on large data sets. Thus, memory- or I/O-efficient algorithms have received significant attention in both theory and practice. The widespread deployment of multicore machines, however, brings new challenges. Specifically, since memory (RAM) is shared across multiple processes, the effective memory size allocated to each process fluctuates over time. This paper presents techniques for designing and analyzing algorithms in a cache-adaptive setting, where the RAM available to the algorithm changes over time. These techniques make analyzing algorithms in the cache-adaptive model almost as easy as in the external-memory, or DAM, model. Our techniques enable us to analyze a wide variety of algorithms --- Master-Method-style algorithms, Akra-Bazzi-style algorithms, collections of mutually recursive algorithms, and algorithms, such as FFT, that break problems of size N into subproblems of size Θ(N^c). We demonstrate the effectiveness of these techniques by deriving several results: 1. We give a simple recipe for determining whether common divide-and-conquer cache-oblivious algorithms are optimally cache-adaptive. 2. We show how to bound an algorithm's non-optimality. We give a tight analysis showing that a class of cache-oblivious algorithms is a logarithmic factor worse than optimal. 3. We show the generality of our techniques by analyzing the cache-oblivious FFT algorithm, which is not covered by the above theorems. Nonetheless, the same general techniques show that it is at most O(log log N) away from optimal in the cache-adaptive setting, and that this bound is tight. These general theorems give concrete results about several algorithms that could not be analyzed using earlier techniques. For example, our results apply to the Fast Fourier Transform, matrix multiplication, the Jacobi Multipass Filter, and cache-oblivious dynamic-programming algorithms such as Longest Common Subsequence and Edit Distance. Our results also give algorithm designers clear guidelines for creating optimally cache-adaptive algorithms.
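As a concrete illustration of the class of algorithms these techniques cover, the sketch below is a textbook cache-oblivious, divide-and-conquer matrix multiplication. It is not code from the paper, only an example of a Master-Method-style recursion that keeps recursing until subproblems are tiny and therefore adapts naturally to however much memory happens to be available.

```python
# Illustrative only (not from the paper): textbook cache-oblivious
# divide-and-conquer matrix multiplication, C += A * B for n x n matrices
# stored as lists of lists, with n a power of two.
def rec_matmul(A, B, C, n, ar=0, ac=0, br=0, bc=0, cr=0, cc=0):
    if n == 1:
        C[cr][cc] += A[ar][ac] * B[br][bc]
        return
    h = n // 2
    for i in (0, h):          # half-rows of A and C
        for j in (0, h):      # half-columns of B and C
            for k in (0, h):  # half-columns of A / half-rows of B
                rec_matmul(A, B, C, h,
                           ar + i, ac + k, br + k, bc + j, cr + i, cc + j)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
rec_matmul(A, B, C, 2)
print(C)  # [[19, 22], [43, 50]]
```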
ACM Symposium on Parallel Algorithms and Architectures | 2013
Michael A. Bender; David P. Bunde; Vitus J. Leung; Samuel McCauley; Cynthia A. Phillips
Integrated Stockpile Evaluation (ISE) is a program to test nuclear weapons periodically. Tests are performed by machines that may require occasional calibration. These calibrations are expensive, so finding a schedule that minimizes calibrations allows more testing to be done for a given amount of money. This paper introduces a theoretical framework for ISE. Machines run jobs with release times and deadlines. Calibrating a machine has unit cost; the machine then remains calibrated for T time steps, after which it must be recalibrated before it can resume running jobs. The objective is to complete all jobs while minimizing the number of calibrations. The paper gives several algorithms to solve the ISE problem for the case where jobs have unit processing times. For one available machine, there is an optimal polynomial-time algorithm. For multiple machines, there is a 2-approximation algorithm, which finds an optimal solution when all jobs have distinct deadlines.
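One half of the problem is easy to see in isolation: once the job schedule is fixed, covering the busy time steps with calibration windows of length T is a classic interval point-cover problem solved greedily. The sketch below is illustrative only (it is not the paper's algorithm, which also decides when to run the jobs); it just counts calibrations for a given schedule.

```python
def calibrations_needed(busy_times, T):
    """Greedily cover the busy time steps with calibration windows of length T.
    Starting each window at the first uncovered busy step uses the fewest
    windows for a *fixed* schedule (classic interval point-cover argument).
    The harder part, addressed in the paper, is choosing when to run the jobs."""
    count = 0
    covered_until = -1  # a calibration at time s covers steps s .. s + T - 1
    for t in sorted(busy_times):
        if t > covered_until:
            count += 1
            covered_until = t + T - 1
    return count

# Example: unit jobs scheduled at times 0, 1, and 7 with T = 5 need 2 calibrations.
print(calibrations_needed([0, 1, 7], T=5))  # -> 2
```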
Symposium on Principles of Database Systems | 2016
Michael A. Bender; Jonathan W. Berry; Rob Johnson; Thomas M. Kroeger; Samuel McCauley; Cynthia A. Phillips; Bertrand Simon; Shikha Singh; David Zage
We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent cache-oblivious B-tree and a history-independent external-memory skip list. One of the main contributions is a data structure we build on the way---a history-independent packed-memory array (PMA). The PMA supports efficient range queries, one of the most important operations for answering database queries. Our HI PMA matches the asymptotic bounds of prior non-HI packed-memory arrays and sparse tables. Specifically, a PMA maintains a dynamic set of elements in sorted order in a linear-sized array. Inserts and deletes take an amortized O(log^2 N) element moves with high probability. Simple experiments with our implementation of HI PMAs corroborate our theoretical analysis. Comparisons to regular PMAs give preliminary indications that the practical cost of adding history independence is not too large. Our HI cache-oblivious B-tree bounds match those of prior non-HI cache-oblivious B-trees. Searches take O(log_B N) I/Os; inserts and deletes take O((log^2 N)/B + log_B N) amortized I/Os with high probability; and range queries returning k elements take O(log_B N + k/B) I/Os. Our HI external-memory skip list achieves optimal bounds with high probability, analogous to in-memory skip lists: O(log_B N) I/Os for point queries and amortized O(log_B N) I/Os for inserts/deletes. Range queries returning k elements run in O(log_B N + k/B) I/Os. In contrast, the best possible high-probability bound for inserting into the folklore B-skip list, which promotes elements with probability 1/B, is just Θ(log N) I/Os. This is no better than the bounds one gets from running an in-memory skip list in external memory.
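To make the last comparison concrete, here is a minimal sketch (an assumption-laden illustration, not the paper's data structure) of the folklore B-skip-list promotion rule: each element is independently promoted to the next level with probability 1/B, so levels shrink geometrically by a factor of B. One standard route to history independence for such a structure, in the spirit of the paper, is to derive the level from a hash of the key rather than from fresh random bits, so the shape depends only on the current key set.

```python
import random

def skiplist_level(B, coin=random.random):
    """Folklore B-skip-list promotion (illustrative only): an element reaches
    level l with probability (1/B)**l, so each level is expected to hold a
    1/B fraction of the level below it."""
    level = 0
    while coin() < 1.0 / B:
        level += 1
    return level

# With B = 16, roughly 1 in 16 elements reaches level 1, 1 in 256 reaches level 2, ...
levels = [skiplist_level(16) for _ in range(10_000)]
print(sum(1 for l in levels if l >= 1), "of 10000 elements promoted at least once")
```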
Computing and Combinatorics Conference | 2014
Michael A. Bender; Rezaul Alam Chowdhury; Pramod Ganapathi; Samuel McCauley; Yuan Tang
We define the range 1 query (R1Q) problem as follows. Given a d-dimensional (d ≥ 1) input bit matrix A, preprocess A so that for any given region R of A, one can efficiently answer queries asking whether R contains a 1. We consider both orthogonal and non-orthogonal shapes for R, including rectangles, axis-parallel right triangles, certain types of polygons, and spheres. We provide space-efficient deterministic and randomized algorithms with constant query times (in constant dimensions) for solving the problem in the word RAM model. The space usage in bits is sublinear, linear, or near linear in the size of A, depending on the algorithm.
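As a point of reference (not the paper's space-efficient construction), the classic way to get O(1)-time orthogonal R1Q in two dimensions is a prefix-count table, which uses considerably more space than the sublinear-to-near-linear bit bounds achieved in the paper. A minimal sketch:

```python
def build_prefix(A):
    """P[i][j] = number of 1s in A[0:i][0:j] (exclusive upper corner)."""
    n, m = len(A), len(A[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = A[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def r1q_rect(P, r1, c1, r2, c2):
    """Does the rectangle with corners (r1, c1) and (r2, c2), inclusive, contain a 1?"""
    ones = P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
    return ones > 0

A = [[0, 0, 1],
     [0, 0, 0],
     [0, 1, 0]]
P = build_prefix(A)
print(r1q_rect(P, 0, 0, 1, 1))  # False: the top-left 2x2 block is all zeros
print(r1q_rect(P, 0, 1, 2, 2))  # True: contains the 1s at (0,2) and (2,1)
```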
Theory of Computing Systems / Mathematical Systems Theory | 2018
Roozbeh Ebrahimi; Samuel McCauley; Benjamin Moseley
Online scheduling of parallelizable jobs has received a significant amount of attention recently. Scalable algorithms are known—that is, algorithms that are (1 + ε)-speed O(1)-competitive for any fixed ε>0. Previous research has focused on the case where each job’s parallelizability can be expressed as a concave speedup curve. However, there are cases where a job’s speedup curve can be convex. Considering convex speedup curves has received attention in the offline setting, but, to date, there are no positive results in the online model. In this work, we consider scheduling jobs with convex or concave speedup curves for the first time in the online setting. We give a new algorithm that is (1 + ε)-speed O(1)-competitive. There are strong lower bounds on the competitive ratio if the algorithm is not given resource augmentation over the adversary, and thus this is essentially the best positive result one can show for this setting.
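As a toy illustration of the distinction the abstract draws (not code from the paper), take a concave speedup curve such as Γ(p) = √p, where adding processors gives diminishing returns, versus a convex one such as Γ(p) = p², where a job benefits from getting many processors at once. Both curves are hypothetical examples chosen only for illustration.

```python
import math

def rate_concave(p):
    """Concave speedup: diminishing returns as processors are added."""
    return math.sqrt(p)

def rate_convex(p):
    """Convex speedup: the job wants all of its processors at once."""
    return p ** 2

# With 4 processors and two identical jobs: concave jobs do more total work
# per step when the processors are split evenly, while convex jobs do more
# when one job gets all 4 processors at a time.
print(rate_concave(2) + rate_concave(2), "vs", rate_concave(4))  # ~2.83 vs 2.0
print(rate_convex(2) + rate_convex(2), "vs", rate_convex(4))     # 8 vs 16
```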
International Parallel and Distributed Processing Symposium | 2017
Loris Marchal; Samuel McCauley; Bertrand Simon; Frédéric Vivien
Scientific applications are usually described as directed acyclic graphs, where nodes represent tasks and edges represent dependencies between tasks. For some applications, such as the multifrontal method of sparse matrix factorization, this graph is a tree: each task produces a single output, which is used by a single task (its parent in the tree). We focus on the case where the data manipulated by the tasks are large, which is especially true of the multifrontal method. To process a task, both its inputs and its output must fit in main memory. Moreover, the output of a task has to be stored between its production and its use by the parent task. It may therefore happen, during an execution, that not all data fit together in memory; in particular, this is the case if the total available memory is smaller than the minimum memory required to process the whole tree. In such a case, some data have to be temporarily written to disk and read back afterwards. These Input/Output (I/O) operations are very expensive; hence the need to minimize them. We revisit this open problem in this paper. Specifically, our goal is to minimize the total volume of I/O while processing a given task tree. We first formalize and generalize known results, then prove that existing solutions can be arbitrarily worse than optimal. Finally, we propose a novel heuristic algorithm, based on the optimal tree traversal for memory minimization. We demonstrate the good performance of this new heuristic through simulations on both synthetic trees and realistic trees built from actual sparse matrices.
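The trade-off studied here can be illustrated, under a simplified memory model that may differ from the paper's in details, by computing how much memory a given postorder traversal of a task tree needs if no data are ever written to disk. Whenever the available memory is below this value, some outputs must be spilled and re-read, and that spilled volume is what the heuristic tries to minimize.

```python
def peak_memory(children, out, postorder):
    """Memory needed to run a given postorder traversal with no I/O.
    Simplified model: to process node v, the outputs of all of v's children
    and v's own output must be in memory together; outputs stay live until
    their parent is processed. children: node -> list of children,
    out: node -> output size, postorder: the traversal to simulate."""
    live = {}   # node -> output size currently held in memory
    peak = 0
    for v in postorder:
        need = sum(live.values()) + out[v]   # everything live, plus v's output
        peak = max(peak, need)
        for c in children.get(v, []):
            del live[c]                      # children's outputs are consumed
        live[v] = out[v]                     # v's output now waits for its parent
    return peak

# Tiny example: root r with two leaf children a and b.
children = {"r": ["a", "b"], "a": [], "b": []}
out = {"r": 1, "a": 3, "b": 3}
print(peak_memory(children, out, ["a", "b", "r"]))  # 7: both leaf outputs plus the root's
```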
Latin American Symposium on Theoretical Informatics | 2016
Michael A. Bender; Rezaul Alam Chowdhury; Alexander Conway; Pramod Ganapathi; Rob Johnson; Samuel McCauley; Bertrand Simon; Shikha Singh
We revisit classical sieves for computing primes and analyze their performance in the external-memory model. Most prior sieves are analyzed in the RAM model, where the focus is on minimizing both the total number of operations and the size of the working set. The hope is that if the working set fits in RAM, then the sieve will have good I/O performance, though such an outcome is by no means guaranteed by a small working-set size.
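For context (this is the textbook baseline, not one of the paper's external-memory sieves), a segmented sieve of Eratosthenes keeps the working set small by sieving the range in fixed-size blocks, using only the primes up to √n:

```python
import math

def segmented_sieve(n, block=1 << 16):
    """Primes up to n, sieving one block at a time so the working set stays
    near `block` booleans plus the primes up to sqrt(n)."""
    limit = math.isqrt(n)
    base = [True] * (limit + 1)
    base[0:2] = [False, False]
    for i in range(2, math.isqrt(limit) + 1):
        if base[i]:
            base[i * i :: i] = [False] * len(base[i * i :: i])
    small_primes = [i for i, is_p in enumerate(base) if is_p]

    primes = []
    lo = 2
    while lo <= n:
        hi = min(lo + block - 1, n)
        seg = [True] * (hi - lo + 1)
        for p in small_primes:
            start = max(p * p, ((lo + p - 1) // p) * p)
            for m in range(start, hi + 1, p):
                seg[m - lo] = False
        primes.extend(lo + i for i, is_p in enumerate(seg) if is_p)
        lo = hi + 1
    return primes

print(segmented_sieve(100)[:10])  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```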
Fun with Algorithms | 2016
Michael A. Bender; Samuel McCauley; Bertrand Simon; Shikha Singh; Frédéric Vivien
This paper formalizes a resource-allocation problem that is all too familiar to the seasoned program-committee member. For each submission j that the PC member has the honor of reviewing, there is a choice. The PC member can spend the time to review submission j in detail on his/her own at a cost of C_j. Alternatively, the PC member can spend the time to identify and contact peers, hoping to recruit them as subreviewers, at a cost of 1 per subreviewer. These potential subreviewers have a certain probability of rejecting each review request, and this probability increases as time goes on. Once the PC member runs out of time or unasked experts, he/she is forced to review the paper without outside assistance. This paper gives optimal solutions to several variations of the scheduling-reviewers problem. Most of the solutions from this paper are based on an iterated log function of C_j. In particular, with k rounds, the optimal solution sends the k-iterated log of C_j requests in the first round, the (k − 1)-iterated log in the second round, and so forth. One of the contributions of this paper is solving this problem exactly, even when rejection probabilities may increase. Naturally, PC members must make an integral number of subreview requests. This paper gives, as an intermediate result, a linear-time algorithm to transform the artificial problem in which one can send fractional requests into the less-artificial problem in which one sends an integral number of requests. Finally, this paper considers the case where the PC member knows nothing about the probability that a potential subreviewer agrees to review the paper. This paper gives an approximation algorithm for this case, whose bounds improve as the number of rounds increases.
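A rough sketch of the iterated-log schedule described above, ignoring the constants, rounding, and changing rejection probabilities that the paper handles exactly (the log base and the ceiling used here are arbitrary choices for illustration):

```python
import math

def iterated_log(x, times, base=2):
    """Apply log `times` times, clamped so the result stays at least 1."""
    for _ in range(times):
        x = math.log(max(x, base), base)
    return max(x, 1.0)

def request_schedule(C, k, base=2):
    """Round i (1-indexed) sends about the (k - i + 1)-iterated log of C requests.
    Illustrative only: the paper pins down the exact, integral numbers."""
    return [math.ceil(iterated_log(C, k - i + 1, base)) for i in range(1, k + 1)]

# A costly review (C = 2**16) with 3 rounds: few requests first, more later.
print(request_schedule(2 ** 16, 3))  # [2, 4, 16]
```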
Theoretical Computer Science | 2016
Michael A. Bender; Rezaul Alam Chowdhury; Pramod Ganapathi; Samuel McCauley; Yuan Tang
We define the range 1 query (R1Q) problem as follows. Given a d-dimensional (d ≥ 1) input bit matrix A, preprocess A so that for any given region R of A, one can efficiently answer queries asking whether R contains a 1. We consider both orthogonal and non-orthogonal shapes for R, including rectangles, axis-parallel right triangles, certain types of polygons, and spheres. We provide space-efficient deterministic and randomized algorithms with constant query times (in constant dimensions) for solving the problem in the word RAM model. The space usage in bits is sublinear, linear, or near linear in the size of A, depending on the algorithm.
International Symposium on Algorithms and Computation | 2015
Michael A. Bender; Samuel McCauley; Andrew McGregor; Shikha Singh; Hoa T. Vu
We revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible.
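For reference, the classical baseline for run generation is replacement selection, which keeps a buffer of M elements in a heap and, on random input, produces runs of expected length about 2M. The sketch below is this textbook heuristic, not necessarily an algorithm from the paper.

```python
import heapq

def replacement_selection(stream, M):
    """Classical replacement selection with a buffer of M elements.
    Emits runs (sorted lists); on random input, runs average about 2M long."""
    it = iter(stream)
    buf = []
    for _ in range(M):                       # fill the buffer
        try:
            buf.append(next(it))
        except StopIteration:
            break
    heapq.heapify(buf)

    runs, current, frozen = [], [], []
    while buf:
        smallest = heapq.heappop(buf)
        current.append(smallest)             # output the smallest buffered element
        try:
            nxt = next(it)
            if nxt >= smallest:
                heapq.heappush(buf, nxt)     # can still join the current run
            else:
                frozen.append(nxt)           # must wait for the next run
        except StopIteration:
            pass
        if not buf:                          # current run is finished
            runs.append(current)
            current = []
            buf = frozen
            heapq.heapify(buf)
            frozen = []
    return runs

print(replacement_selection([5, 1, 4, 2, 8, 3, 7, 6], M=3))
# -> [[1, 2, 4, 5, 7, 8], [3, 6]]
```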