Tao B. Schardl

Massachusetts Institute of Technology

Publication

Featured research published by Tao B. Schardl.

ACM Symposium on Parallel Algorithms and Architectures | 2014

Ordering heuristics for parallel graph coloring

William C. Hasenplaugh; Tim Kaler; Tao B. Schardl; Charles E. Leiserson

This paper introduces the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) ordering heuristics for parallel greedy graph-coloring algorithms, which are inspired by the largest-degree-first (LF) and smallest-degree-last (SL) serial heuristics, respectively. We show that although LF and SL, in practice, generate colorings with relatively small numbers of colors, they are vulnerable to adversarial inputs for which any parallelization yields a poor parallel speedup. In contrast, LLF and SLL allow for provably good speedups on arbitrary inputs while, in practice, producing colorings of competitive quality to their serial analogs. We applied LLF and SLL to the parallel greedy coloring algorithm introduced by Jones and Plassmann, referred to here as JP. Jones and Plassmann analyze the variant of JP that processes the vertices of a graph in a random order, and show that on an O(1)-degree graph G = (V, E), this JP-R variant has an expected parallel running time of O(lg V / lg lg V) in a PRAM model. We improve this bound to show, using work-span analysis, that JP-R, augmented to handle arbitrary-degree graphs, colors a graph G = (V, E) with degree Δ using Θ(V + E) work and O(lg V + lg Δ · min{√E, Δ + lg Δ lg V / lg lg V}) expected span. We prove that JP-LLF and JP-SLL (JP using the LLF and SLL heuristics, respectively) execute with the same asymptotic work as JP-R and only logarithmically more span while producing higher-quality colorings than JP-R in practice. We engineered an efficient implementation of JP for modern shared-memory multicore computers and evaluated its performance on a machine with 12 Intel Core-i7 (Nehalem) processor cores.
Our implementation of JP-LLF achieves a geometric-mean speedup of 7.83 on eight real-world graphs and a geometric-mean speedup of 8.08 on ten synthetic graphs, while our implementation using SLL achieves a geometric-mean speedup of 5.36 on these real-world graphs and a geometric-mean speedup of 7.02 on these synthetic graphs. Furthermore, on one processor, JP-LLF is slightly faster than a well-engineered serial greedy algorithm using LF, and likewise, JP-SLL is slightly faster than the greedy algorithm using SL.
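
The round structure of JP can be illustrated with a small serial simulation. The sketch below is not the authors' multicore implementation: the function names, the adjacency-dict graph layout, and the use of ceil(lg(degree)) with a random tiebreak as the LLF priority are assumptions made for illustration. Each pass of the loop colors every vertex whose higher-priority neighbors are already colored, so the number of rounds is a rough proxy for the span.

```python
import math
import random


def llf_priority(graph, seed=0):
    """Assumed LLF priority: (ceil(lg(degree)), random tiebreak)."""
    rng = random.Random(seed)
    order = list(graph)
    rng.shuffle(order)
    tiebreak = {v: i for i, v in enumerate(order)}
    priority = {}
    for v, neighbors in graph.items():
        degree = len(neighbors)
        log_degree = math.ceil(math.log2(degree)) if degree > 0 else 0
        priority[v] = (log_degree, tiebreak[v])
    return priority


def jp_color(graph, priority):
    """Simulate JP round by round; return (coloring, number of rounds)."""
    color = {}
    uncolored = set(graph)
    rounds = 0
    while uncolored:
        # A vertex is ready once every higher-priority neighbor is colored.
        ready = [
            v for v in uncolored
            if all(u in color or priority[u] < priority[v] for u in graph[v])
        ]
        # Ready vertices are pairwise non-adjacent (priorities are distinct),
        # so a parallel implementation could color them simultaneously.
        for v in ready:
            used = {color[u] for u in graph[v] if u in color}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        uncolored.difference_update(ready)
        rounds += 1
    return color, rounds
```

Calling `jp_color(graph, llf_priority(graph))` on a symmetric adjacency dict returns a proper coloring together with the round count; swapping in a different priority function is one way to compare ordering heuristics on small inputs.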

ACM Symposium on Parallel Algorithms and Architectures | 2015

The Cilkprof Scalability Profiler

Tao B. Schardl; Bradley C. Kuszmaul; I-Ting Angelina Lee; William M. Leiserson; Charles E. Leiserson

Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprof's usefulness, we used Cilkprof to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprof's output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug that had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves theoretically, using an amortization argument, that Cilkprof incurs only constant overhead compared with the application's native serial running time. Cilkprof was implemented by compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.
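
To make the work and span measures concrete, here is a minimal sketch that computes them for a toy fork-join computation. It is not the Cilkprof algorithm; the nested-tuple representation (`"series"` and `"parallel"` nodes with numeric leaf costs) and the function names are assumptions chosen for illustration. Work adds up everywhere, while span adds across serial composition and takes the maximum across parallel composition.

```python
def work_span(node):
    """Return (work, span) for a toy fork-join computation tree.

    A node is either a number (the cost of a serial strand) or a pair
    (kind, children) where kind is "series" or "parallel".
    """
    if isinstance(node, (int, float)):
        return node, node
    kind, children = node
    results = [work_span(child) for child in children]
    work = sum(w for w, _ in results)
    if kind == "series":
        span = sum(s for _, s in results)
    elif kind == "parallel":
        span = max((s for _, s in results), default=0)
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    return work, span


def parallelism(node):
    """Parallelism is the ratio of work to span."""
    work, span = work_span(node)
    return work / span
```

For `("series", [2, ("parallel", [5, 3]), 1])` the work is 11 and the span is 8, so the parallelism is 1.375: the strand of cost 5 dominates the parallel region and limits the achievable speedup.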

ACM Symposium on Parallel Algorithms and Architectures | 2014

Executing dynamic data-graph computations deterministically using chromatic scheduling

Tim Kaler; William C. Hasenplaugh; Tao B. Schardl; Charles E. Leiserson

A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by PRISM to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G = (V, E) be a degree-Δ graph colored with χ colors, and suppose that Q ⊆ V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs updates in Q using O(χ(lg(Q/χ) + lg Δ) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2–2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. 
PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is more involved, incorporating a multivector data structure to maintain an ordered set of vertices partitioned by color.
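
The idea of chromatic scheduling can be sketched serially. The code below is an illustration only, not PRISM itself: it uses ordinary Python containers in place of the multibag, and the function names, the update-function signature, and the min-label example are assumptions. Active vertices are grouped by color and the colors are processed one after another; vertices of the same color are never adjacent, so their updates touch disjoint data and could safely run in parallel without locks.

```python
from collections import defaultdict


def chromatic_schedule(graph, color, data, update, active):
    """Run a dynamic data-graph computation color by color.

    update(v, graph, data) may modify data[v] and returns the set of
    vertices to activate for the next round. Returns the round count.
    """
    rounds = 0
    while active:
        by_color = defaultdict(list)
        for v in active:
            by_color[color[v]].append(v)
        next_active = set()
        for c in sorted(by_color):
            # Same-colored vertices are pairwise non-adjacent, so this
            # inner loop is where a parallel loop would go.
            for v in by_color[c]:
                next_active |= update(v, graph, data)
        active = next_active
        rounds += 1
    return rounds


def min_label_update(v, graph, data):
    """Example update: adopt the smallest label in the neighborhood."""
    best = min([data[v]] + [data[u] for u in graph[v]])
    if best < data[v]:
        data[v] = best
        return set(graph[v])  # neighbors must re-check their labels
    return set()
```

With every vertex initially active and `data[v] = v`, repeated rounds propagate the smallest vertex id through each connected component, and the result does not depend on the order in which same-colored vertices are visited.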

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017

Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation

Tao B. Schardl; William S. Moses; Charles E. Leiserson

This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program's control-flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM's 4-million-line codebase. Tapir enables LLVM's existing compiler optimizations for serial code (including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination) to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

ACM Symposium on Parallel Algorithms and Architectures | 2015

Efficiently Detecting Races in Cilk Programs That Use Reducer Hyperobjects

I-Ting Angelina Lee; Tao B. Schardl

A multithreaded Cilk program that is ostensibly deterministic may nevertheless behave nondeterministically due to programming errors in the code. For a Cilk program that uses reducers, a general reduction mechanism supported in various Cilk dialects, such programming errors are especially challenging to debug, because the errors can expose the nondeterminism in how the Cilk runtime system manages a reducer. We identify two unique types of races that arise from incorrect use of reducers in a Cilk program and present two algorithms to catch them. The first algorithm, called the Peer-Set algorithm, detects view-read races, which occur when the program attempts to retrieve a value out of a reducer when the read may result in a nondeterministic value, such as before all previously spawned subcomputations that might update the reducer have necessarily returned. The second algorithm, called the SP+ algorithm, detects determinacy races, instances where a write to a memory location occurs logically in parallel with another access to that location, even when the raced-on memory locations relate to reducers. Both algorithms are provably correct, asymptotically efficient, and can be implemented efficiently in practice. We have implemented both algorithms in our prototype race detector, Rader. When running Peer-Set, Rader incurs a geometric-mean multiplicative overhead of 2.32 over running the benchmark without instrumentation. When running SP+, Rader incurs a geometric-mean multiplicative overhead of 16.76.

parallel computing | 2015

On-the-Fly Pipeline Parallelism

I-Ting Angelina Lee; Charles E. Leiserson; Tao B. Schardl; Zhunping Zhang; Jim Sukha

Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T_1 work and T_∞ span (critical-path length), Piper executes the computation on P processors in T_P ≤ T_1/P + O(T_∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. 
One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
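
As a rough illustration of the work and span of a pipeline computation, the sketch below evaluates both for a simplified pipeline whose number of stages may differ per iteration. It is not the Piper scheduler. It assumes, for illustration only, that stage j of iteration i waits for stage j-1 of the same iteration and, when it exists, for stage j of iteration i-1; real on-the-fly pipelines may have fewer or data-dependent cross edges. The function name and the list-of-lists cost layout are hypothetical.

```python
def pipeline_work_span(cost):
    """Return (work, span) for a simplified pipeline.

    cost[i][j] is the cost of stage j in iteration i. Rows may have
    different lengths, modeling a varying number of stages.
    """
    finish = []  # finish[i][j]: earliest completion time of node (i, j)
    work = 0
    for i, row in enumerate(cost):
        finish_row = []
        for j, c in enumerate(row):
            work += c
            prev_stage = finish_row[j - 1] if j > 0 else 0
            has_cross_edge = i > 0 and j < len(finish[i - 1])
            prev_iter = finish[i - 1][j] if has_cross_edge else 0
            finish_row.append(c + max(prev_stage, prev_iter))
        finish.append(finish_row)
    span = max((max(row) for row in finish if row), default=0)
    return work, span
```

For four iterations of three unit-cost stages the work is 12 and the span is 6, so at most a twofold speedup is available under these assumed dependencies, regardless of the processor count.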

Information Processing Letters | 2016

On the efficiency of localized work stealing

Warut Suksompong; Charles E. Leiserson; Tao B. Schardl

This paper investigates a variant of the work-stealing algorithm that we call the localized work-stealing algorithm. The intuition behind this variant is that because of locality, processors can benefit from working on their own work. Consequently, when a processor is free, it makes a steal attempt to get back its own work. We call this type of steal a steal-back. We show that the expected running time of the algorithm is T_1/P + O(T_∞ P), and that under the “even distribution of free agents assumption”, the expected running time of the algorithm is T_1/P + O(T_∞ lg P). In addition, we obtain another running-time bound based on ratios between the sizes of serial tasks in the computation. If M denotes the maximum ratio between the largest and the smallest serial tasks of a processor after removing a total of O(P) serial tasks across all processors from consideration, then the expected running time of the algorithm is T_1/P + O(T_∞ M).
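
The three running-time bounds can be compared numerically. The sketch below simply evaluates each expression with the constant hidden by the O-notation set to 1, which is an assumption made purely for illustration; the function name and the sample values are hypothetical and say nothing about measured performance.

```python
import math


def localized_stealing_bounds(t1, t_inf, p, m):
    """Evaluate the three bounds with the hidden constant assumed to be 1."""
    base = t1 / p
    return {
        "general": base + t_inf * p,
        "even_distribution": base + t_inf * math.log2(p),
        "task_ratio": base + t_inf * m,
    }
```

With T_1 = 1024, T_∞ = 4, P = 16, and M = 2, the three expressions evaluate to 128, 80, and 72, which shows how the additive span term shrinks as the assumptions get stronger.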

International Journal of Computational Geometry and Applications | 2013

FOLDING EQUILATERAL PLANE GRAPHS

Zachary Abel; Erik D. Demaine; Martin L. Demaine; Sarah Eisenstat; Jayson Lynch; Tao B. Schardl; Isaac Shapiro-Ellowitz

We consider two types of folding applied to equilateral plane graph linkages. First, under continuous folding motions, we show how to reconfigure any linear equilateral tree (lying on a line) into a canonical configuration. By contrast, it is known that such reconfiguration is not always possible for linear (nonequilateral) trees and for (nonlinear) equilateral trees. Second, under instantaneous folding motions, we show that an equilateral plane graph has a noncrossing linear folded state if and only if it is bipartite. Furthermore, we show that the equilateral constraint is necessary for this result, by proving that it is strongly NP-complete to decide whether a (nonequilateral) plane graph has a linear folded state. Equivalently, we show strong NP-completeness of deciding whether an abstract metric polyhedral complex with one central vertex has a noncrossing flat folded state. By contrast, the analogous problem for a polyhedral manifold with one central vertex (single-vertex origami) is only weakly NP-complete.
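
The bipartiteness condition in the second result is easy to test. The sketch below is a standard breadth-first 2-coloring check, not anything taken from the paper; it assumes the graph is given as a symmetric adjacency dict, and the function name is hypothetical. It only decides the condition and does not construct a folded state.

```python
from collections import deque


def is_bipartite(graph):
    """Return True if the undirected graph can be 2-colored."""
    side = {}
    for start in graph:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in graph[v]:
                if u not in side:
                    side[u] = 1 - side[v]
                    queue.append(u)
                elif side[u] == side[v]:
                    return False  # an odd cycle was found
    return True
```

A 4-cycle passes the check and a triangle fails it, matching the statement that an equilateral plane graph has a noncrossing linear folded state exactly when it is bipartite.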

parallel computing | 2016

Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling

Tim Kaler; William C. Hasenplaugh; Tao B. Schardl; Charles E. Leiserson

A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex’s prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This article introduces Prism, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Prism uses a vertex coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by Prism to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze Prism using work-span analysis. Let G = (V, E) be a degree-Δ graph colored with χ colors, and suppose that Q ⊆ V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of Prism performs updates in Q using O(χ(lg(Q/χ) + lg Δ) + lg P) span and Θ(size(Q) + P) work. These theoretical guarantees are matched by good empirical performance. To isolate the effect of the scheduling algorithm on performance, we modified GraphLab to incorporate Prism and studied seven application benchmarks on a 12-core multicore machine. Prism executes the benchmarks 1.2 to 2.1 times faster than GraphLab’s nondeterministic lock-based scheduler while providing deterministic behavior. 
This article also presents Prism-R, a variation of Prism that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. Prism-R satisfies the same theoretical bounds as Prism, but its implementation is more involved, incorporating a multivector data structure to maintain a deterministically ordered set of vertices partitioned by color. Despite its additional complexity, Prism-R is only marginally slower than Prism. On the seven application benchmarks studied, Prism-R incurs a 7% geometric mean overhead relative to Prism.

Journal of Information Processing | 2013

Finding a Hamiltonian Path in a Cube with Specified Turns is Hard

Zachary Abel; Erik D. Demaine; Martin L. Demaine; Sarah Eisenstat; Jayson Lynch; Tao B. Schardl

We prove the NP-completeness of finding a Hamiltonian path in an N × N × N cube graph with turns exactly at specified lengths along the path. This result establishes NP-completeness of Snake Cube puzzles: folding a chain of N³ unit cubes, joined at face centers (usually by a cord passing through all the cubes), into an N × N × N cube. Along the way, we prove a universality result that zig-zag chains (which must turn every unit) can fold into any polycube after 4 × 4 × 4 refinement, or into any Hamiltonian polycube after 2 × 2 × 2 refinement.
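
For very small cubes the puzzle can be solved by exhaustive search. The following is a brute-force backtracking sketch, not a technique from the paper, and it takes exponential time in general, which is consistent with the hardness result. The function name and the input encoding are assumptions: `segments` lists the number of unit steps in each straight run, and the chain must turn by 90 degrees between consecutive runs.

```python
from itertools import product

DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def fold_snake(n, segments):
    """Return a list of n**3 cells forming a valid folding, or None."""

    def extend(pos, visited, path, k, prev_dir):
        if k == len(segments):
            return path if len(visited) == n ** 3 else None
        for d in DIRECTIONS:
            # A turn is required: skip the previous direction and its reverse.
            if prev_dir is not None and d in (prev_dir, tuple(-c for c in prev_dir)):
                continue
            cells, p, ok = [], pos, True
            for _ in range(segments[k]):
                p = (p[0] + d[0], p[1] + d[1], p[2] + d[2])
                if not all(0 <= c < n for c in p) or p in visited:
                    ok = False
                    break
                cells.append(p)
            if not ok:
                continue
            visited.update(cells)
            result = extend(p, visited, path + cells, k + 1, d)
            if result is not None:
                return result
            visited.difference_update(cells)  # backtrack
        return None

    for start in product(range(n), repeat=3):
        result = extend(start, {start}, [start], 0, None)
        if result is not None:
            return result
    return None
```

For N = 2 the chain of eight cubes must turn at every cube, so `fold_snake(2, [1] * 7)` finds a folding, while any input containing a run of two steps has no solution because three cubes in a line do not fit.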
