Katherine E. Coons | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Katherine E. Coons is active.

Explore More

Publication

Featured researches published by Katherine E. Coons.

architectural support for programming languages and operating systems | 2009

An evaluation of the TRIPS computer system

Mark Gebhart; Bertrand A. Maher; Katherine E. Coons; Jeffrey R. Diamond; Paul V. Gratz; Mario Marino; Nitya Ranganathan; Behnam Robatmili; Aaron Smith; James H. Burrill; Stephen W. Keckler; Doug Burger; Kathryn S. McKinley

The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycles counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies.

architectural support for programming languages and operating systems | 2006

A spatial path scheduling algorithm for EDGE architectures

Katherine E. Coons; Xia Chen; Doug Burger; Kathryn S. McKinley; Sundeep K. Kushwaha

Growing on-chip wire delays are motivating architectural features that expose on-chip communication to the compiler. EDGE architectures are one example of communication-exposed microarchitectures in which the compiler forms dataflow graphs that specify how the microarchitecture maps instructions onto a distributed execution substrate. This paper describes a compiler scheduling algorithm called spatial path scheduling that factors in previously fixed locations - called anchor points - for each placement. This algorithm extends easily to different spatial topologies. We augment this basic algorithm with three heuristics: (1) local and global ALU and network link contention modeling, (2) global critical path estimates, and (3) dependence chain path reservation. We use simulated annealing to explore possible performance improvements and to motivate the augmented heuristics and their weighting functions. We show that the spatial path scheduling algorithm augmented with these three heuristics achieves a 21% average performance improvement over the best prior algorithm and comes within an average of 5% of the annealed performance for our benchmarks.

acm sigplan symposium on principles and practice of parallel programming | 2010

GAMBIT: effective unit testing for concurrency libraries

Katherine E. Coons; Sebastian Burckhardt; Madanlal Musuvathi

As concurrent programming becomes prevalent, software providers are investing in concurrency libraries to improve programmer productivity. Concurrency libraries improve productivity by hiding error-prone, low-level synchronization from programmers and providing higher-level concurrent abstractions. Testing such libraries is difficult, however, because concurrency failures often manifest only under particular scheduling circumstances. Current best testing practices are often inadequate: heuristic-guided fuzzing is not systematic, systematic schedule enumeration does not find bugs quickly, and stress testing is neither systematic nor fast. To address these shortcomings, we propose a prioritized search technique called GAMBIT that combines the speed benefits of heuristic-guided fuzzing with the soundness, progress, and reproducibility guarantees of stateless model checking. GAMBIT combines known techniques such as partial-order reduction and preemption-bounding with a generalized best-first search frame- work that prioritizes schedules likely to expose bugs. We evaluate GAMBITs effectiveness on newly released concurrency libraries for Microsofts .NET framework. Our experiments show that GAMBIT finds bugs more quickly than prior stateless model checking techniques without compromising coverage guarantees or reproducibility.

tools and algorithms for construction and analysis of systems | 2010

Preemption sealing for efficient concurrency testing

Thomas Ball; Sebastian Burckhardt; Katherine E. Coons; Madanlal Musuvathi; Shaz Qadeer

The choice of where a thread scheduling algorithm preempts one thread in order to execute another is essential to reveal concurrency errors such as atomicity violations, livelocks, and deadlocks. We present a scheduling strategy called preemption sealing that controls where and when a scheduler is disabled from preempting threads during program execution. We demonstrate that this strategy is effective in addressing two key problems in testing industrial-scale concurrent programs: (1) tolerating existing errors in order to find more errors, and (2) compositional testing of layered, concurrent systems. We evaluate the effectiveness of preemption sealing, implemented in the Chess tool, for these two scenarios on newly released concurrency libraries for Microsofts .NET framework.

conference on object oriented programming systems languages and applications | 2013

Bounded partial-order reduction

Katherine E. Coons; Madanlal Musuvathi; Kathryn S. McKinley

Eliminating concurrency errors is increasingly important as systems rely more on parallelism for performance. Exhaustively exploring the state-space of a programs thread interleavings finds concurrency errors and provides coverage guarantees, but suffers from exponential state-space explosion. Two prior approaches alleviate state-space explosion. (1) Dynamic partial-order reduction (DPOR) provides full coverage and explores only one interleaving of independent transitions. (2) Bounded search provides bounded coverage by enumerating interleavings that do not exceed a bound. In particular, we focus on preemption-bounding. Combining partial-order reduction with preemption-bounding had remained an open problem. We show that preemption-bounded search explores the same partial orders repeatedly and consequently explores more executions than unbounded DPOR, even for small bounds. We further show that if DPOR simply uses the preemption bound to prune the state space as it explores new partial orders, it misses parts of the state space reachable in the bound and is therefore unsound. The bound essentially induces dependences between otherwise independent transitions in the DPOR state space. We introduce Bounded Partial Order Reduction (BPOR), a modification of DPOR that compensates for bound dependences. We identify properties that determine how well bounds combine with partial-order reduction. We prove sound coverage and empirically evaluate BPOR with preemption and fairness bounds. We show that by eliminating redundancies, BPOR significantly reduces testing time compared to bounded search. BPORs faster incremental guarantees will help testers verify larger concurrent programs.

international symposium on microarchitecture | 2008

Strategies for mapping dataflow blocks to distributed hardware

Behnam Robatmili; Katherine E. Coons; Doug Burger; Kathryn S. McKinley

Distributed processors must balance communication and concurrency. When dividing instructions among the processors, key factors are the available concurrency, criticality of dependence chains, and communication penalties. The amount of concurrency determines the importance of the other factors: if concurrency is high, wider distribution of instructions is likely to tolerate the increased operand routing latencies. If concurrency is low, mapping dependent instructions close to one another is likely to reduce communication costs that contribute to the critical path. This paper explores these tradeoffs for distributed Explicit Dataflow Graph Execution (EDGE) architectures that execute blocks of dataflow instructions atomically. A runtime block mapper assigns instructions from a single thread to distributed hardware resources (cores) based on compiler-assigned instruction identifiers. We explore two approaches: fixed strategies that map all blocks to the same number of cores, and adaptive strategies that vary the number of cores for each block. The results show that best fixed strategy varies, based on the corespsila issue width. A simple adaptive strategy improves performance over the best fixed strategies for single and dual-issue cores, but its benefits decrease as the corespsila issue width increases. These results show that by choosing an appropriate runtime block mapping strategy, average performance can be increased by 18%, while simultaneously reducing average operand communication by 70%, saving energy as well as improving performance. These results indicate that runtime block mapping is a promising mechanism for balancing communication and concurrency in distributed processors.

international conference on parallel architectures and compilation techniques | 2008

Feature selection and policy optimization for distributed instruction placement using reinforcement learning

Katherine E. Coons; Behnam Robatmili; Matthew E. Taylor; Bertrand A. Maher; Doug Burger; Kathryn S. McKinley

Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicate with one another directly on a distributed substrate, give the compiler control over communication overheads at a fine granularity. Prior work shows that compilers can effectively reduce fine-grained communication overheads in EDGE architectures using a spatial instruction placement algorithm with a heuristic-based cost function. While this algorithm is effective, the cost function must be painstakingly tuned. Heuristics tuned to perform well across a variety of applications leave users with little ability to tune performance-critical applications, yet we find that the best placement heuristics vary significantly with the application. First, we suggest a systematic feature selection method that reduces the feature set size based on the extent to which features affect performance. To automatically discover placement heuristics, we then use these features as input to a reinforcement learning technique, called Neuro-Evolution of Augmenting Topologies (NEAT), that uses a genetic algorithm to evolve neural networks. We show that NEAT outperforms simulated annealing, the most commonly used optimization technique for instruction placement. We use NEAT to learn general heuristics that are as effective as hand-tuned heuristics, but we find that improving over highly hand-tuned general heuristics is difficult. We then suggest a hierarchical approach to machine learning that classifies segments of code with similar characteristics and learns heuristics for these classes. This approach performs closer to the specialized heuristics. Together, these results suggest that learning compiler heuristics may benefit from both improved feature selection and classification.

languages and compilers for parallel computing | 2008

Register Bank Assignment for Spatially Partitioned Processors

Behnam Robatmili; Katherine E. Coons; Doug Burger; Kathryn S. McKinley

Demand for instruction level parallelism calls for increasing register bandwidth without increasing the number of register ports. Emerging architectures address this need by partitioning registers into multiple distributed banks, which offers a technology scalable substrate but a challenging compilation target. This paper introduces a register allocator for spatially partitioned architectures. The allocator performs bank assignment together with allocation. It minimizes spill code and optimizes bank selection based on a priority function. This algorithm is unique because it must reason about multiple competing resource constraints and dependencies exposed by these architectures. We demonstrate an algorithm that uses critical path estimation, delays from registers to consuming functional units, and hardware resource constraints. We evaluate the algorithm on TRIPS, a functional, partitioned, tiled processor with register banks distributed on top of a 4 ×4 grid of ALUs. These results show that the priority banking algorithm implements a number of policies that improve performance, performance is sensitive to bank assignment, and the compiler manages this resource well.

programming language design and implementation | 2010

PACER: proportional detection of data races

Michael D. Bond; Katherine E. Coons; Kathryn S. McKinley

Archive | 2007

Software Infrastructure and Tools for the TRIPS Prototype

Bill Yoder; Jim Burrill; Robert McDonald; Kevin Bush; Katherine E. Coons; Mark Gebhart; Sibi Govindan; Bertrand A. Maher; Ramadas Nagarajan; Behnam Robatmili; Karthikeyan Sankaralingam; Sadia Sharif; Aaron Smith; Doug Burger; Stephen W. Keckler; Kathryn S. McKinley

Explore More