I-Ting Angelina Lee | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where I-Ting Angelina Lee is active.

Explore More

Publication

Featured researches published by I-Ting Angelina Lee.

international conference on parallel architectures and compilation techniques | 2010

Using memory mapping to support cactus stacks in work-stealing runtime systems

I-Ting Angelina Lee; Silas Boyd-Wickizer; Zhiyi Huang; Charles E. Leiserson

Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a “cactus stack,” wherein a functions accesses to stack variables properly respect the functions calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: † full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack, † a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism, and † bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM). We have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5s limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system, and the consumption of stack space is modest.

acm symposium on parallel algorithms and architectures | 2011

Location-based memory fences

I-Ting Angelina Lee; Dmitry Vyukov

Traditional memory fences are program-counter (PC) based. That is, a memory fence enforces a serialization point in the program instruction stream --- it ensures that all memory references before the fence in the program order have taken effect before the execution continues onto instructions after the fence. Such PC-based memory fences always cause the processor to stall, even when the synchronization is unnecessary during a particular execution. We propose the concept of location-based memory fences, which aim to reduce the cost of synchronization due to the latency of memory fence execution in parallel algorithms. Unlike a PC-based memory fence, a location-based memory fence serializes the instruction stream of the executing thread T1 only when a different thread T2 attempts to read the memory location which is guarded by the location-based memory fence. In this work, we describe a hardware mechanism for location-based memory fences, prove its correctness, and evaluate its potential performance benefit. Our experimental results are based on a software simulation of the proposed location-based memory fence, and thus expected to incur higher overhead than the proposed hardware mechanism would. Nevertheless, our software experiments show that applications can benefit from using location-based memory fences, but they do not scale as well in some cases, due to the software overhead. These results suggest that a hardware support for location-based memory fences is worth considering.

acm symposium on parallel algorithms and architectures | 2015

The Cilkprof Scalability Profiler

Tao B. Schardl; Bradley C. Kuszmaul; I-Ting Angelina Lee; William M. Leiserson; Charles E. Leiserson

Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprofs usefulness, we used Cilkprof to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprofs output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug which had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves theoretically using an amortization argument that Cilkprof incurs only constant overhead compared with the applications native serial running time. Cilkprof was implemented by compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.

acm symposium on parallel algorithms and architectures | 2012

Memory-mapping support for reducer hyperobjects

I-Ting Angelina Lee; Aamir Shafi; Charles E. Leiserson

Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach in which a hash table is used to map reducer objects to their local views. The overhead of the hash table is costly --- roughly 12x overhead compared to a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system which uses TLMM to implement a reducer mechanism that supports a reducer lookup using only two memory accesses and a predictable branch, which is roughly a 3x overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4x faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.

acm symposium on parallel algorithms and architectures | 2016

Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs

Robert Utterback; Kunal Agrawal; Jeremy T. Fineman; I-Ting Angelina Lee

If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values --- various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for series-parallel (SP) maintenance, which identifies whether two accesses are logically parallel. This paper describes an asymptotically optimal algorithm, called WSP-Order, for performing SP maintenance in programs with fork-join (or nested) parallelism. Given a fork-join program with T1 work and T∞ span, WSP-Order executes it while also maintaining SP relationships in O(T1/P + T∞) time on P processors, which is asymptotically optimal. At the heart of WSP-Order is a work-stealing scheduler designed specifically for SP maintenance. We also implemented C-RACER, a race-detector based on WSP-Order within the Cilk Plus runtime system, and evaluated its performance on five benchmarks. Empirical results demonstrate that when run sequentially, it performs almost as well as previous best sequential race detectors. More importantly, when run in parallel, it achieves almost as much speedup as the original program without race-detection.

acm symposium on parallel algorithms and architectures | 2015

Efficiently Detecting Races in Cilk Programs That Use Reducer Hyperobjects

I-Ting Angelina Lee; Tao B. Schardl

A multithreaded Cilk program that is ostensibly deterministic may nevertheless behave nondeterministically due to programming errors in the code. For a Cilk program that uses reducers, a general reduction mechanism supported in various Cilk dialects, such programming errors are especially challenging to debug, because the errors can expose the nondeterminism in how the Cilk runtime system manages a reducer. We identify two unique types of races that arise from incorrect use of reducers in a Cilk program and present two algorithms to catch them. The first algorithm, called the Peer-Set algorithm, detects view-read races, which occur when the program attempts to retrieve a value out of a reducer when the read may result a nondeterministic value, such as before all previously spawned subcomputations that might update the reducer have necessarily returned. The second algorithm, called the SP+ algorithm, detects determinacy races, instances where a write to memory location occurs logically in parallel with another access to that location, even when the raced-on memory locations relate to reducers. Both algorithms are provably correct, asymptotically efficient, and can be implemented efficiently in practice. We have implemented both algorithms in our prototype race detector, Rader. When running Peer-Set, Rader incurs a geometric-mean multiplicative overhead of 2.32 over running the benchmark without instrumentation. When running SP+, Rader incurs a geometric-mean multiplicative overhead of 16.76.

acm sigplan symposium on principles and practice of parallel programming | 2016

Work stealing for interactive services to meet target latency

Jing Li; Kunal Agrawal; Sameh Elnikety; Yuxiong He; I-Ting Angelina Lee; Chenyang Lu; Kathryn S. McKinley

Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas optimizing parallel programs and distributed server systems have historically focused on average latency and throughput, the primary metric for interactive applications is instead consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work-stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests. We design a new adaptive work stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit parallelism of large requests. We implement this approach in the Intel Thread Building Block (TBB) library and evaluate it on real-world workloads and synthetic workloads. The tail-control policy substantially reduces the number of requests exceeding the desired target latency and delivers up to 58% relative improvement over various baseline policies. This generalization of work stealing for multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.

parallel computing | 2015

On-the-Fly Pipeline Parallelism

I-Ting Angelina Lee; Charles E. Leiserson; Tao B. Schardl; Zhunping Zhang; Jim Sukha

Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P+O(T∞+lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.

acm sigplan symposium on principles and practice of parallel programming | 2018

Efficient parallel determinacy race detection for two-dimensional dags

Yifan Xu; I-Ting Angelina Lee; Kunal Agrawal

A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program since they lead to non-deterministic program behavior --- different schedules of the program can lead to different results. Most prior work on detecting these races focuses on a subclass of programs with fork-join parallelism. This paper presents a race-detection algorithm, 2D-Order, for detecting races in a more general class of programs, namely programs whose dependence structure can be represented as planar dags embedded in 2D grids. Such dependence structures arise from programs that use pipelined parallelism or dynamic programming recurrences. Given a computation with T1 work and T∞ span, 2D-Order executes it while also detecting races in O(T1/P + T∞) time on P processors, which is asymptotically optimal. We also implemented PRacer, a race-detection algorithm based on 2D-Order for Cilk-P, which is a language for expressing pipeline parallelism. Empirical results demonstrate that PRacer incurs reasonable overhead and exhibits scalability similar to the baseline (executions without race detection) when running on multiple cores.

acm symposium on parallel algorithms and architectures | 2010

Brief announcement: serial-parallel reciprocity in dynamic multithreaded languages

Kunal Agrawal; I-Ting Angelina Lee; Jim Sukha

In dynamically multithreaded platforms that employ work stealing, there appears to be a fundamental tradeoff between providing provably good time and space bounds and supporting SP-reciprocity, the property of allowing arbitrary calling between parallel and serial code, including legacy serial binaries. Many known dynamically multithreaded platforms either fail to support SP-reciprocity or sacrifice on the provable time and space bounds that an efficient work-stealing scheduler could otherwise guarantee. We describe PR-Cilk, a design of a runtime system that supports SP-reciprocity in PR-Cilk and provides provable bounds on time and space. In order to maintain the space bound, PR-Cilk uses subtree-restricted work stealing. We show that with subtree-restricted work stealing, PR-Cilk provides the same guarantee on stack space usage as ordinary Cilk. The completion time guaranteed by PR-Cilk is slightly worse than ordinary Cilk. Nevertheless, if the number of times a C function calls a Cilk function is small, or if each Cilk function called by a C function is sufficiently parallel, PR-Cilk still guarantees linear speedup.

Explore More